Compiling nwchem-6.3 in a contemporary HPC with Xeon PHI

Comparing armci, comex, threading and PHI
After the QA tests (which I recommend comparing by hand; sometimes a reported "Failure" is a false alarm), I ran a small, straightforward benchmark on a single node, and I find the results surprising.

Average timings for the "armci", "armci+openmp" and "comex" libraries, all run with exactly the same job script: an SCF + DFT energy job on a medium-sized system. 1 node (40 processors, 64 GB RAM), 10 processors used, OMP_NUM_THREADS=4 where applicable.
armci                   : Total times  cpu:     6401.5s     wall:     6414.5s
armci + openmp          : Total times  cpu:     6649.3s     wall:     6382.4s
comex                   : Total times  cpu:    31301.5s     wall:    31351.1s
armci + openmp with phi : Total times  cpu:     6610.3s     wall:     6352.4s
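
To make the comparison easier to read, here is a minimal Python sketch that parses the four "Total times" lines above and computes two ratios per build: cpu/wall, and wall time relative to plain armci. The function names (parse_timings, summarize) are hypothetical helpers, not part of NWChem. Reading cpu/wall as a threading indicator assumes the reported cpu time includes the time of all threads in the reporting process, which I have not verified; under that assumption, a ratio close to 1 would be consistent with little or no threading.

```python
import re

# The four timing lines from the benchmark, copied from the post.
REPORT = """\
armci                   : Total times  cpu:     6401.5s     wall:     6414.5s
armci + openmp          : Total times  cpu:     6649.3s     wall:     6382.4s
comex                   : Total times  cpu:    31301.5s     wall:    31351.1s
armci + openmp with phi : Total times  cpu:     6610.3s     wall:     6352.4s
"""

# Label, then "Total times", then the cpu and wall values in seconds.
LINE = re.compile(
    r"^(?P<label>.+?)\s*:\s*Total times\s+"
    r"cpu:\s*(?P<cpu>[\d.]+)s\s+wall:\s*(?P<wall>[\d.]+)s"
)


def parse_timings(text):
    """Return {label: (cpu_seconds, wall_seconds)} for each matching line."""
    rows = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rows[match.group("label")] = (
                float(match.group("cpu")),
                float(match.group("wall")),
            )
    return rows


def summarize(rows, baseline="armci"):
    """Compute cpu/wall and wall time relative to the baseline build."""
    base_wall = rows[baseline][1]
    return {
        label: {
            "cpu_over_wall": cpu / wall,
            "wall_vs_baseline": wall / base_wall,
        }
        for label, (cpu, wall) in rows.items()
    }


if __name__ == "__main__":
    for label, stats in summarize(parse_timings(REPORT)).items():
        print(
            f"{label:<24} cpu/wall = {stats['cpu_over_wall']:.3f}   "
            f"wall vs armci = {stats['wall_vs_baseline']:.2f}x"
        )
```

With these numbers the comex wall time comes out a little under 5 times the armci one, and the cpu/wall ratio of every build stays within a few percent of 1.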

1. Nothing is offloaded to the Phi card.
2. Sampling the process for 15 minutes showed no threading. At least dgemm should have threaded within that period.
3. The COMEX binary on a single node is extremely slow (~5x slower than armci).
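
For anyone who wants to repeat the threading check in point 2, one possible way to sample a running process on Linux is to read the "Threads:" field of /proc/<pid>/status at intervals. This is only a sketch and not necessarily how the sampling above was done; thread_count and sample_threads are hypothetical helper names, and the PID must be supplied by the user. Note that a thread count above 1 does not prove that OpenMP or MKL threads are busy, since communication libraries may start helper threads of their own.

```python
import time
from pathlib import Path


def thread_count(status_text):
    """Extract the thread count from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no 'Threads:' field found")


def sample_threads(pid, samples=5, interval=1.0):
    """Read the thread count of a process several times (Linux only)."""
    counts = []
    for _ in range(samples):
        status = Path(f"/proc/{pid}/status").read_text()
        counts.append(thread_count(status))
        time.sleep(interval)
    return counts
```

Calling sample_threads(pid) with the PID of one nwchem process returns a list of thread counts, one per sample; a list that stays constant during a dgemm-heavy phase would match the observation above.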

I really don't know what went wrong. The binary seems to contain correct references, but are they called?

Is it normal for COMEX to be this slow?

Environment settings for Intel MIC automatic offload
Dr. O. Baris Malcioglu,
University of Liege,
Bât. B5 Physique de la matière condensée
allée du 6 Août 17
4000 Liège 1