After the QA tests (which I recommend comparing by hand, since a reported "Failure" is sometimes a false alarm), I ran a small, straightforward benchmark on a single node, and I find the results surprising.
Average timings for the "armci", "armci+openmp" and "comex" libraries, all run with exactly the same job script: an SCF + DFT energy job on a medium-sized system. 1 node (40 processors, 64 GB RAM), 10 processors used, OMP_NUM_THREADS=4 where applicable.
Timings:
armci: Total times cpu: 6401.5s wall: 6414.5s
armci + openmp: Total times cpu: 6649.3s wall: 6382.4s
comex: Total times cpu: 31301.5s wall: 31351.1s
armci + openmp with phi: Total times cpu: 6610.3s wall: 6352.4s
1. Nothing is offloaded to the Phi card.
2. Sampling the process for 15 minutes showed no threading. At the very least, dgemm should have run threaded within that period.
3. The COMEX binary on a single node is extremely slow (~5x slower than ARMCI).
I really don't know what went wrong. The binary seems to contain the correct references, but are they actually called?
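For anyone who wants to check the numbers behind the points above, here is a rough Python sketch that parses the "Total times" lines and computes the wall-time slowdown relative to the plain armci run. The function names are my own, and the cpu/wall ratio is only a crude hint: it assumes the reported cpu time would grow beyond the wall time if extra threads were busy, which I have not verified for this timer.

```python
import re

# Timing lines copied from the post.
TIMINGS = """\
armci :Total times cpu: 6401.5s wall: 6414.5s
armci + openmp :Total times cpu: 6649.3s wall: 6382.4s
comex :Total times cpu: 31301.5s wall: 31351.1s
armci + openmp with phi :Total times cpu: 6610.3s wall: 6352.4s
"""

# label, optional spaces, colon, then the "Total times cpu: Xs wall: Ys" part
LINE_RE = re.compile(
    r"^(?P<label>.+?)\s*:\s*Total times\s+"
    r"cpu:\s*(?P<cpu>[\d.]+)s\s+wall:\s*(?P<wall>[\d.]+)s"
)


def parse_timings(text):
    """Return {label: (cpu_seconds, wall_seconds)} for every matching line."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            results[match.group("label")] = (
                float(match.group("cpu")),
                float(match.group("wall")),
            )
    return results


def summarize(results, baseline="armci"):
    """Compute wall-time slowdown vs. the baseline and the cpu/wall ratio."""
    base_wall = results[baseline][1]
    return {
        label: {"slowdown": wall / base_wall, "cpu_over_wall": cpu / wall}
        for label, (cpu, wall) in results.items()
    }


if __name__ == "__main__":
    for label, stats in summarize(parse_timings(TIMINGS)).items():
        print(f"{label:26s} slowdown x{stats['slowdown']:.2f}  "
              f"cpu/wall {stats['cpu_over_wall']:.2f}")
```

With the figures above, the comex wall time comes out a little under 5 times the armci one, and the cpu/wall ratio stays close to 1 for every run, including the OpenMP ones.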
Is it normal for COMEX to be this slow?
Environment settings for Intel MIC automatic offload:
OFFLOAD_REPORT=2
MKL_MIC_ENABLE=1
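One thing worth ruling out is that these variables never reach the compute processes (for example if the job script or the MPI launcher does not export them). A minimal sketch of a check that could be run from inside the job, assuming Python is available on the node; the helper name and the expected values (taken from this post) are only for illustration:

```python
import os

# Values taken from the post; adjust to whatever the job script sets.
EXPECTED = {
    "OFFLOAD_REPORT": "2",
    "MKL_MIC_ENABLE": "1",
    "OMP_NUM_THREADS": "4",
}


def check_settings(env, expected=EXPECTED):
    """Return {name: (expected_value, actual_value_or_None, matches)}."""
    report = {}
    for name, wanted in expected.items():
        actual = env.get(name)
        report[name] = (wanted, actual, actual == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, actual, ok) in check_settings(os.environ).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {wanted!r}, found {actual!r} [{status}]")
```

This only confirms what the environment looks like to a process started the same way as the benchmark; it says nothing about whether the binary actually honours the settings.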