SCF Performance for Different ARMCI Network on Socket-based KNL Cluster



9:08:29 PM PDT - Fri, Mar 31st 2017
Thank you Edo,

(1) What is a suitable combination of OMP_NUM_THREADS and MKL_NUM_THREADS on a 64-core KNL system? I believe SCF uses thread-based parallelization through MKL. Based on my observations, if I assign 1 MPI rank (for OpenIB) or 2 MPI ranks (for MPI-PR) to each node, no more than 20 cores are used, regardless of the MKL_NUM_THREADS setting (I set it to 60). In other words, I have to assign more MPI ranks to each node to fully utilize all 64 cores.

Update: I assigned 5+1 MPI ranks to each node, and SCF sped up by roughly a factor of 5. Am I supposed to run 60 MPI ranks on each node?

(2) For ARMCI-MPI+Casper, I observed that SCF (and the subsequent integral transformation for the CCSD calculation) ran with only 1 thread on each node, which is why it is 4 times slower than OpenIB and MPI-PR. How do I enable multithreaded parallelization for ARMCI-MPI+Casper?

(3) For OpenIB and MPI-PR, the SCF and CCSD calculations finish in half an hour on 20 KNL nodes, but the run then gets stuck in the (T) calculation, which could not finish in 20 hours. The last output is:

Iterations converged
CCSD correlation energy / hartree = -2.916922299620284
CCSD total energy / hartree = -844.252405540227414
Singles contributions
Doubles contributions
CCSD(T)
Using plain CCSD(T) code
Using sliced CCSD(T) code

If I disable the sliced (T) algorithm, I get a "ccsd_t: MA error sgl 17596287801" error, which I believe is due to insufficient local memory.

Three weeks ago, Thomas Dunning presented the NWChemEx project at a conference in Singapore, and he mentioned that they have achieved 1 PF/s on 20K nodes of Blue Waters with a more efficient (T) algorithm. Has this been implemented in the current version of NWChem?

Thanks!
~Dominic
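For reference on question (1), here is a minimal sketch of the rank/thread arithmetic on one node. It only counts cores; it assumes one rank per node is set aside as a communication helper (as in the "5+1" layout above) and says nothing about which layout NWChem actually runs fastest. The function names `node_layouts` and `full_layouts` are hypothetical, made up for this illustration.

```python
def node_layouts(cores=64, reserved_ranks=1):
    """List (compute_ranks, threads_per_rank, idle_cores) for one node.

    Assumption: each reserved (helper) rank occupies one core, and every
    compute rank gets the same number of threads.
    """
    usable = cores - reserved_ranks
    layouts = []
    for ranks in range(1, usable + 1):
        threads = usable // ranks          # threads each compute rank can have
        idle = usable - ranks * threads    # cores left unused by this layout
        layouts.append((ranks, threads, idle))
    return layouts


def full_layouts(cores=64, reserved_ranks=1):
    """Keep only the layouts that leave no usable core idle."""
    return [(r, t) for r, t, idle in node_layouts(cores, reserved_ranks) if idle == 0]


if __name__ == "__main__":
    for ranks, threads in full_layouts(64, 1):
        print(f"{ranks:2d} compute ranks + 1 helper rank, {threads:2d} threads per rank")
```

With 64 cores and one helper rank, 5 compute ranks could take 12 threads each and would leave 3 cores idle, so the 5+1 layout only fills the node if each rank really uses its threads. Whether the thread count should go into OMP_NUM_THREADS, MKL_NUM_THREADS, or both depends on how the binary was built, which this sketch does not address.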