Progress/Plan for optimizing NWChem for Xeon Phi Knights Landing?


Quote:Chiensh Mar 15th 11:29 am

May I know whether there is any update on NWChem development for KNL (socket-based version)?


All the work I know of pertains to NWPW and CC modules.

Quote:Chiensh Mar 15th 11:29 am

We have a 144-node KNL system, and I managed to compile a copy of NWChem with ARMCI-MPI using the Intel compiler and impi. OpenMP is enabled and MIC-AVX512 was added to the compiler flags.


Please also try ARMCI_NETWORK=MPI-PR.

When you run with ARMCI-MPI, please set ARMCI_USE_WIN_ALLOCATE=1 in your environment or manually configure with --enable-win-allocate.
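As a minimal sketch of the run-time option, the variable can be exported in the job script before NWChem is launched. Where exactly it goes in your script is an assumption; only the variable name and value come from the advice above.

```shell
# Run-time setting for an NWChem binary built against ARMCI-MPI.
# Assumption: this line sits in the job script ahead of the MPI launch command.
export ARMCI_USE_WIN_ALLOCATE=1

# Print the value so it shows up in the job log.
echo "ARMCI_USE_WIN_ALLOCATE=${ARMCI_USE_WIN_ALLOCATE}"
```

The configure-time alternative (--enable-win-allocate) applies when building ARMCI-MPI itself and makes the environment variable unnecessary.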

If you use ARMCI-MPI, it often helps to use Casper (http://www.mcs.anl.gov/project/casper/) as well. Write to the Casper user list for assistance if necessary. We will be updating the docs related to NWChem very soon.

Quote:Chiensh Mar 15th 11:29 am

When I run a C240 dft benchmark job across the full system,

(1) I ran 144 MPI tasks (1 per node) with OMP_NUM_THREADS=64. It works, but it is not impressively fast. I noticed that no more than 1 core in each node (or socket) was ever used. (I suppose OMP/hybrid parallelization has not been implemented for DFT yet; please correct me if I am wrong.)


For DFT, this configuration only uses OpenMP inside MKL calls, so the benefit will be small. Please run with OMP_NUM_THREADS=1 unless you use NWPW or CC.
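In a job script, that recommendation amounts to one line. This is a sketch that assumes a DFT run; for NWPW or CC a larger thread count may be appropriate.

```shell
# DFT run: one OpenMP thread per MPI rank, parallelism comes from MPI ranks.
export OMP_NUM_THREADS=1

# Print the value so it shows up in the job log.
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```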

Quote:Chiensh Mar 15th 11:29 am

(2) I ran the same job with 9216 MPI tasks instead (1 task per core), but it just hangs after printing out the basis set information:
Summary of "ao basis" -> "ao basis" (cartesian)
------------------------------------------------------------------------------
Tag              Description                    Shells Functions and Types
---------------- ------------------------------ ------ ---------------------
C                user specified                      6 15 3s2p1d
(hang here)


C240 isn't big enough for that many MPI ranks. Try running 32 ranks per node on 1-16 nodes.

In general, it is imprudent to start with full-machine jobs. Run on one node and scale up slowly.
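The scale-up can be sketched as a small sweep that keeps 32 ranks per node and doubles the node count from 1 to 16. The script below is a dry run: it only prints the launch commands. It assumes Intel MPI's mpirun (-np for total ranks, -ppn for ranks per node); the binary name nwchem and the input file c240.nw are hypothetical placeholders.

```shell
#!/bin/sh
# Dry run of a scaling sweep: 32 MPI ranks per node on 1, 2, 4, 8 and 16 nodes.
# Assumptions: Intel MPI's mpirun options -np/-ppn; "nwchem" and "c240.nw"
# are placeholder names for your binary and input file.
RANKS_PER_NODE=32

print_sweep() {
    for NODES in 1 2 4 8 16; do
        TOTAL=$((NODES * RANKS_PER_NODE))
        # Print the command rather than running it, so it can be checked first.
        echo "mpirun -np ${TOTAL} -ppn ${RANKS_PER_NODE} nwchem c240.nw > c240_${NODES}nodes.out"
    done
}

print_sweep
```

Comparing wall times between consecutive steps shows where adding nodes stops paying off for this input.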

Quote:Chiensh Mar 15th 11:29 am

Can anyone suggest the best way to build NWChem on a KNL system connected with EDR IB, i.e. which ARMCI_NETWORK to choose, which combination of MKL, LAPACK and ScaLAPACK to use, and how to use the MIC-AVX512 instructions? In addition, please also suggest the best way to run NWChem on this system (i.e. pure MPI or hybrid MPI-OMP?).


It looks fine. I think we found MPI-PR was better for EDR IB, but only by a small amount.