Progress/Plan for optimizing NWChem for Xeon Phi Knights Landing?


Just Got Here
I'm developing HPC system products based on Knights Landing (KNL), the next-generation Xeon Phi product from Intel.
To achieve high computational performance on KNL, it is very important to optimize application software for KNL, so I am investigating the progress of such optimizations in various HPC applications.

I'd like to help customers implement HPC systems in a planned way by providing information on the development progress of KNL-related products.

Would you please let me know your progress on, or plans for, optimizing NWChem for KNL?

Best regards,
HIROMASA

Gets Around
Hiromasa,


Eric Bylaska

Just Got Here
Bylaska,

Thank you very much for sharing the NWChem upgrade status in detail.
I wish you great success in your work.

Thank you,
HIROMASA

Gets Around
Because KNL is binary-compatible with Xeon processors in the Haswell/Broadwell generation, every feature should be functional, although I cannot remember if I have run the QA suite or not. I was able to build and run NWChem on the first attempt with pre-production hardware.

As Eric said, PNNL, LBNL and Intel are collaborating [1,2] on the optimization of NWChem, especially NWPW and TCE. If you need performance numbers for business purposes, please contact me privately (I work for Intel).

[1] https://software.intel.com/en-us/articles/ipcc-at-environmental-molecular-sciences-laborat...
[2] https://software.intel.com/en-us/articles/intel-parallel-computing-center-at-lawrence-berk...

Clicked A Few Times
Hi All,

May I know whether there is any update on NWChem development for KNL (the self-hosted, socket-based version)?

We have a 144-node KNL system, and I managed to compile a copy of NWChem with ARMCI-MPI using the Intel compiler and Intel MPI; OpenMP is enabled and MIC-AVX512 was added to the compiler flags.

When I run a C240 DFT benchmark job across the full system:

(1) I ran 144 MPI tasks (1 per node) with OMP_NUM_THREADS=64; it works, but is not impressively fast. I noticed that no more than 1 core in each node (or socket) was ever used (I suppose hybrid OMP parallelization has not been implemented for DFT yet; please correct me if I am wrong).

(2) I ran the same job with 9216 MPI tasks instead (1 task per core), but it just hangs after printing the basis set information:
Summary of "ao basis" -> "ao basis" (cartesian)
------------------------------------------------------------------------------
Tag              Description                     Shells  Functions and Types
---------------- ------------------------------ ------  ---------------------
C                user specified                       6  15   3s2p1d
(hang here)


Can anyone suggest the best way to build NWChem on a KNL system connected with EDR InfiniBand, i.e. what is the best choice of ARMCI_NETWORK, what combination of MKL, LAPACK and ScaLAPACK, and how to use the MIC-AVX512 instruction set? In addition, please suggest the best way to run NWChem on this system (pure MPI or hybrid MPI-OMP?).

Thanks a lot!

PS: I just found this section of the compilation instructions, but more hints are welcome. Thanks!
http://nwchemgit.github.io/index.php/Compiling_NWChem#How-to:_Intel_Xeon_Phi


~ Dominic Chien

Gets Around
Quote:Chiensh Mar 15th 11:29 am

May I know whether there is any update on NWChem development for KNL (the self-hosted, socket-based version)?


All the work I know of pertains to NWPW and CC modules.

Quote:Chiensh Mar 15th 11:29 am

We have a 144-node KNL system and I managed to compiled a copy of NWChem with ARMCI-MPi using intel compiler and impi, OpenMP is enabled and MIC-AVX512 was added in the compilers flags.


Please also try ARMCI_NETWORK=MPI-PR.

When you run with ARMCI-MPI, please set ARMCI_USE_WIN_ALLOCATE=1 in your environment or manually configure with --enable-win-allocate.

If you use ARMCI-MPI, it often helps to use http://www.mcs.anl.gov/project/casper/ as well. Write to the Casper user list for assistance if necessary. We will be updating the NWChem-related docs very soon.
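For reference, a minimal csh build-environment sketch along these lines (written in the same `setenv` style as the compilation docs; the exact values are illustrative, so please check them against the NWChem compilation instructions before relying on them):

```shell
# Illustrative KNL build settings (csh). The MPI-PR variant is shown;
# for an ARMCI-MPI build, point the build at an ARMCI-MPI install
# configured with --enable-win-allocate, or set ARMCI_USE_WIN_ALLOCATE=1
# in the environment at run time, as noted above.
setenv NWCHEM_TARGET LINUX64
setenv USE_MPI y
setenv USE_OPENMP 1
setenv ARMCI_NETWORK MPI-PR
# Run-time setting for ARMCI-MPI builds:
setenv ARMCI_USE_WIN_ALLOCATE 1
```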

Quote:Chiensh Mar 15th 11:29 am

When I run a C240 dft benchmark job across the full system,

(1) I ran 144 MPI tasks (1 per node) with OMP_NUM_THREADS=64; it works, but is not impressively fast. I noticed that no more than 1 core in each node (or socket) was ever used (I suppose hybrid OMP parallelization has not been implemented for DFT yet; please correct me if I am wrong).


This is only going to use OpenMP in MKL calls and the benefit will be small. Please run with OMP_NUM_THREADS=1 unless you use NWPW or CC.

Quote:Chiensh Mar 15th 11:29 am

(2) I ran the same job with 9216 MPI tasks instead (1 task per core), but it just hangs after printing the basis set information:
Summary of "ao basis" -> "ao basis" (cartesian)
------------------------------------------------------------------------------
Tag              Description                     Shells  Functions and Types
---------------- ------------------------------ ------  ---------------------
C                user specified                       6  15   3s2p1d
(hang here)


C240 isn't big enough for that many MPI ranks. Try running 32 ranks per node on 1-16 nodes.

In general, it is imprudent to start with full-machine jobs. Run on one node and scale up slowly.
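To make the scaling advice concrete, here is a csh run-script sketch (the input file name, binary path, and Intel MPI `mpirun` flags are illustrative and will need adjusting for your scheduler):

```shell
# Run the same input on 1, 2, 4, 8 and 16 nodes at 32 ranks per node,
# with OpenMP disabled as suggested for DFT.
setenv OMP_NUM_THREADS 1
foreach nodes (1 2 4 8 16)
    @ ranks = $nodes * 32
    mpirun -ppn 32 -np $ranks ./nwchem c240.nw >& c240.${nodes}nodes.out
end
```

Note that with ARMCI_NETWORK=MPI-PR one rank per node is dedicated as a progress/data server, so the number of compute ranks per node is one fewer than the number launched.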

Quote:Chiensh Mar 15th 11:29 am

Can anyone suggest the best way to build NWChem on a KNL system connected with EDR InfiniBand, i.e. what is the best choice of ARMCI_NETWORK, what combination of MKL, LAPACK and ScaLAPACK, and how to use the MIC-AVX512 instruction set? In addition, please suggest the best way to run NWChem on this system (pure MPI or hybrid MPI-OMP?).


It looks fine. I think we found MPI-PR was better for EDR IB, but only by a small amount.

Clicked A Few Times
Thank you, Jeff.
Quote:Jhammond Mar 15th 9:53 pm

All the work I know of pertains to NWPW and CC modules.


That is also what I expected, and I understand that these two methods are very important for NWChem and have the highest priority for porting to this architecture. However, HF and DFT SCF are fundamental to almost all calculations, so I hope hybrid OMP/MPI parallelization will be available ASAP.

Quote:Jhammond Mar 15th 9:53 pm

Please also try ARMCI_NETWORK=MPI-PR.

When you run with ARMCI-MPI, please set ARMCI_USE_WIN_ALLOCATE=1 in your environment or manually configure with --enable-win-allocate.

If you use ARMCI-MPI, it often helps to use http://www.mcs.anl.gov/project/casper/ as well. Write to the Casper user list for assistance if necessary. We will be updating the NWChem-related docs very soon.

Thank you; I will try all these network settings and report my findings later.

Quote:Jhammond Mar 15th 9:53 pm

This is only going to use OpenMP in MKL calls and the benefit will be small. Please run with OMP_NUM_THREADS=1 unless you use NWPW or CC.

OK

Quote:Jhammond Mar 15th 9:53 pm

C240 isn't big enough for that many MPI ranks. Try running 32 ranks per node on 1-16 nodes.

In general, it is imprudent to start with full-machine jobs. Run on one node and scale up slowly.

I see. I had already finished the smaller calculations and expected it to be inefficient at such a large MPI rank count, but I did not expect it to simply hang...

Quote:Jhammond Mar 15th 9:53 pm

It looks fine. I think we found MPI-PR was better for EDR IB, but only by a small amount.




I also noticed that the current compilation documentation for KNL is confusing (http://nwchemgit.github.io/index.php/Compiling_NWChem#How-to:_Intel_Xeon_Phi):
...
This section describes both the newer KNL and older KNC hardware, in reverse chronological order.
Compiling NWChem on self-hosted Intel Xeon Phi Knights Landing processors
NWChem 6.6 (and later versions) support OpenMP threading, which is essential to obtaining good performance with NWChem on Intel Xeon Phi many-core processors.
As of November 2016, the development version of NWChem contains threading support in the TCE coupled-cluster codes (primarily non-iterative triples in e.g. CCSD(T)), semi-direct CCSD(T), and plane-wave DFT (i.e. NWPW).
...

The document states that these options will enable the CCSD(T) and NWPW optimizations on KNL:
% setenv USE_OPENMP 1
% setenv USE_F90_ALLOCATABLE T
% setenv USE_FASTMEM T

However, enabling the USE_F90_ALLOCATABLE flag in the stable NWChem 6.6 release causes a compilation error:

ccsd_t2_8.F(489): error #6404: This name does not have a type, and must have an explicit type.   [L_A]
if (e_a) call errquit("MA pops a",l_a,MA_ERR)
----------------------------------------^
ccsd_t2_8.F(490): error #6404: This name does not have a type, and must have an explicit type. [L_T]
if (e_t) call errquit("MA pops t",l_t,MA_ERR)
----------------------------------------^
compilation aborted for ccsd_t2_8.F (code 1)
make[3]: *** [/home/users/astar/ihpc/chiensh/nwchem-6.6/lib/LINUX64/libtce.a(ccsd_t2_8.o)] Error 1
make[3]: *** Waiting for unfinished jobs....
=========================================================

because l_a and l_t are not declared when USE_F90_ALLOCATABLE is enabled:
...
473 #ifdef USE_F90_ALLOCATABLE
474 deallocate(f_a,stat=e_a)
475 deallocate(f_b,stat=e_b)
476 deallocate(f_c,stat=e_c)
477 # ifndef USE_LOOPS_NOT_DGEMM
478 deallocate(f_t,stat=e_t)
479 # endif
480 #else
481 # ifndef USE_LOOPS_NOT_DGEMM
482 e_t=.not.MA_POP_STACK(l_t)
483 # else
484 l_t=-12345
485 e_t=.false.
486 # endif
487 e_a=.not.MA_chop_stack(l_a)
488 #endif
489 if (e_a) call errquit("MA pops a",l_a,MA_ERR)
490 if (e_t) call errquit("MA pops t",l_t,MA_ERR)
491 RETURN
492 END
...

Gets Around
Quote:Chiensh Mar 20th 7:29 am
Thank you, Jeff.
Quote:Jhammond Mar 15th 9:53 pm

All the work I know of pertains to NWPW and CC modules.

That is also what I expected, and I understand that these two methods are very important for NWChem and have the highest priority for porting to this architecture. However, HF and DFT SCF are fundamental to almost all calculations, so I hope hybrid OMP/MPI parallelization will be available ASAP.

We understand this. However, SCF calculations bottleneck on atomic integrals. The NWChem atomic-integral library is fast on standard server hardware (e.g. Xeon), but it is neither vectorized nor threaded. It is not even thread-safe, so we either need to rewrite most of that code or refactor NWChem to use another atomic-integral library. Neither of these efforts is easy.

Quote:Chiensh Mar 20th 7:29 am

Quote:Jhammond Mar 15th 9:53 pm

C240 isn't big enough for that many MPI ranks. Try running 32 ranks per node on 1-16 nodes.
In general, it is imprudent to start with full-machine jobs. Run on one node and scale up slowly.

I see. I had already finished the smaller calculations and expected it to be inefficient at such a large MPI rank count, but I did not expect it to simply hang...


It's possible that it was just running ridiculously slowly. In any case, if you scaled up slowly, you already know where the optimal number of nodes is.

Quote:Chiensh Mar 20th 7:29 am

I also noticed that the current compilation documentation for KNL is confusing (http://nwchemgit.github.io/index.php/Compiling_NWChem#How-to:_Intel_Xeon_Phi):
...
This section describes both the newer KNL and older KNC hardware, in reverse chronological order.
Compiling NWChem on self-hosted Intel Xeon Phi Knights Landing processors
NWChem 6.6 (and later versions) support OpenMP threading, which is essential to obtaining good performance with NWChem on Intel Xeon Phi many-core processors.
As of November 2016, the development version of NWChem contains threading support in the TCE coupled-cluster codes (primarily non-iterative triples in e.g. CCSD(T)), semi-direct CCSD(T), and plane-wave DFT (i.e. NWPW).
...


Our documentation is not always perfect. What do you want to see changed here? I will fix it.

Quote:Chiensh Mar 20th 7:29 am

However, enabling the USE_F90_ALLOCATABLE flag in the stable NWChem 6.6 release causes a compilation error:
...
because l_a and l_t are not declared when USE_F90_ALLOCATABLE is enabled.


This is just a bug. It does not exist in the latest version of the code. Can you download the trunk version instead?

Clicked A Few Times
Thank you!

How can I get the trunk version? Can you give me the link? Thanks!
Quote:Jhammond Mar 20th 9:10 am


Our documentation is not always perfect. What do you want to see changed here? I will fix it.

Quote:Chiensh Mar 20th 7:29 am

However, enabling the USE_F90_ALLOCATABLE flag in the stable NWChem 6.6 release causes a compilation error:
...
because l_a and l_t are not declared when USE_F90_ALLOCATABLE is enabled.


This is just a bug. It does not exist in the latest version of the code. Can you download the trunk version instead?

Gets Around
http://nwchemgit.github.io/index.php/Developer#Downloading_from_and_Committing_to_the_NWChem...


Forum >> NWChem's corner >> General Topics