Progress/Plan for optimizing NWChem for Xeon Phi Knights Landing?


Just Got Here
I'm developing HPC system products based on Knights Landing (KNL), the next-generation Xeon Phi product from Intel.
To achieve high computational performance on KNL, it is very important to optimize application software for KNL, so I am investigating the progress of such optimizations in various HPC applications.

I'd like to help customers implement HPC systems in a planned way by providing information on the development progress of KNL-related products.

Would you please let me know your progress on, or plans for, optimizing NWChem for KNL?

Best regards,
HIROMASA

Gets Around
Hiromasa,


Eric Bylaska

Just Got Here
Bylaska,

Thank you very much for sharing the NWChem upgrade status in detail.
I wish you great success in your work.

Thank you,
HIROMASA

Gets Around
Because KNL is binary-compatible with Xeon processors in the Haswell/Broadwell generation, every feature should be functional, although I cannot remember if I have run the QA suite or not. I was able to build and run NWChem on the first attempt with pre-production hardware.

As Eric said, PNNL, LBNL and Intel are collaborating [1,2] on the optimization of NWChem, especially NWPW and TCE. If you need performance numbers for business purposes, please contact me privately (I work for Intel).

[1] https://software.intel.com/en-us/articles/ipcc-at-environmental-molecular-sciences-laborat...
[2] https://software.intel.com/en-us/articles/intel-parallel-computing-center-at-lawrence-berk...

Clicked A Few Times
Hi All,

May I know whether there is any update on NWChem development for KNL (the self-hosted, socket-based version)?

We have a 144-node KNL system, and I managed to compile a copy of NWChem with ARMCI-MPI using the Intel compiler and Intel MPI; OpenMP is enabled and MIC-AVX512 was added to the compiler flags.

When I run a C240 DFT benchmark job across the full system:

(1) I ran 144 MPI tasks (1 per node) with OMP_NUM_THREADS=64; it works, but is not impressively fast. I noticed that no more than 1 core in each node (or socket) was ever used (I suppose hybrid OMP parallelization has not been implemented for DFT yet; please correct me if I am wrong).

(2) I ran the same job with 9216 MPI tasks instead (1 task per core), but it just hangs after printing the basis set information:
Summary of "ao basis" -> "ao basis" (cartesian)
------------------------------------------------------------------------------
Tag              Description                     Shells  Functions and Types
---------------- ------------------------------ ------  ---------------------
C                user specified                       6  15   3s2p1d
(hang here)


Can anyone suggest the best way to build NWChem on a KNL system connected with EDR InfiniBand, i.e. what is the best choice of ARMCI_NETWORK, what combination of MKL, LAPACK and ScaLAPACK, and how to use the MIC-AVX512 instruction set? In addition, please suggest the best way to run NWChem on this system (pure MPI or hybrid MPI-OMP?).

Thanks a lot!

PS: I just found this section of the compilation instructions, but more hints are welcome. Thanks!
http://nwchemgit.github.io/index.php/Compiling_NWChem#How-to:_Intel_Xeon_Phi


~ Dominic Chien

Gets Around
Quote:Chiensh Mar 15th 11:29 am

May I know whether there is any update on NWChem development for KNL (the self-hosted, socket-based version)?


All the work I know of pertains to NWPW and CC modules.

Quote:Chiensh Mar 15th 11:29 am

We have a 144-node KNL system and I managed to compiled a copy of NWChem with ARMCI-MPi using intel compiler and impi, OpenMP is enabled and MIC-AVX512 was added in the compilers flags.


Please also try ARMCI_NETWORK=MPI-PR.

When you run with ARMCI-MPI, please set ARMCI_USE_WIN_ALLOCATE=1 in your environment or manually configure with --enable-win-allocate.

If you use ARMCI-MPI, it often helps to use http://www.mcs.anl.gov/project/casper/ as well. Write to the Casper user list for assistance if necessary. We will be updating the NWChem-related docs very soon.
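For reference, a minimal csh build-environment sketch along these lines (written in the same `setenv` style as the compilation docs; the exact values are illustrative, so please check them against the NWChem compilation instructions before relying on them):

```shell
# Illustrative KNL build settings (csh). The MPI-PR variant is shown;
# for an ARMCI-MPI build, point the build at an ARMCI-MPI install
# configured with --enable-win-allocate, or set ARMCI_USE_WIN_ALLOCATE=1
# in the environment at run time, as noted above.
setenv NWCHEM_TARGET LINUX64
setenv USE_MPI y
setenv USE_OPENMP 1
setenv ARMCI_NETWORK MPI-PR
# Run-time setting for ARMCI-MPI builds:
setenv ARMCI_USE_WIN_ALLOCATE 1
```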

Quote:Chiensh Mar 15th 11:29 am

When I run a C240 dft benchmark job across the full system,

(1) I ran 144 MPI tasks (1 per node) with OMP_NUM_THREADS=64; it works, but is not impressively fast. I noticed that no more than 1 core in each node (or socket) was ever used (I suppose hybrid OMP parallelization has not been implemented for DFT yet; please correct me if I am wrong).


This is only going to use OpenMP in MKL calls and the benefit will be small. Please run with OMP_NUM_THREADS=1 unless you use NWPW or CC.

Quote:Chiensh Mar 15th 11:29 am

(2) I ran the same job with 9216 MPI tasks instead (1 task per core), but it just hangs after printing the basis set information:
Summary of "ao basis" -> "ao basis" (cartesian)
------------------------------------------------------------------------------
Tag              Description                     Shells  Functions and Types
---------------- ------------------------------ ------  ---------------------
C                user specified                       6  15   3s2p1d
(hang here)


C240 isn't big enough for that many MPI ranks. Try running 32 ranks per node on 1-16 nodes.

In general, it is imprudent to start with full-machine jobs. Run on one node and scale up slowly.
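To make the scaling advice concrete, here is a csh run-script sketch (the input file name, binary path, and Intel MPI `mpirun` flags are illustrative and will need adjusting for your scheduler):

```shell
# Run the same input on 1, 2, 4, 8 and 16 nodes at 32 ranks per node,
# with OpenMP disabled as suggested for DFT.
setenv OMP_NUM_THREADS 1
foreach nodes (1 2 4 8 16)
    @ ranks = $nodes * 32
    mpirun -ppn 32 -np $ranks ./nwchem c240.nw >& c240.${nodes}nodes.out
end
```

Note that with ARMCI_NETWORK=MPI-PR one rank per node is dedicated as a progress/data server, so the number of compute ranks per node is one fewer than the number launched.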

Quote:Chiensh Mar 15th 11:29 am

Can anyone suggest the best way to build NWChem on a KNL system connected with EDR InfiniBand, i.e. what is the best choice of ARMCI_NETWORK, what combination of MKL, LAPACK and ScaLAPACK, and how to use the MIC-AVX512 instruction set? In addition, please suggest the best way to run NWChem on this system (pure MPI or hybrid MPI-OMP?).


It looks fine. I think we found MPI-PR was better for EDR IB, but only by a small amount.

Clicked A Few Times
Thank you, Jeff.
Quote:Jhammond Mar 15th 9:53 pm

All the work I know of pertains to NWPW and CC modules.


That is also what I expected, and I understand that these two methods are very important for NWChem and have the highest priority for porting to this architecture. However, HF and DFT SCF are fundamental to almost all calculations, so I hope hybrid OMP/MPI parallelization will be available ASAP.

Quote:Jhammond Mar 15th 9:53 pm

Please also try ARMCI_NETWORK=MPI-PR.

When you run with ARMCI-MPI, please set ARMCI_USE_WIN_ALLOCATE=1 in your environment or manually configure with --enable-win-allocate.

If you use ARMCI-MPI, it often helps to use http://www.mcs.anl.gov/project/casper/ as well. Write to the Casper user list for assistance if necessary. We will be updating the NWChem-related docs very soon.

Thank you; I will try all these network settings and report my findings later.

Quote:Jhammond Mar 15th 9:53 pm

This is only going to use OpenMP in MKL calls and the benefit will be small. Please run with OMP_NUM_THREADS=1 unless you use NWPW or CC.

OK

Quote:Jhammond Mar 15th 9:53 pm

C240 isn't big enough for that many MPI ranks. Try running 32 ranks per node on 1-16 nodes.

In general, it is imprudent to start with full-machine jobs. Run on one node and scale up slowly.

I see. I had already finished the smaller calculations and expected it to be inefficient at such a large MPI rank count, but I did not expect it to simply hang...

Quote:Jhammond Mar 15th 9:53 pm

It looks fine. I think we found MPI-PR was better for EDR IB, but only by a small amount.




I also noticed that the current compilation documentation for KNL is confusing (http://nwchemgit.github.io/index.php/Compiling_NWChem#How-to:_Intel_Xeon_Phi):
...
This section describes both the newer KNL and older KNC hardware, in reverse chronological order.
Compiling NWChem on self-hosted Intel Xeon Phi Knights Landing processors
NWChem 6.6 (and later versions) support OpenMP threading, which is essential to obtaining good performance with NWChem on Intel Xeon Phi many-core processors.
As of November 2016, the development version of NWChem contains threading support in the TCE coupled-cluster codes (primarily non-iterative triples in e.g. CCSD(T)), semi-direct CCSD(T), and plane-wave DFT (i.e. NWPW).
...

The document states that these options will enable the CCSD(T) and NWPW optimizations on KNL:
% setenv USE_OPENMP 1
% setenv USE_F90_ALLOCATABLE T
% setenv USE_FASTMEM T

However, enabling the USE_F90_ALLOCATABLE flag in the stable NWChem 6.6 release causes a compilation error:

ccsd_t2_8.F(489): error #6404: This name does not have a type, and must have an explicit type.   [L_A]
if (e_a) call errquit("MA pops a",l_a,MA_ERR)
----------------------------------------^
ccsd_t2_8.F(490): error #6404: This name does not have a type, and must have an explicit type. [L_T]
if (e_t) call errquit("MA pops t",l_t,MA_ERR)
----------------------------------------^
compilation aborted for ccsd_t2_8.F (code 1)
make[3]: *** [/home/users/astar/ihpc/chiensh/nwchem-6.6/lib/LINUX64/libtce.a(ccsd_t2_8.o)] Error 1
make[3]: *** Waiting for unfinished jobs....
=========================================================

because l_a and l_t are not declared when USE_F90_ALLOCATABLE is enabled:
...
473 #ifdef USE_F90_ALLOCATABLE
474 deallocate(f_a,stat=e_a)
475 deallocate(f_b,stat=e_b)
476 deallocate(f_c,stat=e_c)
477 # ifndef USE_LOOPS_NOT_DGEMM
478 deallocate(f_t,stat=e_t)
479 # endif
480 #else
481 # ifndef USE_LOOPS_NOT_DGEMM
482 e_t=.not.MA_POP_STACK(l_t)
483 # else
484 l_t=-12345
485 e_t=.false.
486 # endif
487 e_a=.not.MA_chop_stack(l_a)
488 #endif
489 if (e_a) call errquit("MA pops a",l_a,MA_ERR)
490 if (e_t) call errquit("MA pops t",l_t,MA_ERR)
491 RETURN
492 END
...

Gets Around
Quote:Chiensh Mar 20th 7:29 am
Thank you, Jeff.
Quote:Jhammond Mar 15th 9:53 pm

All the work I know of pertains to NWPW and CC modules.

That is also what I expected, and I understand that these two methods are very important for NWChem and have the highest priority for porting to this architecture. However, HF and DFT SCF are fundamental to almost all calculations, so I hope hybrid OMP/MPI parallelization will be available ASAP.

We understand this. However, SCF calculations bottleneck on atomic integrals. The NWChem atomic-integral library is fast on standard server hardware (e.g. Xeon), but it is neither vectorized nor threaded. It is not even thread-safe, so we either need to rewrite most of that code or refactor NWChem to use another atomic-integral library. Neither of these efforts is easy.

Quote:Chiensh Mar 20th 7:29 am

Quote:Jhammond Mar 15th 9:53 pm

C240 isn't big enough for that many MPI ranks. Try running 32 ranks per node on 1-16 nodes.
In general, it is imprudent to start with full-machine jobs. Run on one node and scale up slowly.

I see. I had already finished the smaller calculations and expected it to be inefficient at such a large MPI rank count, but I did not expect it to simply hang...


It's possible that it was just running ridiculously slowly. In any case, if you scaled up slowly, you already know where the optimal number of nodes is.

Quote:Chiensh Mar 20th 7:29 am

I also noticed that the current compilation documentation for KNL is confusing (http://nwchemgit.github.io/index.php/Compiling_NWChem#How-to:_Intel_Xeon_Phi):
...
This section describes both the newer KNL and older KNC hardware, in reverse chronological order.
Compiling NWChem on self-hosted Intel Xeon Phi Knights Landing processors
NWChem 6.6 (and later versions) support OpenMP threading, which is essential to obtaining good performance with NWChem on Intel Xeon Phi many-core processors.
As of November 2016, the development version of NWChem contains threading support in the TCE coupled-cluster codes (primarily non-iterative triples in e.g. CCSD(T)), semi-direct CCSD(T), and plane-wave DFT (i.e. NWPW).
...


Our documentation is not always perfect. What do you want to see changed here? I will fix it.

Quote:Chiensh Mar 20th 7:29 am

However, enabling the USE_F90_ALLOCATABLE flag in the stable NWChem 6.6 release causes a compilation error:
...
because l_a and l_t are not declared when USE_F90_ALLOCATABLE is enabled.


This is just a bug. It does not exist in the latest version of the code. Can you download the trunk version instead?

Clicked A Few Times
Thank you!

How can I get the trunk version? Can you give me the link? Thanks!
Quote:Jhammond Mar 20th 9:10 am


Our documentation is not always perfect. What do you want to see changed here? I will fix it.

Quote:Chiensh Mar 20th 7:29 am

However, enabling the USE_F90_ALLOCATABLE flag in the stable NWChem 6.6 release causes a compilation error:
...
because l_a and l_t are not declared when USE_F90_ALLOCATABLE is enabled.


This is just a bug. It does not exist in the latest version of the code. Can you download the trunk version instead?

Gets Around
http://nwchemgit.github.io/index.php/Developer#Downloading_from_and_Committing_to_the_NWChem...


Forum >> NWChem's corner >> General Topics