Progress/Plan for optimizing NWChem to Xeon Phi Knights Landing?


Thank you, Jeff.
Quote:Jhammond Mar 15th 9:53 pm

All the work I know of pertains to NWPW and CC modules.


That is also what I expected, and I understand that these two modules are important enough to NWChem to get the highest priority for porting to this architecture. However, HF and DFT SCF are fundamental to almost all calculations, so I hope OpenMP/MPI hybrid parallelization will be available for them as soon as possible.

Quote:Jhammond Mar 15th 9:53 pm

Please also try ARMCI_NETWORK=MPI-PR.

When you run with ARMCI-MPI, please set ARMCI_USE_WIN_ALLOCATE=1 in your environment or manually configure with --enable-win-allocate.

If you use ARMCI-MPI, it often helps to use http://www.mcs.anl.gov/project/casper/ as well. Write Casper user list for assistance if necessary. We will be updating the docs related to NWChem very soon.

Thank you; I will try all of these network settings and report my findings later. What I plan to try is sketched below.
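If I read the build instructions correctly, the two configurations suggested above would look roughly like this (the ARMCI-MPI install path is a placeholder for my own setup, and the ARMCI_NETWORK=ARMCI / EXTERNAL_ARMCI_PATH pairing for selecting ARMCI-MPI is my assumption):

For the MPI-PR build:
% setenv ARMCI_NETWORK MPI-PR

For the ARMCI-MPI build, together with the run-time setting mentioned above:
% setenv ARMCI_NETWORK ARMCI
% setenv EXTERNAL_ARMCI_PATH /path/to/armci-mpi/install
% setenv ARMCI_USE_WIN_ALLOCATE 1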

Quote:Jhammond Mar 15th 9:53 pm

This is only going to use OpenMP in MKL calls and the benefit will be small. Please run with OMP_NUM_THREADS=1 unless you use NWPW or CC.

OK
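For the HF/DFT runs I will simply set this in the job script before launching, and only raise it when testing NWPW or the CC codes:

% setenv OMP_NUM_THREADS 1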

Quote:Jhammond Mar 15th 9:53 pm

C240 isn't big enough for that many MPI ranks. Try running 32 ranks per node on 1-16 nodes.

In general, it is imprudent to start with full-machine jobs. Run on one node and scale up slowly.

I see. I had already finished the smaller calculations and expected the job to be inefficient at such a large MPI rank count, but I did not expect it to simply hang. I will redo the scaling runs along the lines below.
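Roughly how I intend to launch the scaling study, assuming an Intel MPI / MPICH (Hydra) style launcher; the -ppn flag, the c240.nw input name, and the binary path are placeholders from my own setup, not something prescribed in the thread:

One node, 32 ranks:
% mpirun -np 32 -ppn 32 $NWCHEM_TOP/bin/LINUX64/nwchem c240.nw

then doubling the node count up to 16 nodes (512 ranks):
% mpirun -np 512 -ppn 32 $NWCHEM_TOP/bin/LINUX64/nwchem c240.nw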

Quote:Jhammond Mar 15th 9:53 pm

It looks fine. I think we found MPI-PR was better for EDR IB but only a small amount.




I also noticed that the current build documentation for KNL is confusing (http://nwchemgit.github.io/index.php/Compiling_NWChem#How-to:_Intel_Xeon_Phi):
...
This section describes both the newer KNL and older KNC hardware, in reverse chronological order.
Compiling NWChem on self-hosted Intel Xeon Phi Knights Landing processors
NWChem 6.6 (and later versions) support OpenMP threading, which is essential to obtaining good performance with NWChem on Intel Xeon Phi many-core processors.
As of November 2016, the development version of NWChem contains threading support in the TCE coupled-cluster codes (primarily non-iterative triples in e.g. CCSD(T)), semi-direct CCSD(T), and plane-wave DFT (i.e. NWPW).
...

The document states that these options enable the KNL optimizations for the coupled-cluster and NWPW codes:
% setenv USE_OPENMP 1
% setenv USE_F90_ALLOCATABLE T
% setenv USE_FASTMEM T

However, enabling the USE_F90_ALLOCATABLE flag with the stable NWChem 6.6 release causes a compilation error:

ccsd_t2_8.F(489): error #6404: This name does not have a type, and must have an explicit type.   [L_A]
if (e_a) call errquit("MA pops a",l_a,MA_ERR)
----------------------------------------^
ccsd_t2_8.F(490): error #6404: This name does not have a type, and must have an explicit type. [L_T]
if (e_t) call errquit("MA pops t",l_t,MA_ERR)
----------------------------------------^
compilation aborted for ccsd_t2_8.F (code 1)
make[3]: *** [/home/users/astar/ihpc/chiensh/nwchem-6.6/lib/LINUX64/libtce.a(ccsd_t2_8.o)] Error 1
make[3]: *** Waiting for unfinished jobs....
=========================================================

because l_a and l_t are not declared when USE_F90_ALLOCATABLE is enabled: the errquit calls at lines 489-490 still reference the MA handles, even though that branch uses allocatable arrays and deallocate status codes instead. A possible workaround is sketched after the excerpt.
...
473 #ifdef USE_F90_ALLOCATABLE
474 deallocate(f_a,stat=e_a)
475 deallocate(f_b,stat=e_b)
476 deallocate(f_c,stat=e_c)
477 # ifndef USE_LOOPS_NOT_DGEMM
478 deallocate(f_t,stat=e_t)
479 # endif
480 #else
481 # ifndef USE_LOOPS_NOT_DGEMM
482 e_t=.not.MA_POP_STACK(l_t)
483 # else
484 l_t=-12345
485 e_t=.false.
486 # endif
487 e_a=.not.MA_chop_stack(l_a)
488 #endif
489 if (e_a) call errquit("MA pops a",l_a,MA_ERR)
490 if (e_t) call errquit("MA pops t",l_t,MA_ERR)
491 RETURN
492 END
...
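The crudest workaround I can think of (just my own guess, not an official patch) is to guard the two errquit calls so the MA handles are only referenced when the MA code path is compiled. The downside is that the deallocate status is then not checked at all in the USE_F90_ALLOCATABLE branch; a proper fix would presumably report the stat values instead:

#ifndef USE_F90_ALLOCATABLE
! only the MA branch declares the handles l_a and l_t,
! so the checks are restricted to that branch
      if (e_a) call errquit("MA pops a",l_a,MA_ERR)
      if (e_t) call errquit("MA pops t",l_t,MA_ERR)
#endif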