SCF Performance for Different ARMCI Network on Socket-based KNL Cluster

Clicked A Few Times

3:39:35 AM PDT - Fri, Mar 31st 2017

Hi All,

I have been evaluating which ARMCI_NETWORK is most suitable for a socket-based KNL cluster. The processor type is KNL 7210 @ 1.30GHz, each node has 192GB of memory, and the OS is RedHat 7.3.

ARMCI_NETWORK=MPI-PR, OpenIB and ARMCI-MPI+CASPER have been evaluated. All 3 copies of NWChem were compiled using the same set of environment variables; the script is given at the bottom of this post. Intel Compiler v17.0.1 and Intel MPI Version 2017 Update 1 Build 20161016 were used for the compilation.

Initially I wanted to evaluate the performance of a TCE CCSD(T) calculation on the pentacene molecule that has been used as an NWChem benchmark (http://nwchemgit.github.io/images/Pentacene_ccsdt.nw). However, the SCF timing for ARMCI-MPI+Casper is much slower than for the other ARMCI networks, so I would like to check whether any part of my compilation or runtime settings is incorrect before going further to the CCSD level:

For MPI-PR

   iter       energy          gnorm     gmax       time
  ----- ------------------- --------- --------- --------
      1     -841.3354871523  2.54D-03  3.19D-04     49.8
      2     -841.3354832032  1.14D-03  1.43D-04     89.9
      3     -841.3354832406  1.57D-06  2.64D-07    113.7

  Time for solution =    111.5s

For OpenIB

   iter       energy          gnorm     gmax       time
  ----- ------------------- --------- --------- --------
      1     -841.3354871523  2.54D-03  3.19D-04     53.4
      2     -841.3354832032  1.14D-03  1.43D-04     95.9
      3     -841.3354832406  1.57D-06  2.64D-07    121.1

  Time for solution =    116.4s

For ARMCI-MPI

   iter       energy          gnorm     gmax       time
  ----- ------------------- --------- --------- --------
      1     -841.3354871523  2.54D-03  3.19D-04    236.9
      2     -841.3354832032  1.14D-03  1.43D-04    923.4
      3     -841.3354832406  1.58D-06  2.64D-07   1211.7

  Time for solution =   1263.4s
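
For a rough sense of scale, the three "Time for solution" values above can be turned into ratios against the MPI-PR run. This is only a small arithmetic sketch; the numbers are copied from the tables above and nothing here is measured anew.

```shell
#!/bin/bash
# "Time for solution" values copied from the three SCF runs above (seconds).
ratios=$(awk 'BEGIN {
    base = 111.5                                   # MPI-PR
    printf "OpenIB/MPI-PR    = %.2f\n", 116.4 / base
    printf "ARMCI-MPI/MPI-PR = %.2f\n", 1263.4 / base
}')
echo "$ratios"
```

With these inputs, OpenIB is within a few percent of MPI-PR, while the ARMCI-MPI run takes roughly 11 times as long.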

Thanks!

Regards,
Dominic

Note:
Here is some additional information for the runtime environment, input, compilation environment and the snapshot of the calculation outputs:

(1) The runtime environment was set by this script environment.sh

module load intel
module load impi/2017_Update_1

WORKDIR=/home/users/astar/ihpc/chiensh/nwtest
export EXE1=/home/users/astar/ihpc/chiensh/nwchem-mpipr/bin/LINUX64/nwchem
export EXE2=/home/users/astar/ihpc/chiensh/nwchem-openib/bin/LINUX64/nwchem
export EXE3=/home/users/astar/ihpc/chiensh/nwchem-armci-mpi/bin/LINUX64/nwchem
export CASPER=/home/users/astar/ihpc/chiensh/local/lib/libcasper.so

ulimit -s unlimited
#export OMP_STACKSIZE=32M
export I_MPI_PIN=1
#export I_MPI_DEBUG=4
unset I_MPI_DEBUG
export KMP_BLOCKTIME=1
#export KMP_AFFINITY=scatter,verbose
export KMP_AFFINITY=scatter
export ARMCI_USE_WIN_ALLOCATE=1

export OMP_NUM_THREADS=60
#export MKL_NUM_THREADS=2
export MKL_DYNAMIC=false
export OMP_NESTED=true

(2) The scf and tce sections of the benchmark input were modified as shown below

scf
vectors input pentacene.movecs
thresh 1e-4
#semidirect filesize 0 memsize 1000000000
end

tce
freeze atomic
ccsd(t)
thresh 1e-4
io ga
tilesize 50
attilesize 40
2EORB
2EMET 15
end

set tce:nts T
set tce:tceiop 2048
#set tce:writeint T
set tce:readint T
#set tce:writet T
set tce:readt T
set tce:xmem 1000
task tce energy

(3) Snapshot of the outputs
i. For ARMCI_NETWORK=MPI-PR

> mpirun -perhost 2                                            -n 40  $EXE1 ccsdt.nw

          Job information

          ---------------

   hostname        = r1i0n4

   program         = /home/users/astar/ihpc/chiensh/nwchem-mpipr/bin/LINUX64/nwchem

   date            = Fri Mar 31 17:38:54 2017

   compiled        = Fri_Mar_31_15:02:17_2017

   source          = /home/users/astar/ihpc/chiensh/nwchem-mpipr

   nwchem branch   = 6.6

   nwchem revision = 27746

   ga revision     = 10594

   input           = ccsdt.nw

   prefix          = pentacene.

   data base       = ./pentacene.db

   status          = startup

   nproc           =       20

   time left       =     -1s



          Memory information

          ------------------

   heap     =   65536000 doubles =    500.0 Mbytes

   stack    = ********** doubles =  90000.0 Mbytes

   global   = ********** doubles = 100000.0 Mbytes (distinct from heap & stack)

   total    = ********** doubles = 190500.0 Mbytes

   verify   = no

   hardfail = no

...

                                NWChem SCF Module

                                -----------------

 ao basis        = "ao basis"

 functions       =   378

 atoms           =    36

 closed shells   =    73

 open shells     =     0

 charge          =   0.00

 wavefunction    = RHF

 input vectors   = ./pentacene.movecs

 output vectors  = ./pentacene.movecs

 use symmetry    = F

 symmetry adapt  = F

...



----------------------------------------------

        Quadratically convergent ROHF

Convergence threshold     :          1.000E-04

Maximum no. of iterations :           30

Final Fock-matrix accuracy:          1.000E-07

----------------------------------------------



Integral file          = ./pentacene.aoints.00

Record size in doubles =    65536    No. of integs per rec  =    32766

Max. records in memory =     3937    Max. records in file   = 16943190

No. of bits per label  =       16    No. of bits per value  =       64



#quartets = 3.409D+07 #integrals = 3.020D+08 #direct =  0.0% #cached =100.0%



File balance: exchanges=    68  moved=  1370  time=   0.2



             iter       energy          gnorm     gmax       time

            ----- ------------------- --------- --------- --------

                1     -841.3354871523  2.54D-03  3.19D-04     49.8

                2     -841.3354832032  1.14D-03  1.43D-04     89.9

                3     -841.3354832406  1.57D-06  2.64D-07    113.7



      Final RHF  results

      ------------------

        Total SCF energy =   -841.335483240608

     One-electron energy =  -4113.018438040344

     Two-electron energy =   1774.091366579281

Nuclear repulsion energy =   1497.591588220455



       Time for solution =    111.5s

ii. For ARMCI_NETWORK=OpenIB

>mpirun -perhost 1                                            -n 20  $EXE2 ccsdt.nw

          Job information

          ---------------

   hostname        = r1i0n4

   program         = /home/users/astar/ihpc/chiensh/nwchem-openib/bin/LINUX64/nwchem

   date            = Fri Mar 31 16:08:37 2017

   compiled        = Thu_Mar_30_19:10:59_2017

   source          = /home/users/astar/ihpc/chiensh/nwchem-openib

   nwchem branch   = Development

   nwchem revision = 27746

   ga revision     = 10594

   input           = ccsdt.nw

   prefix          = pentacene.

   data base       = ./pentacene.db

   status          = startup

   nproc           =       20

   time left       =     -1s







          Memory information

          ------------------

   heap     =   65535998 doubles =    500.0 Mbytes

   stack    = ********** doubles =  90000.0 Mbytes

   global   = ********** doubles = 100000.0 Mbytes (distinct from heap & stack)

   total    = ********** doubles = 190500.0 Mbytes

   verify   = no

   hardfail = no

...

----------------------------------------------

        Quadratically convergent ROHF



Convergence threshold     :          1.000E-04

Maximum no. of iterations :           30

Final Fock-matrix accuracy:          1.000E-07

----------------------------------------------





Integral file          = ./pentacene.aoints.00

Record size in doubles =    65536    No. of integs per rec  =    32766

Max. records in memory =     3937    Max. records in file   = 16908036

No. of bits per label  =       16    No. of bits per value  =       64





#quartets = 3.409D+07 #integrals = 3.020D+08 #direct =  0.0% #cached =100.0%





File balance: exchanges=    77  moved=  1081  time=   0.1





             iter       energy          gnorm     gmax       time

            ----- ------------------- --------- --------- --------

                1     -841.3354871523  2.54D-03  3.19D-04     53.4

                2     -841.3354832032  1.14D-03  1.43D-04     95.9

                3     -841.3354832406  1.57D-06  2.64D-07    121.1





      Final RHF  results

      ------------------



        Total SCF energy =   -841.335483240607

     One-electron energy =  -4113.018438056843

     Two-electron energy =   1774.091366595781

Nuclear repulsion energy =   1497.591588220455



       Time for solution =    116.4s

iii. For ARMCI_NETWORK=ARMCI-MPI

>mpirun -perhost 1 -genv CSP_NG 1  -genv LD_PRELOAD $CASPER   -n 40  $EXE3 ccsdt.nw

          Job information

          ---------------



   hostname        = r1i0n4

   program         = /home/users/astar/ihpc/chiensh/nwchem-armci-mpi/bin/LINUX64/nwchem

   date            = Fri Mar 31 15:20:48 2017

   compiled        = Fri_Mar_31_14:18:43_2017

   source          = /home/users/astar/ihpc/chiensh/nwchem-armci-mpi

   nwchem branch   = 6.6

   nwchem revision = 27746

   ga revision     = 10594

   input           = ccsdt.nw

   prefix          = pentacene.

   data base       = ./pentacene.db

   status          = startup

   nproc           =       20

   time left       =     -1s







          Memory information

          ------------------

   heap     =   65535998 doubles =    500.0 Mbytes

   stack    = ********** doubles =  90000.0 Mbytes

   global   = ********** doubles = 100000.0 Mbytes (distinct from heap & stack)

   total    = ********** doubles = 190500.0 Mbytes

   verify   = no

   hardfail = no

 ...

 ----------------------------------------------

        Quadratically convergent ROHF



Convergence threshold     :          1.000E-04

Maximum no. of iterations :           30

Final Fock-matrix accuracy:          1.000E-07

----------------------------------------------





Integral file          = ./pentacene.aoints.00

Record size in doubles =    65536    No. of integs per rec  =    32766

Max. records in memory =     3937    Max. records in file   = 16943273

No. of bits per label  =       16    No. of bits per value  =       64





#quartets = 3.409D+07 #integrals = 3.020D+08 #direct =  0.0% #cached =100.0%





File balance: exchanges=    67  moved=  1250  time=   0.1





             iter       energy          gnorm     gmax       time

            ----- ------------------- --------- --------- --------

                1     -841.3354871523  2.54D-03  3.19D-04    236.9

                2     -841.3354832032  1.14D-03  1.43D-04    923.4

                3     -841.3354832406  1.58D-06  2.64D-07   1211.7





      Final RHF  results

      ------------------



        Total SCF energy =   -841.335483240606

     One-electron energy =  -4113.018438085244

     Two-electron energy =   1774.091366624183

Nuclear repulsion energy =   1497.591588220455



       Time for solution =   1263.4s

iv. All copies of NWChem were built with the same script, build.sh

#export NWCHEM_TOP=/home/users/astar/ihpc/chiensh/nwchem-mpipr
#export NWCHEM_TOP=/home/users/astar/ihpc/chiensh/nwchem-openib
export NWCHEM_TOP=/home/users/astar/ihpc/chiensh/nwchem-armci-mpi
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=ARMCI
export EXTERNAL_ARMCI_PATH=/home/users/astar/ihpc/chiensh/local
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LARGE_FILES=TRUE
export ENABLE_COMPONENT=yes
export USE_OPENMP=y
export LIB_DEFINES=" -DDFLT_TOT_MEM=391422080"
export DISABLE_GAMIRROR=y
export GA_STABLE=ga-5-5

export BLAS_SIZE=8
export BLASOPT="  -Wl,--start-group -L$MKLROOT/lib/intel64  -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -Wl,--end-group -liomp5 -lpthread -lm -ldl -qopenmp "
#export BLASOPT="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64 -lmkl_avx512 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread  -qopenmp -lpthread "
export LAPACK_LIB=$BLASOPT
#export LAPACK_LIB="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64 -lmkl_avx512 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread  -qopenmp -lpthread"

export USE_SCALAPACK=yes
export SCALAPACK_SIZE=8
export SCALAPACK="-L/$MKLROOT/lib/intel64  -lmkl_blacs_intelmpi_ilp64  -lmkl_scalapack_ilp64  "
export SCALAPACK_LIB="$SCALAPACK $BLASOPT"
#export SCALAPACK="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64 -lmkl_avx512 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64  "
#export SCALAPACK_LIB="$SCALAPACK -qopenmp -lpthread"

#unset SCALAPACK
#unset SCALAPACK_SIZE
#unset SCALAPACK_LIB

#export IB_HOME=/usr
#export IB_INCLUDE=$IB_HOME/include/infiniband
#export IB_LIB=$IB_HOME/lib64
#export IB_LIB_NAME="-libumad -libverbs -lpthread"

export MPI_INCLUDE="-I/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mpi/intel64/include "
export LIBMPI="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mpi/intel64/lib -lmpi -lmpifort  -lmpi_ilp64 -lpthread"

export USE_PYTHON64=y
export PYTHONVERSION=2.7
export PYTHONHOME=/usr
export PYTHONLIBTYPE=so

export NWCHEM_MODULES="all python"
#export NWCHEM_MODULES="all"
#export NWCHEM_MODULES=smallqm

export CCSDTQ=y
export CCSDTLR=y
export MRCC_METHODS=TRUE
#unset CCSDTQ
#unset CCSDTLR
#unset MRCC_METHODS

export CC=icc
export FC=ifort
export F77=ifort
export F90=ifort
export CXX=icpc
export MPICC=icc
export MPIFC=ifort
export FOPTIMIZE="-O3 -qopenmp -heap-arrays 64  -xMIC-AVX512 -fp-model fast=2 -no-prec-div -funroll-loops -unroll-aggressive "
export COPTIMIZE="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -funroll-loops "

v. ARMCI-MPI and Casper were built with these configure commands

a. ARMCI-MPI

 ../configure MPICC=mpiicc CFLAGS=-L/home/users/astar/ihpc/chiensh/local/lib  MPIEXEC=mpirun --enable-explicit-progress --prefix=/home/users/astar/ihpc/chiensh/local

b. CASPER

../configure --prefix=/home/users/astar/ihpc/chiensh/local CC=mpiicc

Forum Vet

11:51:25 AM PDT - Fri, Mar 31st 2017
Dominic,
Thanks a lot for your feedback.
You might want to try the following variables for MPI-PR, since moving as much MPI communication as possible to the eager protocol should decrease latency.

setenv I_MPI_EAGER_THRESHOLD 32768
setenv I_MPI_DAPL_DIRECT_COPY_THRESHOLD 32768
setenv I_MPI_FABRICS dapl
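
The setenv lines in the post above are csh syntax, while the environment.sh script earlier in the thread uses bash-style export. As a sketch, the same three settings in bash form would be written as follows; the variable names and values are taken unchanged from the post, and whether they help on this cluster is untested here.

```shell
#!/bin/bash
# bash equivalents of the csh setenv lines suggested for MPI-PR
export I_MPI_EAGER_THRESHOLD=32768
export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=32768
export I_MPI_FABRICS=dapl
```

These lines could sit next to the other I_MPI_* settings in environment.sh so that they are in place before mpirun is called.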

Clicked A Few Times

9:08:29 PM PDT - Fri, Mar 31st 2017
Thank you Edo,

(1) What is a suitable combination of OMP_NUM_THREADS and MKL_NUM_THREADS on a 64-core KNL system? I believe SCF uses thread-based parallelization through MKL. From what I observe, if I assign 1 MPI rank (for OpenIB) or 2 MPI ranks (for MPI-PR) to each node, no more than 20 cores are used, regardless of the MKL_NUM_THREADS setting (I set it to 60). In other words, I have to assign more MPI ranks to each node to fully utilize all 64 cores...
Update: I assigned 5+1 MPI ranks to each node, and SCF sped up roughly 5 times. Am I supposed to run 60 MPI ranks on each node?

(2) For ARMCI-MPI+CASPER, I observed that SCF (and the subsequent integral transformation for the CCSD calculation) ran with only 1 thread on each node; that is the reason why it is 4 times slower than OpenIB and MPI-PR. How do I enable multi-threaded parallelization for ARMCI-MPI+Casper?

(3) For OpenIB and MPI-PR, the SCF and CCSD calculations finished in half an hour on 20 KNL nodes, but the run then got stuck in the (T) calculation, which could not finish in 20 hours.

Iterations converged
CCSD correlation energy / hartree = -2.916922299620284
CCSD total energy / hartree = -844.252405540227414
Singles contributions
Doubles contributions
CCSD(T)
Using plain CCSD(T) code
Using sliced CCSD(T) code

If I disable the sliced (T) algorithm, I get a "ccsd_t: MA error sgl 17596287801", which I believe is due to insufficient local memory.

Three weeks ago, Thomas Dunning presented the NWChemEX project at a conference in Singapore, and he mentioned that they had achieved 1 PF/s on 20K nodes of Blue Waters with a more efficient (T) algorithm. Has this been implemented in the current version of NWChem?

Thanks!
~Dominic

Clicked A Few Times

11:31:54 PM PDT - Tue, Apr 4th 2017
CCSD can be calculated fairly efficiently on KNL, but I cannot get the (T) calculation to complete. Does anyone have the same problem?

Gets Around

10:18:05 AM PDT - Thu, Apr 6th 2017
The CCSD(T) results that Dunning presented likely correspond to http://pubs.acs.org/doi/abs/10.1021/ct500404c and I think they used the semidirect module rather than TCE.

Gets Around

10:24:34 AM PDT - Thu, Apr 6th 2017
I don't know what the issue with ARMCI-MPI+Casper is, but you should just use whatever works best. ARMCI-MPI exists as an alternative for cases where ARMCI isn't working well. If ARMCI works well, there is no reason not to use it. My guess is there is something wrong in the usage/configuration of ARMCI-MPI, Casper, or MPI, but I don't have time to debug it right now.

Gets Around

10:27:07 AM PDT - Thu, Apr 6th 2017
"mpirun -perhost 1 -genv CSP_NG 1" runs one MPI rank per node, does it not? I don't know how Casper even functions in that case. You need to launch N+G processes per node, where N is the number of application processes per node, and CSP_NG=G.

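The N+G rule in the post above can be sketched as a small launch helper. The values below are illustrative assumptions: NODES=20 matches the 20 KNL nodes mentioned earlier in the thread, and N=1 is just an example choice of application ranks per node. The script only prints the mpirun line instead of running it; $CASPER and $EXE3 are the variables defined in environment.sh.

```shell
#!/bin/bash
# Assumed values for illustration only.
NODES=20   # number of KNL nodes (from the thread)
N=1        # application processes per node (example value)
G=1        # Casper ghost processes per node, passed as CSP_NG

PERHOST=$((N + G))           # processes to launch per node: N + G
TOTAL=$((NODES * PERHOST))   # total processes passed to mpirun -n

# Print the launch line; \$CASPER and \$EXE3 are left unexpanded on purpose.
CMD="mpirun -perhost $PERHOST -genv CSP_NG $G -genv LD_PRELOAD \$CASPER -n $TOTAL \$EXE3 ccsdt.nw"
echo "$CMD"
```

With N=1 and G=1 this gives -perhost 2 and -n 40, rather than the -perhost 1 used in the original ARMCI-MPI run.
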
Clicked A Few Times

9:44:45 PM PDT - Fri, Apr 7th 2017

Thanks Jeff!
However, I believe the main reason my copy of ARMCI-MPI+CASPER is much slower is that it runs with only a single thread, while the other copies use at least 3 - 4 threads on each MPI rank; I don't know why OpenMP and MKL threading are not active in this case.

Thanks to Edo's help, I managed to get the (T) calculation to complete using the source code downloaded from the developer trunk (built with USE_F90_ALLOCATABLE=T and USE_KNL=y, which defines -DINTEL_64ALIGN in the compilation)
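
In the style of the build.sh script shown earlier, those two settings would presumably be added as exports before building. This is only a sketch of how they might look; the names and values are the ones quoted in this post, and their placement in build.sh is an assumption.

```shell
#!/bin/bash
# Extra build settings mentioned above for the developer-trunk build
# (assumed to go alongside the other exports in build.sh).
export USE_F90_ALLOCATABLE=T
export USE_KNL=y
```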

CCSD(T)

Using plain CCSD(T) code

Using sliced CCSD(T) code



CCSD[T]  correction energy / hartree =        -0.150973709276513

CCSD[T] correlation energy / hartree =        -3.067895958443244

CCSD[T] total energy / hartree       =      -844.403379199049482

CCSD(T)  correction energy / hartree =        -0.147995713401607

CCSD(T) correlation energy / hartree =        -3.064917962568339

CCSD(T) total energy / hartree       =      -844.400401203174624

Cpu & wall time / sec        66318.6        20408.5



Parallel integral file used   12303 records with       0 large values





Task  times  cpu:    68936.9s     wall:    22442.4s





                               NWChem Input Module

                                -------------------





 Summary of allocated global arrays

-----------------------------------

  No active global arrays

On the other hand, in the developer trunk, the updated subroutine grad_v_lr_loca in src/nwpw/pspw/lib/psp/psp.F fails to compile with ifort Version 17.0.1.132 when -O2/-O3 and -qopenmp are used together, due to an unknown compiler error:

ifort  -c -i8 -align -fpp -qopt-report-file=stderr -qopenmp -qopt-report-phase=openmp -qno-openmp-offload -fimf-arch-consistency=true -finline-limit=250 -O3  -unroll  -ip -xMIC-AVX512  -I.  -I/home/users/astar/ihpc/chiensh/nwchem-dev/src/include -I/home/users/astar/ihpc/chiensh/nwchem-dev/src/tools/install/include -DEXT_INT -DLINUX -DLINUX64 -DUSE_OPENMP  -DIFCV8 -DIFCLINUX -DINTEL_64ALIGN -DINTEL_64ALIGN -DSCALAPACK -DPARALLEL_DIAG -DUSE_F90_ALLOCATABLE psp.F

Intel(R) Advisor can now assist with vectorization and show optimization

  report messages with your source code.

See "https://software.intel.com/en-us/intel-advisor-xe" for details.





Begin optimization report for: PSP_PRJ_INDX_ALLOC_SW1A_SW2A



   Report from: OpenMP optimizations [openmp]



psp.F(684:7-684:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_:  OpenMP multithreaded code generation for MASTER was successful

psp.F(687:7-687:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_:  OpenMP multithreaded code generation for BARRIER was successful

psp.F(700:7-700:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_:  OpenMP multithreaded code generation for MASTER was successful

psp.F(704:7-704:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_:  OpenMP multithreaded code generation for BARRIER was successful

===========================================================================
ifort: error #10105: /home/users/app/intel/compilers_and_libraries_2017.1.132/linux/bin/intel64/fortcom: core dumped

ifort: warning #10102: unknown signal(-326317584)

Segmentation fault (core dumped)

The small modification shown below seems to fix the problem

c      integer ftmp(2)
      integer ftmp(2),ftemp
...
!$OMP DO
      do j=1,nion
         ftemp=ftmp(1)+3*(j-1)
         fion(1,j) = fion(1,j) + dbl_mb(ftemp)
         fion(2,j) = fion(2,j) + dbl_mb(ftemp+1)
         fion(3,j) = fion(3,j) + dbl_mb(ftemp+2)

c         fion(1,j) = fion(1,j) + dbl_mb(ftmp(1)+3*(j-1))
c         fion(2,j) = fion(2,j) + dbl_mb(ftmp(1)+3*(j-1)+1)
c         fion(3,j) = fion(3,j) + dbl_mb(ftmp(1)+3*(j-1)+2)
      end do
!$OMP END DO
