SCF Performance for Different ARMCI Networks on a Socket-based KNL Cluster


Clicked A Few Times
Hi All,

I have been evaluating which ARMCI_NETWORK is most suitable for a socket-based KNL cluster; the processor is the KNL 7210 @ 1.30GHz, each node has 192GB of memory, and the OS is RedHat 7.3.

ARMCI_NETWORK=MPI-PR, OpenIB, and ARMCI-MPI+Casper have been evaluated. All three copies of NWChem were compiled with the same set of environment variables; the build script is given at the bottom of this post. Intel Compiler v17.0.1 and Intel MPI Version 2017 Update 1 Build 20161016 were used for the compilation.

Initially I wanted to evaluate the performance of a TCE CCSD(T) calculation on the pentacene molecule used in the NWChem benchmarks (http://nwchemgit.github.io/images/Pentacene_ccsdt.nw). However, the SCF timing with ARMCI-MPI+Casper is much slower than with the other ARMCI networks, so I would like to check whether any part of my compilation or runtime settings is incorrect before going further to the CCSD level:


For MPI-PR
            iter       energy          gnorm     gmax       time
           ----- ------------------- --------- --------- --------
               1     -841.3354871523  2.54D-03  3.19D-04     49.8
               2     -841.3354832032  1.14D-03  1.43D-04     89.9
               3     -841.3354832406  1.57D-06  2.64D-07    113.7
Time for solution = 111.5s

For OpenIB
            iter       energy          gnorm     gmax       time
           ----- ------------------- --------- --------- --------
               1     -841.3354871523  2.54D-03  3.19D-04     53.4
               2     -841.3354832032  1.14D-03  1.43D-04     95.9
               3     -841.3354832406  1.57D-06  2.64D-07    121.1
Time for solution = 116.4s

For ARMCI-MPI
            iter       energy          gnorm     gmax       time
           ----- ------------------- --------- --------- --------
               1     -841.3354871523  2.54D-03  3.19D-04    236.9
               2     -841.3354832032  1.14D-03  1.43D-04    923.4
               3     -841.3354832406  1.58D-06  2.64D-07   1211.7
Time for solution = 1263.4s


Thanks!

Regards,
Dominic

Note:
Here is some additional information on the runtime environment, the input, the compilation environment, and snapshots of the calculation outputs:


(1) The runtime environment was set with this script, environment.sh:
module load intel
module load impi/2017_Update_1

WORKDIR=/home/users/astar/ihpc/chiensh/nwtest
export EXE1=/home/users/astar/ihpc/chiensh/nwchem-mpipr/bin/LINUX64/nwchem
export EXE2=/home/users/astar/ihpc/chiensh/nwchem-openib/bin/LINUX64/nwchem
export EXE3=/home/users/astar/ihpc/chiensh/nwchem-armci-mpi/bin/LINUX64/nwchem
export CASPER=/home/users/astar/ihpc/chiensh/local/lib/libcasper.so

ulimit -s unlimited
#export OMP_STACKSIZE=32M
export I_MPI_PIN=1
#export I_MPI_DEBUG=4
unset I_MPI_DEBUG
export KMP_BLOCKTIME=1
#export KMP_AFFINITY=scatter,verbose
export KMP_AFFINITY=scatter
export ARMCI_USE_WIN_ALLOCATE=1

export OMP_NUM_THREADS=60
#export MKL_NUM_THREADS=2
export MKL_DYNAMIC=false
export OMP_NESTED=true


(2) Some modifications were made to the scf and tce sections; they are shown below:
scf
vectors input pentacene.movecs
thresh 1e-4
#semidirect filesize 0 memsize 1000000000
end

tce
freeze atomic
ccsd(t)
thresh 1e-4
io ga
tilesize 50
attilesize 40
2EORB
2EMET 15
end

set tce:nts T
set tce:tceiop 2048
#set tce:writeint T
set tce:readint T
#set tce:writet T
set tce:readt T
set tce:xmem 1000
task tce energy

(3) Snapshot of the outputs
i. For ARMCI_NETWORK=MPI-PR
> mpirun -perhost 2                                            -n 40  $EXE1 ccsdt.nw 

          Job information
---------------
hostname = r1i0n4
program = /home/users/astar/ihpc/chiensh/nwchem-mpipr/bin/LINUX64/nwchem
date = Fri Mar 31 17:38:54 2017
compiled = Fri_Mar_31_15:02:17_2017
source = /home/users/astar/ihpc/chiensh/nwchem-mpipr
nwchem branch = 6.6
nwchem revision = 27746
ga revision = 10594
input = ccsdt.nw
prefix = pentacene.
data base = ./pentacene.db
status = startup
nproc = 20
time left = -1s

Memory information
------------------
heap = 65536000 doubles = 500.0 Mbytes
stack = ********** doubles = 90000.0 Mbytes
global = ********** doubles = 100000.0 Mbytes (distinct from heap & stack)
total = ********** doubles = 190500.0 Mbytes
verify = no
hardfail = no
...
NWChem SCF Module
-----------------
ao basis = "ao basis"
functions = 378
atoms = 36
closed shells = 73
open shells = 0
charge = 0.00
wavefunction = RHF
input vectors = ./pentacene.movecs
output vectors = ./pentacene.movecs
use symmetry = F
symmetry adapt = F
...

----------------------------------------------
Quadratically convergent ROHF
Convergence threshold  : 1.000E-04
Maximum no. of iterations : 30
Final Fock-matrix accuracy: 1.000E-07
----------------------------------------------

Integral file = ./pentacene.aoints.00
Record size in doubles = 65536 No. of integs per rec = 32766
Max. records in memory = 3937 Max. records in file = 16943190
No. of bits per label = 16 No. of bits per value = 64

#quartets = 3.409D+07 #integrals = 3.020D+08 #direct = 0.0% #cached =100.0%

File balance: exchanges= 68 moved= 1370 time= 0.2

iter energy gnorm gmax time
----- ------------------- --------- --------- --------
1 -841.3354871523 2.54D-03 3.19D-04 49.8
2 -841.3354832032 1.14D-03 1.43D-04 89.9
3 -841.3354832406 1.57D-06 2.64D-07 113.7

Final RHF results
------------------
Total SCF energy = -841.335483240608
One-electron energy = -4113.018438040344
Two-electron energy = 1774.091366579281
Nuclear repulsion energy = 1497.591588220455

Time for solution = 111.5s


ii. ARMCI_NETWORK=OpenIB
>mpirun -perhost 1                                            -n 20  $EXE2 ccsdt.nw

          Job information
---------------
hostname = r1i0n4
program = /home/users/astar/ihpc/chiensh/nwchem-openib/bin/LINUX64/nwchem
date = Fri Mar 31 16:08:37 2017
compiled = Thu_Mar_30_19:10:59_2017
source = /home/users/astar/ihpc/chiensh/nwchem-openib
nwchem branch = Development
nwchem revision = 27746
ga revision = 10594
input = ccsdt.nw
prefix = pentacene.
data base = ./pentacene.db
status = startup
nproc = 20
time left = -1s



Memory information
------------------
heap = 65535998 doubles = 500.0 Mbytes
stack = ********** doubles = 90000.0 Mbytes
global = ********** doubles = 100000.0 Mbytes (distinct from heap & stack)
total = ********** doubles = 190500.0 Mbytes
verify = no
hardfail = no
...
----------------------------------------------
Quadratically convergent ROHF

Convergence threshold  : 1.000E-04
Maximum no. of iterations : 30
Final Fock-matrix accuracy: 1.000E-07
----------------------------------------------


Integral file = ./pentacene.aoints.00
Record size in doubles = 65536 No. of integs per rec = 32766
Max. records in memory = 3937 Max. records in file = 16908036
No. of bits per label = 16 No. of bits per value = 64


#quartets = 3.409D+07 #integrals = 3.020D+08 #direct = 0.0% #cached =100.0%


File balance: exchanges= 77 moved= 1081 time= 0.1


iter energy gnorm gmax time
----- ------------------- --------- --------- --------
1 -841.3354871523 2.54D-03 3.19D-04 53.4
2 -841.3354832032 1.14D-03 1.43D-04 95.9
3 -841.3354832406 1.57D-06 2.64D-07 121.1


Final RHF results
------------------

Total SCF energy = -841.335483240607
One-electron energy = -4113.018438056843
Two-electron energy = 1774.091366595781
Nuclear repulsion energy = 1497.591588220455

Time for solution = 116.4s


iii. ARMCI_NETWORK=ARMCI-MPI
>mpirun -perhost 1 -genv CSP_NG 1  -genv LD_PRELOAD $CASPER   -n 40  $EXE3 ccsdt.nw

          Job information
---------------

hostname = r1i0n4
program = /home/users/astar/ihpc/chiensh/nwchem-armci-mpi/bin/LINUX64/nwchem
date = Fri Mar 31 15:20:48 2017
compiled = Fri_Mar_31_14:18:43_2017
source = /home/users/astar/ihpc/chiensh/nwchem-armci-mpi
nwchem branch = 6.6
nwchem revision = 27746
ga revision = 10594
input = ccsdt.nw
prefix = pentacene.
data base = ./pentacene.db
status = startup
nproc = 20
time left = -1s



Memory information
------------------
heap = 65535998 doubles = 500.0 Mbytes
stack = ********** doubles = 90000.0 Mbytes
global = ********** doubles = 100000.0 Mbytes (distinct from heap & stack)
total = ********** doubles = 190500.0 Mbytes
verify = no
hardfail = no
...
----------------------------------------------
Quadratically convergent ROHF

Convergence threshold  : 1.000E-04
Maximum no. of iterations : 30
Final Fock-matrix accuracy: 1.000E-07
----------------------------------------------


Integral file = ./pentacene.aoints.00
Record size in doubles = 65536 No. of integs per rec = 32766
Max. records in memory = 3937 Max. records in file = 16943273
No. of bits per label = 16 No. of bits per value = 64


#quartets = 3.409D+07 #integrals = 3.020D+08 #direct = 0.0% #cached =100.0%


File balance: exchanges= 67 moved= 1250 time= 0.1


iter energy gnorm gmax time
----- ------------------- --------- --------- --------
1 -841.3354871523 2.54D-03 3.19D-04 236.9
2 -841.3354832032 1.14D-03 1.43D-04 923.4
3 -841.3354832406 1.58D-06 2.64D-07 1211.7


Final RHF results
------------------

Total SCF energy = -841.335483240606
One-electron energy = -4113.018438085244
Two-electron energy = 1774.091366624183
Nuclear repulsion energy = 1497.591588220455

Time for solution = 1263.4s



iv. All copies of NWChem were built with the same script, build.sh (shown here as configured for the ARMCI-MPI build):
#export NWCHEM_TOP=/home/users/astar/ihpc/chiensh/nwchem-mpipr
#export NWCHEM_TOP=/home/users/astar/ihpc/chiensh/nwchem-openib
export NWCHEM_TOP=/home/users/astar/ihpc/chiensh/nwchem-armci-mpi
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=ARMCI
export EXTERNAL_ARMCI_PATH=/home/users/astar/ihpc/chiensh/local
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LARGE_FILES=TRUE
export ENABLE_COMPONENT=yes
export USE_OPENMP=y
export LIB_DEFINES=" -DDFLT_TOT_MEM=391422080"
export DISABLE_GAMIRROR=y
export GA_STABLE=ga-5-5

export BLAS_SIZE=8
export BLASOPT=" -Wl,--start-group -L$MKLROOT/lib/intel64 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -Wl,--end-group -liomp5 -lpthread -lm -ldl -qopenmp "
#export BLASOPT="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64 -lmkl_avx512 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -qopenmp -lpthread "
export LAPACK_LIB=$BLASOPT
#export LAPACK_LIB="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64 -lmkl_avx512 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -qopenmp -lpthread"

export USE_SCALAPACK=yes
export SCALAPACK_SIZE=8
export SCALAPACK="-L/$MKLROOT/lib/intel64 -lmkl_blacs_intelmpi_ilp64 -lmkl_scalapack_ilp64 "
export SCALAPACK_LIB="$SCALAPACK $BLASOPT"
#export SCALAPACK="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64 -lmkl_avx512 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64 "
#export SCALAPACK_LIB="$SCALAPACK -qopenmp -lpthread"

#unset SCALAPACK
#unset SCALAPACK_SIZE
#unset SCALAPACK_LIB

#export IB_HOME=/usr
#export IB_INCLUDE=$IB_HOME/include/infiniband
#export IB_LIB=$IB_HOME/lib64
#export IB_LIB_NAME="-libumad -libverbs -lpthread"

export MPI_INCLUDE="-I/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mpi/intel64/include "
export LIBMPI="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mpi/intel64/lib -lmpi -lmpifort -lmpi_ilp64 -lpthread"

export USE_PYTHON64=y
export PYTHONVERSION=2.7
export PYTHONHOME=/usr
export PYTHONLIBTYPE=so

export NWCHEM_MODULES="all python"
#export NWCHEM_MODULES="all"
#export NWCHEM_MODULES=smallqm

export CCSDTQ=y
export CCSDTLR=y
export MRCC_METHODS=TRUE
#unset CCSDTQ
#unset CCSDTLR
#unset MRCC_METHODS

export CC=icc
export FC=ifort
export F77=ifort
export F90=ifort
export CXX=icpc
export MPICC=icc
export MPIFC=ifort
export FOPTIMIZE="-O3 -qopenmp -heap-arrays 64 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -funroll-loops -unroll-aggressive "
export COPTIMIZE="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -funroll-loops "

v. ARMCI-MPI and Casper were built with these configure commands
a. ARMCI-MPI
 ../configure MPICC=mpiicc CFLAGS=-L/home/users/astar/ihpc/chiensh/local/lib  MPIEXEC=mpirun --enable-explicit-progress --prefix=/home/users/astar/ihpc/chiensh/local

b. Casper
../configure --prefix=/home/users/astar/ihpc/chiensh/local CC=mpiicc

Forum Vet
Dominic,
Thanks a lot for your feedback.
You might want to try the following variables for MPI-PR, since moving as much MPI communication as possible to the eager protocol should decrease latency:

setenv I_MPI_EAGER_THRESHOLD 32768
setenv I_MPI_DAPL_DIRECT_COPY_THRESHOLD 32768
setenv I_MPI_FABRICS dapl
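
For a bash setup like the environment.sh script above, the same three settings in export form would be (a straightforward translation of the setenv lines, added here for convenience):

export I_MPI_EAGER_THRESHOLD=32768
export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=32768
export I_MPI_FABRICS=dapl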

Clicked A Few Times
Thank you Edo,

(1) What is a suitable combination of OMP_NUM_THREADS and MKL_NUM_THREADS on a 64-core KNL system? I believe SCF uses thread-based parallelization through MKL, and based on my observation, if I assign 1 MPI rank per node (for OpenIB) or 2 MPI ranks per node (for MPI-PR), no more than about 20 cores are used, regardless of the MKL_NUM_THREADS setting (I set it to 60). In other words, I have to assign more MPI ranks to each node to fully utilize all 64 cores...

Update: I assigned 5+1 MPI ranks on each node, and the SCF speeds up roughly 5 times. Am I supposed to run 60 MPI ranks on each node?
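
For concreteness, a sketch of the 5+1 ranks-per-node layout described above (the thread counts and the 20-node total are illustrative assumptions, not values taken from these runs):

# 6 MPI ranks per node on a 64-core KNL (MPI-PR: 5 compute ranks + 1 progress rank),
# i.e. roughly 64/6 ~ 10 threads per rank; numbers are illustrative only
export OMP_NUM_THREADS=10
export MKL_NUM_THREADS=10
mpirun -perhost 6 -n 120 $EXE1 ccsdt.nw    # 6 ranks/node x 20 nodes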

(2) For ARMCI-MPI+Casper, I observed that the SCF (and the subsequent integral transformation for the CCSD calculation) runs with only 1 thread on each node, which is why it is about 4 times slower than OpenIB and MPI-PR. How do I enable multithreaded parallelization for ARMCI-MPI+Casper?
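
To verify the thread and rank placement, the diagnostic settings that are already present (commented out) in environment.sh can be re-enabled:

export KMP_AFFINITY=scatter,verbose    # prints the OpenMP thread-to-core bindings at startup
export I_MPI_DEBUG=4                   # prints the Intel MPI process pinning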


(3) For OpenIB and MPI-PR, the SCF and CCSD calculations finish in half an hour on 20 KNL nodes, but the job gets stuck in the (T) calculation; it could not finish within 20 hours.


 Iterations converged
CCSD correlation energy / hartree = -2.916922299620284
CCSD total energy / hartree = -844.252405540227414

Singles contributions

Doubles contributions
CCSD(T)
Using plain CCSD(T) code
Using sliced CCSD(T) code

If I disable the sliced (T) algorithm, I get a "ccsd_t: MA error sgl 17596287801"; I believe it is due to insufficient local memory.
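
A rough sanity check, assuming the number in the MA error is a request in 8-byte words (an assumption on my part):

echo $(( 17596287801 * 8 / 1024**3 ))    # prints 131 (GiB), which would exceed the ~90 GB MA stack per process shown in the memory information above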

Three weeks ago, Thomas Dunning presented the NWChemEx project at a conference in Singapore, and he mentioned that they achieved 1 PF/s on 20K Blue Waters nodes with a more efficient (T) algorithm. Has this been implemented in the current version of NWChem?

Thanks!

~Dominic

Clicked A Few Times
CCSD can be calculated fairly efficiently on KNL, but I cannot get the (T) calculation to complete. Does anyone have the same problem?

Gets Around
The CCSD(T) results that Dunning presented likely correspond to http://pubs.acs.org/doi/abs/10.1021/ct500404c and I think they used the semidirect module rather than TCE.
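
For reference, a minimal sketch of what the semidirect (non-TCE) route would look like in the input, assuming the same frozen core as the TCE run above; this is only an illustration, not a tested input for this system:

ccsd
  freeze atomic
end
task ccsd(t)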

Gets Around
I don't know what the issue with ARMCI-MPI+Casper is, but you should just use whatever works best. ARMCI-MPI exists as an alternative for cases where ARMCI isn't working well. If ARMCI works well, there is no reason not to use it. My guess is there is something wrong in the usage/configuration of ARMCI-MPI, Casper, or MPI, but I don't have time to debug it right now.

Gets Around
"mpirun -perhost 1 -genv CSP_NG 1" runs one MPI rank per node, does it not? I don't know how Casper even functions in that case. You need to launch N+G processes per node, where N is the number of application processes per node, and CSP_NG=G.

Clicked A Few Times
Thanks Jeff!
However, I believe the main reason my copy of ARMCI-MPI+Casper is much slower is that it runs only a single thread, while the other copies use at least 3-4 threads on each MPI rank; I don't know why the OpenMP and MKL threads are not running in this case.

Thanks to Edo's help, I managed to get the (T) calculation to complete using the source code downloaded from the developer trunk (setting USE_F90_ALLOCATABLE=T and USE_KNL=y, which adds -DINTEL_64ALIGN to the compilation).
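
For completeness, this is how I understand those settings are added to the build.sh shown above (assuming they are plain environment variables like the others; their exact effect depends on the development-trunk makefiles):

export USE_KNL=y                  # enables the KNL-specific flags, e.g. -DINTEL_64ALIGN, per the description above
export USE_F90_ALLOCATABLE=T      # per its name, use Fortran 90 allocatable arrays in the (T) code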

CCSD(T)
Using plain CCSD(T) code
Using sliced CCSD(T) code

CCSD[T] correction energy / hartree = -0.150973709276513
CCSD[T] correlation energy / hartree = -3.067895958443244
CCSD[T] total energy / hartree = -844.403379199049482
CCSD(T) correction energy / hartree = -0.147995713401607
CCSD(T) correlation energy / hartree = -3.064917962568339
CCSD(T) total energy / hartree = -844.400401203174624
Cpu & wall time / sec 66318.6 20408.5

Parallel integral file used 12303 records with 0 large values


Task times cpu: 68936.9s wall: 22442.4s



NWChem Input Module
-------------------


Summary of allocated global arrays
-----------------------------------
No active global arrays



On the other hand, in the developer trunk, the updated subroutine grad_v_lr_loca in src/nwpw/pspw/lib/psp/psp.F does not compile with ifort Version 17.0.1.132 when -O2/-O3 and -qopenmp are used together, because of an internal compiler error:

ifort  -c -i8 -align -fpp -qopt-report-file=stderr -qopenmp -qopt-report-phase=openmp -qno-openmp-offload -fimf-arch-consistency=true -finline-limit=250 -O3  -unroll  -ip -xMIC-AVX512  -I.  -I/home/users/astar/ihpc/chiensh/nwchem-dev/src/include -I/home/users/astar/ihpc/chiensh/nwchem-dev/src/tools/install/include -DEXT_INT -DLINUX -DLINUX64 -DUSE_OPENMP  -DIFCV8 -DIFCLINUX -DINTEL_64ALIGN -DINTEL_64ALIGN -DSCALAPACK -DPARALLEL_DIAG -DUSE_F90_ALLOCATABLE psp.F
Intel(R) Advisor can now assist with vectorization and show optimization
report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


Begin optimization report for: PSP_PRJ_INDX_ALLOC_SW1A_SW2A

Report from: OpenMP optimizations [openmp]

psp.F(684:7-684:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_: OpenMP multithreaded code generation for MASTER was successful
psp.F(687:7-687:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_: OpenMP multithreaded code generation for BARRIER was successful
psp.F(700:7-700:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_: OpenMP multithreaded code generation for MASTER was successful
psp.F(704:7-704:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_: OpenMP multithreaded code generation for BARRIER was successful
===========================================================================
ifort: error #10105: /home/users/app/intel/compilers_and_libraries_2017.1.132/linux/bin/intel64/fortcom: core dumped
ifort: warning #10102: unknown signal(-326317584)
Segmentation fault (core dumped)


A small modification, shown below, seems to fix the problem:

c --- declaration (around line 1236 of psp.F): add a scalar temporary
c      integer ftmp(2)
       integer ftmp(2),ftemp
...
c --- loop (around line 1436): compute the index once per iteration.
c --- ftemp is made private on the DO directive here; drop the clause
c --- if the enclosing parallel region already declares it private.
!$OMP DO PRIVATE(ftemp)
      do j=1,nion
         ftemp = ftmp(1)+3*(j-1)
         fion(1,j) = fion(1,j) + dbl_mb(ftemp)
         fion(2,j) = fion(2,j) + dbl_mb(ftemp+1)
         fion(3,j) = fion(3,j) + dbl_mb(ftemp+2)
c        fion(1,j) = fion(1,j) + dbl_mb(ftmp(1)+3*(j-1))
c        fion(2,j) = fion(2,j) + dbl_mb(ftmp(1)+3*(j-1)+1)
c        fion(3,j) = fion(3,j) + dbl_mb(ftmp(1)+3*(j-1)+2)
      end do
!$OMP END DO

