SCF Performance for Different ARMCI Networks on a Socket-based KNL Cluster


Hi All,

I have been evaluating which ARMCI_NETWORK is most suitable for a socket-based KNL cluster. The processors are KNL 7210 @ 1.30GHz, each node has 192GB of memory, and the OS is RedHat 7.3.

ARMCI_NETWORK=MPI-PR, OpenIB and ARMCI-MPI+Casper have been evaluated. All three copies of NWChem were compiled with the same set of environment variables; the build script is given at the bottom of this post. Intel Compiler v17.0.1 and Intel MPI Version 2017 Update 1 Build 20161016 were used for the compilation.

Initially I wanted to evaluate the performance of a TCE CCSD(T) calculation on the pentacene molecule used in the NWChem benchmark (http://nwchemgit.github.io/images/Pentacene_ccsdt.nw). However, the SCF step with ARMCI-MPI+Casper is far slower than with the other ARMCI networks, so I would like to check whether any part of my compilation or runtime settings is incorrect before going on to the CCSD level:


For MPI-PR
 iter       energy          gnorm     gmax      time
----- ------------------- --------- --------- --------
    1     -841.3354871523  2.54D-03  3.19D-04     49.8
    2     -841.3354832032  1.14D-03  1.43D-04     89.9
    3     -841.3354832406  1.57D-06  2.64D-07    113.7
Time for solution = 111.5s

For OpenIB
 iter       energy          gnorm     gmax      time
----- ------------------- --------- --------- --------
    1     -841.3354871523  2.54D-03  3.19D-04     53.4
    2     -841.3354832032  1.14D-03  1.43D-04     95.9
    3     -841.3354832406  1.57D-06  2.64D-07    121.1
Time for solution = 116.4s

For ARMCI-MPI
 iter       energy          gnorm     gmax      time
----- ------------------- --------- --------- --------
    1     -841.3354871523  2.54D-03  3.19D-04    236.9
    2     -841.3354832032  1.14D-03  1.43D-04    923.4
    3     -841.3354832406  1.58D-06  2.64D-07   1211.7
Time for solution = 1263.4s
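
For a quick sense of scale, the three "Time for solution" values can be turned into ratios relative to MPI-PR. This is only a small sketch of the arithmetic; the variable and function names are made up for illustration, and the numbers are copied from the tables in this post.

```shell
#!/bin/sh
# "Time for solution" values (seconds) copied from the three runs in this post.
mpipr=111.5
openib=116.4
armci_mpi=1263.4

# Ratio of one solution time to a reference time, rounded to one decimal.
# (ratio is a hypothetical helper name, not an NWChem tool.)
ratio() {
    awk -v t="$1" -v ref="$2" 'BEGIN { printf "%.1f", t / ref }'
}

openib_ratio=$(ratio "$openib" "$mpipr")
armci_mpi_ratio=$(ratio "$armci_mpi" "$mpipr")

echo "OpenIB    / MPI-PR: ${openib_ratio}x"
echo "ARMCI-MPI / MPI-PR: ${armci_mpi_ratio}x"
```

With these numbers OpenIB comes out at about 1.0x and ARMCI-MPI+Casper at about 11.3x the MPI-PR time, while the energies agree to the printed precision.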


Thanks!

Regards,
Dominic

Note:
Here is some additional information on the runtime environment, the input, the build environment, and snapshots of the calculation outputs:


(1) The runtime environment was set by this script, environment.sh:
module load intel
module load impi/2017_Update_1

WORKDIR=/home/users/astar/ihpc/chiensh/nwtest
export EXE1=/home/users/astar/ihpc/chiensh/nwchem-mpipr/bin/LINUX64/nwchem
export EXE2=/home/users/astar/ihpc/chiensh/nwchem-openib/bin/LINUX64/nwchem
export EXE3=/home/users/astar/ihpc/chiensh/nwchem-armci-mpi/bin/LINUX64/nwchem
export CASPER=/home/users/astar/ihpc/chiensh/local/lib/libcasper.so

ulimit -s unlimited
#export OMP_STACKSIZE=32M
export I_MPI_PIN=1
#export I_MPI_DEBUG=4
unset I_MPI_DEBUG
export KMP_BLOCKTIME=1
#export KMP_AFFINITY=scatter,verbose
export KMP_AFFINITY=scatter
export ARMCI_USE_WIN_ALLOCATE=1

export OMP_NUM_THREADS=60
#export MKL_NUM_THREADS=2
export MKL_DYNAMIC=false
export OMP_NESTED=true
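
Since the question is whether a runtime setting is wrong, one cheap thing to rule out is a bad path in EXE1, EXE2, EXE3 or CASPER (the last one is what LD_PRELOAD points at). The following is a hypothetical pre-flight check, not part of environment.sh; check_path is a made-up helper name and it only tests that each file is readable.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original environment.sh.
# Reports whether the given path is a readable file.
check_path() {
    label=$1
    path=$2
    if [ -r "$path" ]; then
        echo "ok:      $label = $path"
        return 0
    fi
    echo "missing: $label = $path"
    return 1
}

# Check the three NWChem binaries and the Casper library named above.
# Unset variables expand to an empty path and are reported as missing.
missing=0
check_path EXE1 "${EXE1:-}" || missing=$((missing + 1))
check_path EXE2 "${EXE2:-}" || missing=$((missing + 1))
check_path EXE3 "${EXE3:-}" || missing=$((missing + 1))
check_path CASPER "${CASPER:-}" || missing=$((missing + 1))
echo "$missing of 4 paths missing or unreadable"
```

It would be run after sourcing environment.sh, so that the exported variables are visible to it.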


(2) The scf and tce sections of the input were modified as shown below:
scf
vectors input pentacene.movecs
thresh 1e-4
#semidirect filesize 0 memsize 1000000000
end

tce
freeze atomic
ccsd(t)
thresh 1e-4
io ga
tilesize 50
attilesize 40
2EORB
2EMET 15
end

set tce:nts T
set tce:tceiop 2048
#set tce:writeint T
set tce:readint T
#set tce:writet T
set tce:readt T
set tce:xmem 1000
task tce energy

(3) Snapshots of the outputs
i. For ARMCI_NETWORK=MPI-PR
> mpirun -perhost 2 -n 40 $EXE1 ccsdt.nw

          Job information
---------------
hostname = r1i0n4
program = /home/users/astar/ihpc/chiensh/nwchem-mpipr/bin/LINUX64/nwchem
date = Fri Mar 31 17:38:54 2017
compiled = Fri_Mar_31_15:02:17_2017
source = /home/users/astar/ihpc/chiensh/nwchem-mpipr
nwchem branch = 6.6
nwchem revision = 27746
ga revision = 10594
input = ccsdt.nw
prefix = pentacene.
data base = ./pentacene.db
status = startup
nproc = 20
time left = -1s

Memory information
------------------
heap = 65536000 doubles = 500.0 Mbytes
stack = ********** doubles = 90000.0 Mbytes
global = ********** doubles = 100000.0 Mbytes (distinct from heap & stack)
total = ********** doubles = 190500.0 Mbytes
verify = no
hardfail = no
...
NWChem SCF Module
-----------------
ao basis = "ao basis"
functions = 378
atoms = 36
closed shells = 73
open shells = 0
charge = 0.00
wavefunction = RHF
input vectors = ./pentacene.movecs
output vectors = ./pentacene.movecs
use symmetry = F
symmetry adapt = F
...

----------------------------------------------
Quadratically convergent ROHF
Convergence threshold  : 1.000E-04
Maximum no. of iterations : 30
Final Fock-matrix accuracy: 1.000E-07
----------------------------------------------

Integral file = ./pentacene.aoints.00
Record size in doubles = 65536 No. of integs per rec = 32766
Max. records in memory = 3937 Max. records in file = 16943190
No. of bits per label = 16 No. of bits per value = 64

#quartets = 3.409D+07 #integrals = 3.020D+08 #direct = 0.0% #cached =100.0%

File balance: exchanges= 68 moved= 1370 time= 0.2

iter energy gnorm gmax time
----- ------------------- --------- --------- --------
1 -841.3354871523 2.54D-03 3.19D-04 49.8
2 -841.3354832032 1.14D-03 1.43D-04 89.9
3 -841.3354832406 1.57D-06 2.64D-07 113.7

Final RHF results
------------------
Total SCF energy = -841.335483240608
One-electron energy = -4113.018438040344
Two-electron energy = 1774.091366579281
Nuclear repulsion energy = 1497.591588220455

Time for solution = 111.5s


ii. For ARMCI_NETWORK=OpenIB
> mpirun -perhost 1 -n 20 $EXE2 ccsdt.nw

          Job information
---------------
hostname = r1i0n4
program = /home/users/astar/ihpc/chiensh/nwchem-openib/bin/LINUX64/nwchem
date = Fri Mar 31 16:08:37 2017
compiled = Thu_Mar_30_19:10:59_2017
source = /home/users/astar/ihpc/chiensh/nwchem-openib
nwchem branch = Development
nwchem revision = 27746
ga revision = 10594
input = ccsdt.nw
prefix = pentacene.
data base = ./pentacene.db
status = startup
nproc = 20
time left = -1s



Memory information
------------------
heap = 65535998 doubles = 500.0 Mbytes
stack = ********** doubles = 90000.0 Mbytes
global = ********** doubles = 100000.0 Mbytes (distinct from heap & stack)
total = ********** doubles = 190500.0 Mbytes
verify = no
hardfail = no
...
----------------------------------------------
Quadratically convergent ROHF

Convergence threshold  : 1.000E-04
Maximum no. of iterations : 30
Final Fock-matrix accuracy: 1.000E-07
----------------------------------------------


Integral file = ./pentacene.aoints.00
Record size in doubles = 65536 No. of integs per rec = 32766
Max. records in memory = 3937 Max. records in file = 16908036
No. of bits per label = 16 No. of bits per value = 64


#quartets = 3.409D+07 #integrals = 3.020D+08 #direct = 0.0% #cached =100.0%


File balance: exchanges= 77 moved= 1081 time= 0.1


iter energy gnorm gmax time
----- ------------------- --------- --------- --------
1 -841.3354871523 2.54D-03 3.19D-04 53.4
2 -841.3354832032 1.14D-03 1.43D-04 95.9
3 -841.3354832406 1.57D-06 2.64D-07 121.1


Final RHF results
------------------

Total SCF energy = -841.335483240607
One-electron energy = -4113.018438056843
Two-electron energy = 1774.091366595781
Nuclear repulsion energy = 1497.591588220455

Time for solution = 116.4s


iii. For ARMCI_NETWORK=ARMCI-MPI
> mpirun -perhost 1 -genv CSP_NG 1 -genv LD_PRELOAD $CASPER -n 40 $EXE3 ccsdt.nw

          Job information
---------------

hostname = r1i0n4
program = /home/users/astar/ihpc/chiensh/nwchem-armci-mpi/bin/LINUX64/nwchem
date = Fri Mar 31 15:20:48 2017
compiled = Fri_Mar_31_14:18:43_2017
source = /home/users/astar/ihpc/chiensh/nwchem-armci-mpi
nwchem branch = 6.6
nwchem revision = 27746
ga revision = 10594
input = ccsdt.nw
prefix = pentacene.
data base = ./pentacene.db
status = startup
nproc = 20
time left = -1s



Memory information
------------------
heap = 65535998 doubles = 500.0 Mbytes
stack = ********** doubles = 90000.0 Mbytes
global = ********** doubles = 100000.0 Mbytes (distinct from heap & stack)
total = ********** doubles = 190500.0 Mbytes
verify = no
hardfail = no
...
----------------------------------------------
Quadratically convergent ROHF

Convergence threshold  : 1.000E-04
Maximum no. of iterations : 30
Final Fock-matrix accuracy: 1.000E-07
----------------------------------------------


Integral file = ./pentacene.aoints.00
Record size in doubles = 65536 No. of integs per rec = 32766
Max. records in memory = 3937 Max. records in file = 16943273
No. of bits per label = 16 No. of bits per value = 64


#quartets = 3.409D+07 #integrals = 3.020D+08 #direct = 0.0% #cached =100.0%


File balance: exchanges= 67 moved= 1250 time= 0.1


iter energy gnorm gmax time
----- ------------------- --------- --------- --------
1 -841.3354871523 2.54D-03 3.19D-04 236.9
2 -841.3354832032 1.14D-03 1.43D-04 923.4
3 -841.3354832406 1.58D-06 2.64D-07 1211.7


Final RHF results
------------------

Total SCF energy = -841.335483240606
One-electron energy = -4113.018438085244
Two-electron energy = 1774.091366624183
Nuclear repulsion energy = 1497.591588220455

Time for solution = 1263.4s
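
To see which iteration the extra time goes into, the "time" column can be differenced between consecutive iterations. This sketch assumes the column is cumulative (an assumption, based on the values increasing from one iteration to the next); the variable and function names are made up for illustration, and the numbers are copied from the MPI-PR and ARMCI-MPI outputs above.

```shell
#!/bin/sh
# "time" column (seconds) for iterations 1-3, copied from the
# MPI-PR and ARMCI-MPI outputs above.
mpipr_times="49.8 89.9 113.7"
armci_mpi_times="236.9 923.4 1211.7"

# Print the differences between consecutive entries, assuming the
# column is cumulative. (deltas is a hypothetical helper name.)
deltas() {
    echo "$1" | awk '{
        out = ""
        for (i = 2; i <= NF; i++) {
            sep = (i > 2) ? " " : ""
            out = out sep sprintf("%.1f", $i - $(i - 1))
        }
        print out
    }'
}

mpipr_deltas=$(deltas "$mpipr_times")
armci_mpi_deltas=$(deltas "$armci_mpi_times")

echo "MPI-PR    iter 1->2, 2->3: $mpipr_deltas s"
echo "ARMCI-MPI iter 1->2, 2->3: $armci_mpi_deltas s"
```

Under that assumption, the step from iteration 1 to 2 takes 40.1 s with MPI-PR and 686.5 s with ARMCI-MPI+Casper, so the slowdown is not uniform across iterations.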



iv. All three copies of NWChem were built with the same script, build.sh (shown here as set up for the ARMCI-MPI build):
#export NWCHEM_TOP=/home/users/astar/ihpc/chiensh/nwchem-mpipr
#export NWCHEM_TOP=/home/users/astar/ihpc/chiensh/nwchem-openib
export NWCHEM_TOP=/home/users/astar/ihpc/chiensh/nwchem-armci-mpi
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=ARMCI
export EXTERNAL_ARMCI_PATH=/home/users/astar/ihpc/chiensh/local
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LARGE_FILES=TRUE
export ENABLE_COMPONENT=yes
export USE_OPENMP=y
export LIB_DEFINES=" -DDFLT_TOT_MEM=391422080"
export DISABLE_GAMIRROR=y
export GA_STABLE=ga-5-5

export BLAS_SIZE=8
export BLASOPT=" -Wl,--start-group -L$MKLROOT/lib/intel64 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -Wl,--end-group -liomp5 -lpthread -lm -ldl -qopenmp "
#export BLASOPT="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64 -lmkl_avx512 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -qopenmp -lpthread "
export LAPACK_LIB=$BLASOPT
#export LAPACK_LIB="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64 -lmkl_avx512 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -qopenmp -lpthread"

export USE_SCALAPACK=yes
export SCALAPACK_SIZE=8
export SCALAPACK="-L/$MKLROOT/lib/intel64 -lmkl_blacs_intelmpi_ilp64 -lmkl_scalapack_ilp64 "
export SCALAPACK_LIB="$SCALAPACK $BLASOPT"
#export SCALAPACK="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mkl/lib/intel64 -lmkl_avx512 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64 "
#export SCALAPACK_LIB="$SCALAPACK -qopenmp -lpthread"

#unset SCALAPACK
#unset SCALAPACK_SIZE
#unset SCALAPACK_LIB

#export IB_HOME=/usr
#export IB_INCLUDE=$IB_HOME/include/infiniband
#export IB_LIB=$IB_HOME/lib64
#export IB_LIB_NAME="-libumad -libverbs -lpthread"

export MPI_INCLUDE="-I/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mpi/intel64/include "
export LIBMPI="-L/home/users/app/intel/compilers_and_libraries_2017.1.132/linux/mpi/intel64/lib -lmpi -lmpifort -lmpi_ilp64 -lpthread"

export USE_PYTHON64=y
export PYTHONVERSION=2.7
export PYTHONHOME=/usr
export PYTHONLIBTYPE=so

export NWCHEM_MODULES="all python"
#export NWCHEM_MODULES="all"
#export NWCHEM_MODULES=smallqm

export CCSDTQ=y
export CCSDTLR=y
export MRCC_METHODS=TRUE
#unset CCSDTQ
#unset CCSDTLR
#unset MRCC_METHODS

export CC=icc
export FC=ifort
export F77=ifort
export F90=ifort
export CXX=icpc
export MPICC=icc
export MPIFC=ifort
export FOPTIMIZE="-O3 -qopenmp -heap-arrays 64 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -funroll-loops -unroll-aggressive "
export COPTIMIZE="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -funroll-loops "

v. ARMCI-MPI and Casper were built with these configure commands:
a. ARMCI-MPI
 ../configure MPICC=mpiicc CFLAGS=-L/home/users/astar/ihpc/chiensh/local/lib  MPIEXEC=mpirun --enable-explicit-progress --prefix=/home/users/astar/ihpc/chiensh/local

b. Casper
../configure --prefix=/home/users/astar/ihpc/chiensh/local CC=mpiicc