NWChem (development version) on KNL+Infiniband


Hi,

I downloaded the latest development copy of NWChem yesterday (17 Aug 2017) and was able to build it on our socket-based KNL system, which is interconnected with Mellanox EDR IB, using the Intel compilers (v.17 update 4), MKL and Intel MPI. The build needed a modification to /home/users/astar/ihpc/chiensh/nwchem/src/nwpw/pspw/lib/psp/psp.F, which should not be related to the errors I am reporting here. I reported the need for this modification in April in a different thread, "SCF Performance for Different ARMCI Network on Socket-based KNL Cluster".

I attempted to evaluate the performance of a single-point CCSD(T)/cc-pVQZ calculation for (H2O)8 (D2d, but set to C1 in the calculation) using 80 nodes (each node has 192 GB of memory and runs 16 MPI tasks, each with 4 OpenMP threads). I would like to compare the resource requirements and the CPU time of the direct and TCE calculations.
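For reference, the job geometry described above works out as follows. This is only a minimal sketch; the variable names are illustrative and are not taken from my actual job script.

```shell
#!/bin/sh
# Job geometry from the description above (variable names are illustrative).
NODES=80
RANKS_PER_NODE=16
THREADS_PER_RANK=4

TOTAL_RANKS=$((NODES * RANKS_PER_NODE))
THREADS_PER_NODE=$((RANKS_PER_NODE * THREADS_PER_RANK))

echo "total MPI ranks:  $TOTAL_RANKS"
echo "threads per node: $THREADS_PER_NODE"

# One common way to request the per-rank thread count in an OpenMP build.
export OMP_NUM_THREADS=$THREADS_PER_RANK
```

This gives 1280 MPI ranks in total and 64 threads per node.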

However, both jobs crashed. For the direct CCSD(T) calculation, I got these error messages:
Parallel integral file used  937379 records with       0 large values

257: WARNING:armci_set_mem_offset: offset changed 0 to 268435456
258: WARNING:armci_set_mem_offset: offset changed 0 to 268435456
260: WARNING:armci_set_mem_offset: offset changed 0 to 268435456
...
1213: WARNING:armci_set_mem_offset: offset changed 0 to 268435456
1214: WARNING:armci_set_mem_offset: offset changed 0 to 268435456
Not enough mem for st2. Increase MA by 550 MB
Not Enough Memory to keep ST2 in local memory - expect network congestion.
mlx5: r1i0n12: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
000000f8 00000000 00000000 00000000
00000000 9d003304 100034ac 49af7bd3
112: error ival=4
(rank:112 hostname:r1i0n12 pid:88063):ARMCI DASSERT fail. ../../ga-5.6.1/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
application called MPI_Abort(comm=0x84000007, 1) - process 112

For the TCE case, the job crashed with different error messages:
 Fock matrix recomputed
1-e file size = 831744
1-e file name = ./rubbish.w8.f1int.000000
Cpu & wall time / sec 48.2 28.0
4-electron integrals stored in orbital form
610: WARNING:armci_set_mem_offset: offset changed 0 to 280342528
611: WARNING:armci_set_mem_offset: offset changed 0 to 280342528
...
433: WARNING:armci_set_mem_offset: offset changed 0 to 268435456
434: WARNING:armci_set_mem_offset: offset changed 0 to 268435456

v2 file size = 93246499391
4-index algorithm nr. 15 is used
imaxsize = 40
imaxsize ichop = 0
starting step 1 at 317.43 secs
0:CreateSharedRegion:kr_malloc failed KB=: -1125173
(rank:0 hostname:r1i1n1 pid:169724):ARMCI DASSERT fail. ../../ga-5.6.1/armci/src/memory/shmem.c:Create_Shared_Region():1209 cond:0
Last System Error Message from Task 0:: No such file or directory
912:CreateSharedRegion:kr_malloc failed KB=: -1566421
(rank:912 hostname:r1i1n31 pid:49985):ARMCI DASSERT fail. ../../ga-5.6.1/armci/src/memory/shmem.c:Create_Shared_Region():1209 cond:0
736:CreateSharedRegion:kr_malloc failed KB=: -1566421
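
The kr_malloc failure is raised in ARMCI's shared-memory code (shmem.c), so it may be worth recording a few node-level limits alongside the log. The sketch below only prints them and changes nothing. Whether any of these limits is the actual cause here is an assumption; the log does not confirm it. ARMCI_DEFAULT_SHMMAX is the ARMCI environment variable for the shared-memory segment size; the function name report_shm_limits is just illustrative.

```shell
#!/bin/sh
# Print limits that commonly matter for shared-memory and pinned-memory
# allocation on Linux. Purely diagnostic; nothing is modified.
report_shm_limits() {
    echo "max locked memory (ulimit -l): $(ulimit -l 2>/dev/null || echo unknown)"
    if [ -r /proc/sys/kernel/shmmax ]; then
        echo "kernel.shmmax (bytes): $(cat /proc/sys/kernel/shmmax)"
    else
        echo "kernel.shmmax (bytes): unavailable"
    fi
    echo "free space in /dev/shm:"
    df -k /dev/shm 2>/dev/null | tail -n 1
    # ARMCI reads this variable at start-up; "unset" means the default is used.
    echo "ARMCI_DEFAULT_SHMMAX: ${ARMCI_DEFAULT_SHMMAX:-unset}"
}

report_shm_limits
```

Running it on one of the failing compute nodes (r1i1n1, for example) inside the same job environment would show the values the ranks actually see.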


Additional notes:
1) All compilation settings, inputs and outputs are given at the bottom of this message.
2) OPENIB is used for ARMCI_NETWORK here.
3) I also built with ARMCI_NETWORK=MPIPR and MPI3, but each of these builds has its own problems.
4) The MPIPR build can finish both the direct and TCE calculations, but the (T) correlation energies seem to differ from those of my previous build of the development copy obtained in April. I am going to verify whether this is also the case with this copy.
5) With the MPI3 build, the job hangs right after printing the basis set information.
6) With the OPENIB build, in addition to the errors reported here, GA seems unable to map the processes through Intel MPI and PBS Pro's mechanism (i.e. mpirun -np 1280 -ppn 16 does not work; I have to assign the number of tasks on each node in a hostfile).
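
For note 6, the hostfile workaround can be sketched roughly as below. It assumes a PBS Pro job in which $PBS_NODEFILE lists the allocated hosts, and it uses the host:count machine-file format of Intel MPI. The function name make_machinefile, the output file name and the input file name rubbish.w8.nw are illustrative only.

```shell
#!/bin/sh
# Write one "host:ppn" line per unique host found in a node file.
# $1 = path to the node file, $2 = MPI ranks to place on each node.
make_machinefile() {
    nodefile=$1
    ppn=$2
    sort -u "$nodefile" | awk -v ppn="$ppn" 'NF { print $1 ":" ppn }'
}

# Intended use inside the job script (commented out so the sketch has no
# side effects; file names are hypothetical):
# make_machinefile "$PBS_NODEFILE" 16 > machinefile.txt
# mpirun -machinefile machinefile.txt -np 1280 nwchem rubbish.w8.nw
```

Each line of the resulting file names a host and the number of ranks to start there, which replaces the -ppn 16 option that does not work with this build.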


Here is the input for both calculations. The memory is sufficient to complete the job with the build using ARMCI_NETWORK=MPIPR:
echo
start rubbish.w8
#scratch_dir /dev/shm/chiensh
#permanent_dir /home/users/astar/ihpc/chiensh/nwtest
memory stack 3000 mb heap 500 mb global 8000 mb noverify
#print medium "task time" "ga stats" "ma stats" "version" "rtdbvalues"
geometry units angstrom noautoz noprint
#---------------
#Octamer *** D2d
#---------------
O -1.46966769 1.46966769 1.34326600
O 1.46966769 -1.46966769 1.34326600
O 1.46966769 1.46966769 -1.34326600
O -1.46966769 -1.46966769 -1.34326600
O -1.36565412 1.36565412 -1.32090835
O 1.36565412 -1.36565412 -1.32090835
O 1.36565412 1.36565412 1.32090835
O -1.36565412 -1.36565412 1.32090835
H -2.10464162 2.10464162 1.68605609
H 2.10464162 -2.10464162 1.68605609
H 2.10464162 2.10464162 -1.68605609
H -2.10464162 -2.10464162 -1.68605609
H -1.52398844 1.52398844 .35383543
H 1.52398844 -1.52398844 .35383543
H 1.52398844 1.52398844 -.35383543
H -1.52398844 -1.52398844 -.35383543
H 1.51043211 .42340972 1.51629003
H -1.51043211 -.42340972 1.51629003
H .42340972 -1.51043211 -1.51629003
H -.42340972 1.51043211 -1.51629003
H -1.51043211 .42340972 -1.51629003
H 1.51043211 -.42340972 -1.51629003
H -.42340972 -1.51043211 1.51629003
H .42340972 1.51043211 1.51629003
symmetry group c1
end

basis "ao basis" spherical noprint
* library cc-pvqz
end

scf
# vectors input w8.movecs
# semidirect memsize 100000000 filesize 0
singlet
rhf
thresh 1e-7
tol2e 1e-14
end

#ccsd
#diisbas 2
#thresh 1e-4
#MAXITER 50
#freeze atomic
#nodisk
#end

#set ccsd:use_trpdrv_nb T
#set ccsd:use_ccsd_omp T
#set ccsd:use_trpdrv_omp T

#task ccsd(t) energy

tce
freeze atomic
ccsd(t)
thresh 1e-4
io ga
attilesize 40
tilesize 35
2EORB
2EMET 15
#split 8

print very
end

set tce:nts T
set tce:tceiop 1024
set tce:writeint T
set tce:writet T
set tce:xmem 100
task tce energy
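
As a sanity check on the memory directive in the input above, the per-node request can be worked out as follows. The directive applies to each MPI process, so it is multiplied by the 16 tasks per node. The conversion of 192 GB to 192*1024 MB is an assumption, and the variable names are illustrative.

```shell
#!/bin/sh
# Per-rank request from the "memory" directive above, in MB.
STACK_MB=3000
HEAP_MB=500
GLOBAL_MB=8000
RANKS_PER_NODE=16
NODE_MB=$((192 * 1024))   # assumption: 192 GB counted as 192*1024 MB

PER_RANK_MB=$((STACK_MB + HEAP_MB + GLOBAL_MB))
PER_NODE_MB=$((PER_RANK_MB * RANKS_PER_NODE))
HEADROOM_MB=$((NODE_MB - PER_NODE_MB))

echo "per rank: $PER_RANK_MB MB"
echo "per node: $PER_NODE_MB MB of $NODE_MB MB"
echo "headroom: $HEADROOM_MB MB"
```

Under these assumptions each rank requests 11500 MB and each node 184000 MB, which leaves 12608 MB per node for the operating system, MPI buffers and anything else outside the NWChem memory directive.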


Here are the variables I used to build NWChem:
module load  intel/17.0.4.196
export NWCHEM_TOP=/home/users/astar/ihpc/chiensh/nwchem-29377
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=OPENIB
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LARGE_FILES=TRUE
export ENABLE_COMPONENT=yes
export USE_OPENMP=y
export LIB_DEFINES=" -DDFLT_TOT_MEM=391422080"
export DISABLE_GAMIRROR=y
#export USE_GAGITHUB=y
export USE_GAGITHUB=n
export USE_KNL=T
export USE_F90_ALLOCATABLE=T

export BLAS_SIZE=8
export BLASOPT="-L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl"
export LAPACK_LIB=$BLASOPT

export USE_SCALAPACK=yes
export SCALAPACK_SIZE=8
export SCALAPACK="$BLASOPT -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64 "
export SCALAPACK_LIB=$SCALAPACK

export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include/infiniband
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-libumad -libverbs -lpthread"

export MPI_INCLUDE="-I/home/users/app/intel_psxe_2017_update4/impi/2017.3.196/intel64/include "
export LIBMPI="-L/home/users/app/intel_psxe_2017_update4/impi/2017.3.196/intel64/lib -lmpifort -lmpi_ilp64 -lmpi_mt -lmpi -lpthread"

export USE_PYTHON64=y
export PYTHONVERSION=2.7
export PYTHONHOME=/usr
export PYTHONLIBTYPE=so

export NWCHEM_MODULES="all python"
#export NWCHEM_MODULES="all"
#export NWCHEM_MODULES=smallqm

export CCSDTQ=y
export CCSDTLR=y
export MRCC_METHODS=TRUE
#unset CCSDTQ
#unset CCSDTLR
#unset MRCC_METHODS

export CC=icc
export FC=ifort
export F77=ifort
export F90=ifort
export CXX=icpc
#export MPICC=gcc
#export MPIFC=ifort

This is the configure command I used to build GA 5.6.1:
../ga-5.6.1/configure --prefix=/home/users/astar/ihpc/chiensh/nwchem-29377/src/tools/install --with-tcgmsg --with-mpi="-I/home/users/app/intel_psxe_2017_update4/impi/2017.3.196/intel64/include -L/home/users/app/intel_psxe_2017_update4/compilers_and_libraries_2017.4.196/linux/mpi/intel64/lib/release_mt -L/home/users/app/intel_psxe_2017_update4/compilers_and_libraries_2017.4.196/linux/mpi/intel64/lib -L/home/users/app/intel_psxe_2017_update4/impi/2017.3.196/intel64/lib -lmpifort -lmpi_ilp64 -lmpi_mt -lmpi -lpthread " --enable-peigs --enable-underscoring --disable-mpi-tests --with-scalapack8="-L/home/users/app/intel_psxe_2017_update4/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64" --with-lapack="-L/home/users/app/intel_psxe_2017_update4/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl" --with-blas8="-L/home/users/app/intel_psxe_2017_update4/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -ldl" --with-openib="/usr/include/infiniband /usr/lib64 -libumad -libverbs -lpthread" CC=icc CXX=icpc F77=ifort FFLAGS=-no-vec CFLAGS=-no-vec INTEL_64ALIGN=1 ARMCI_DEFAULT_SHMMAX_UBOUND=131072