I hope somebody can help me with a runtime (or maybe compilation?) problem with NWChem 6.1.1.
I should say first that I am just the system administrator who compiled the software for some users and am now trying to run a simple job to test the installation, so I don't have the background of a "real" NWChem user.
Everything is fine as long as the job runs on a single host, but across several hosts it always breaks during the CCSD(T) calculation with ARMCI errors like:
…
================================================================
the segmented parallel ccsd program: 24 nodes
================================================================
level of theory ccsd(t)
number of core 8
number of occupied 22
number of virtual 241
number of deleted 0
total functions 274
number of shells 122
basis label 566
==== ccsd parameters ====
iprt = 0
convi = 0.100E-07
maxit = 35
mxvec = 5
memory 235792780
IO offset 20.0000000000000
IO error message >End of File
file_read_ga: failing writing to /cl_tmp/winkl/1a_DC_CCSDT.t2
Failed reading restart vector from /cl_tmp/winkl/1a_DC_CCSDT.t2
Using MP2 initial guess vector
-------------------------------------------------------------------------
   iter   correlation    delta      rms       T2     Non-T2    Main
            energy       energy    error     ampl     ampl     Block
                                             time     time     time
-------------------------------------------------------------------------
0: error ival=4
(rank:0 hostname:f38 pid:26622):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
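(If I read the assertion correctly, this means an InfiniBand work request completed with an error status, i.e. pdscr->status != IBV_WC_SUCCESS, on the ARMCI data server.)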
Submit File (job scheduler is Son of Grid Engine):
…
#export MA_USE_ARMCI_MEM=YES
export ARMCI_DEFAULT_SHMMAX=16384
cd $TMPDIR
mpiexec /software/nwchem/nwchem-6.1.1/bin/LINUX64/nwchem /cl_tmp/winkl/1a_CCSDT_DZ.nw > /cl_tmp/winkl/1a_CCSDT_DZ.out
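(mpiexec is started without -np; as far as I understand OpenMPI's Grid Engine integration, it then launches one process per granted slot, which would match the "24 nodes" line in the failing output when two 12-core hosts are allocated.)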
NWChem Input File:
ECHO
START 1a_DC_CCSDT
PERMANENT_DIR /cl_tmp/winkl
SCRATCH_DIR /tmp
TITLE "Benzene + MeOH (MP2/6-31+G(d,p) geometry, CCSD(T)/aug-cc-pVDZ BSSE"
MEMORY stack 1700 heap 100 global 1700 MB
Geometry noautoz noautosym
C -0.037516 1.497715 -0.368972
C 1.183399 0.816740 -0.443203
C 1.201236 -0.579954 -0.530828
C -0.001026 -1.295847 -0.545287
C -1.221398 -0.615635 -0.466875
C -1.240076 0.781046 -0.379227
H -0.051502 2.578197 -0.295989
H 2.113885 1.370596 -0.425910
H 2.146259 -1.106454 -0.585914
H 0.013113 -2.376935 -0.612861
H -2.152107 -1.169760 -0.472446
H -2.184250 1.307292 -0.312447
O 0.066650 0.168253 2.832692
C 0.093368 -1.236539 3.073676
H 0.039393 0.309660 1.877253
H -0.799336 -1.731543 2.683798
H 0.123668 -1.363055 4.152641
H 0.978240 -1.705779 2.636901
END
basis "ao basis" spherical
C library aug-cc-pVDZ
H library aug-cc-pVDZ
O library aug-cc-pVDZ
bqC library C aug-cc-pVDZ
bqH library H aug-cc-pVDZ
bqO library O aug-cc-pVDZ
end
SCF
DIRECT
THRESH 1.0E-8
END
ccsd
freeze core atomic
thresh 1d-8
maxiter 35
end
Varying stack, heap, or global memory and ARMCI_DEFAULT_SHMMAX does not really change anything (if I set them too low, a different error occurs instead). Setting MA_USE_ARMCI_MEM to yes or no has no effect either.
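To illustrate the kind of variation I mean (the specific values here are only examples, not an exhaustive list):

# in the submit script; illustrative values, several combinations were tried
export ARMCI_DEFAULT_SHMMAX=8192     # also e.g. 2048 and 16384
export MA_USE_ARMCI_MEM=YES          # and runs with NO, or with the line removed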
The cluster on which I am trying to run the program has 12 cores/node, 4 GB of memory per core, and Mellanox InfiniBand. The job scheduler allows jobs to use 45 GB of memory per node.
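(For reference, the MEMORY line in the input file above requests 1700 + 100 + 1700 = 3500 MB per process, so 12 processes on a node ask for 42 GB in total, which is below that 45 GB limit.)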
The memory system settings on each host are:
/etc/security/limits.conf: memlock and stack: unlimited (soft and hard)
/proc/sys/kernel/shmmax: 68719476736 (= 64 GiB)
InfiniBand registerable memory: 128 GiB/node
(PAGE_SIZE = 4096 bytes, log_num_mtt = 25, log_mtts_per_seg = 0; I did not change log_mtts_per_seg because Mellanox has advised the OpenMPI community not to change the default setting)
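As a sanity check on the 128 GiB figure, with the usual Mellanox formula (registerable memory = PAGE_SIZE * 2^log_num_mtt * 2^log_mtts_per_seg):

echo $(( 4096 * (1 << 25) * (1 << 0) ))   # 137438953472 bytes = 128 GiB

so registerable memory is well above the 48 GB of RAM per node and should not be the bottleneck.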
The software is compiled with Intel Composer XE 13 and OpenMPI 1.6.4.
I should add that the Intel compiler and OpenMPI as such run fine on the cluster.
Compilation settings/options:
source /software/Intel/composer_xe_2013.3.163/mkl/bin/mklvars.csh
setenv FC ifort
setenv CC icc
setenv NWCHEM_TOP /software/nwchem/nwchem-6.1.1
setenv NWCHEM_TARGET LINUX64
setenv TARGET LINUX64
setenv NWCHEM_MODULES all
setenv USE_MPI y
setenv USE_MPIF y
setenv MPI_LOC /software/openmpi/1.6.4/intel
setenv MPI_LIB $MPI_LOC/lib
setenv MPI_INCLUDE $MPI_LOC/include
setenv LARGE_FILES TRUE
setenv IB_HOME /usr
setenv IB_LIB /usr/lib64
setenv IB_INCLUDE /usr/include
setenv IB_LIB_NAME "-libumad -libverbs -lpthread"
setenv IBV_FORK_SAFE 1
setenv LIBMPI "-lmpi_f90 -lmpi_f77 -lmpi -ldl -lm -Wl,--export-dynamic -lrt -lnsl -lutil"
setenv HAS_BLAS y
setenv BLAS_SIZE 8
setenv BLAS_LIB "-L$MKLROOT/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm"
setenv BLASOPT "$BLAS_LIB"
setenv OMP_NUM_THREADS 1
setenv ARMCI_NETWORK OPENIB
I also applied the following patches before compiling:
Giaxyz.patch.gz, Texasmem2.patch.gz, Dstebz3.patch.gz
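(They were applied following what I understand to be the standard procedure from the NWChem download page, roughly:

cd $NWCHEM_TOP
gzip -d Giaxyz.patch.gz Texasmem2.patch.gz Dstebz3.patch.gz
patch -p0 < Giaxyz.patch
patch -p0 < Texasmem2.patch
patch -p0 < Dstebz3.patch

followed by the usual make nwchem_config and make in $NWCHEM_TOP/src.)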
The compilation completes successfully. Afterwards I ran the memory script:
cd $NWCHEM_TOP/contrib
./getmem.nwchem
Total Memory : 49419604 Kb
No. of processors : 12
Total Memory/proc : 4118300 KB = 4 GB
Executing make  LIB_DEFINES+=" -DDFLT_TOT_MEM=523210240"
/software/nwchem/nwchem-6.1.1/bin/LINUX64/depend.x -I/software/nwchem/nwchem-6.1.1/src/tools/install/include > dependencies
ifort -i8 -I/software/Intel/composer_xe_2013.3.163/mkl/include
-I/software/Intel/composer_xe_2013.3.163/mkl/mkl/include -c -i8 -g
-I. -I/software/nwchem/nwchem-6.1.1/src/include
-I/software/nwchem/nwchem-6.1.1/src/tools/install/include -DEXT_INT -DLINUX
-DLINUX64 -DPARALLEL_DIAG -DDFLT_TOT_MEM=523210240 memory_def.F
...
ldd /software/nwchem/nwchem-6.1.1/bin/LINUX64/nwchem:
linux-vdso.so.1 => (0x00007fff69bff000)
libmkl_intel_ilp64.so => /software/Intel/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_intel_ilp64.so (0x00002b619cb92000)
libmkl_sequential.so => /software/Intel/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_sequential.so (0x00002b619d2a9000)
libmkl_core.so => /software/Intel/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_core.so (0x00002b619d956000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003565600000)
libm.so.6 => /lib64/libm.so.6 (0x0000003a6ae00000)
libmpi_f90.so.1 => /software/openmpi/1.6.4/intel/lib/libmpi_f90.so.1 (0x00002b619ebcb000)
libmpi_f77.so.1 => /software/openmpi/1.6.4/intel/lib/libmpi_f77.so.1 (0x00002b619edce000)
libmpi.so.1 => /software/openmpi/1.6.4/intel/lib/libmpi.so.1 (0x00002b619f006000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003a6a200000)
librt.so.1 => /lib64/librt.so.1 (0x0000003566e00000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003a7aa00000)
libutil.so.1 => /lib64/libutil.so.1 (0x00000033c2600000)
libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x0000003a9b800000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000003a9b400000)
libc.so.6 => /lib64/libc.so.6 (0x0000003a6a600000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003a75600000)
/lib64/ld-linux-x86-64.so.2 (0x0000003a69e00000)
libifport.so.5 => /software/Intel/composer_xe_2013.3.163/compiler/lib/intel64/libifport.so.5 (0x00002b619f405000)
libifcore.so.5 => /software/Intel/composer_xe_2013.3.163/compiler/lib/intel64/libifcore.so.5 (0x00002b619f634000)
libimf.so => /software/Intel/composer_xe_2013.3.163/compiler/lib/intel64/libimf.so (0x00002b619f96a000)
libintlc.so.5 => /software/Intel/composer_xe_2013.3.163/compiler/lib/intel64/libintlc.so.5 (0x00002b619fe27000)
libsvml.so => /software/Intel/composer_xe_2013.3.163/compiler/lib/intel64/libsvml.so (0x00002b61a0075000)
libifcoremt.so.5 => /software/Intel/composer_xe_2013.3.163/compiler/lib/intel64/libifcoremt.so.5 (0x00002b61a0a3f000)
libirng.so => /software/Intel/composer_xe_2013.3.163/compiler/lib/intel64/libirng.so (0x00002b61a0da5000)
So what could be the reason for the failure? Any help would be appreciated.
Ursula