NWCHEM fails to run on multiple nodes


Hi,
I’m running NWCHEM on a large cluster using Qlogic IB; my install script is below. The build completes successfully, and NWCHEM runs fine on 1 node using 16 cores (the NWCHEM input file and run script are shown below). It also works across 4 nodes using all the cores (4 nodes * 16 cores = 64 MPI processes). However, when I go to 5 or more nodes, it fails with the following in the output file:
 argument  1 = h2o.nw
112:armci_ListenSockAll: listen failed: 0
32:armci_ListenSockAll: listen failed: 0
(rank:112 hostname:node31 pid:117359):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_ListenSockAll():614 cond:0
(rank:32 hostname:node26 pid:47485):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_ListenSockAll():614 cond:0
0:Terminate signal was sent, status=: 15
(rank:0 hostname:node24 pid:17912):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0


Could the problem be with the NWCHEM install, with memory, or with the Qlogic IB? Any help would be much appreciated.

Install script:
#!/bin/bash 
export NWCHEM_TOP=/home/nwchem-6.1.1-src
export NWCHEM_TARGET=LINUX64
 
export ARMCI_NETWORK=SPAWN
export IB_HOME=/usr
export IB_INCLUDE=/usr/include/infiniband
export IB_LIB=/usr/lib64
export IB_LIB_NAME="-lrt -lnsl -lutil -lrdmacm -libumad -libverbs -ldl -lpthread -lm -Wl,--export-dynamic -lrt -lnsl -lutil"
export MSG_COMMS=MPI
export TCGRSH=/usr/bin/ssh
 
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/opt/openmpi-1.6-intel
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lpthread -lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
 
export NWCHEM_MODULES="all"
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export LIB_DEFINES=-DDFLT_TOT_MEM=259738112
export MKLROOT=/opt/intel-12.1/mkl
export MKLPATH=$MKLROOT/lib/intel64
 
export BLASOPT="-Wl,--start-group  $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -Wl,--end-group -lpthread -lm"
 
export FC=ifort
export CC=icc

make nwchem_config
make FC=$FC CC=$CC


NWCHEM input file:
start h2o-test
title "test"
 
geometry units angstrom
 O                 -0.23615400    0.02856700    3.59321400
 H                 -0.46943300   -0.85241100    3.91668900
 H                 -1.02284300    0.57807900    3.71903800
end
 
basis
* library 6-31G
end
 
driver
maxiter 2000
end
 
dft
xc b3lyp
end
 
scratch_dir /scratch/NWCHEM
ecce_print ecce.out
 
task dft optimize


NWCHEM run script:
#!/bin/bash
#SBATCH --time=4:00:00
#SBATCH -N 4
#SBATCH --job-name h2o
 
nodes=4
cores=16
 
mpiexec -npernode $cores -n $(($cores*$nodes)) /home/nwchem-6.1.1-src/bin/LINUX64/nwchem h2o.nw > h2o.out
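
The node count appears twice in the run script above (`#SBATCH -N` and `nodes=`), so both have to be edited for the 5-node test. A minimal sketch of deriving the process count from the allocation instead, assuming the job runs under Slurm so that `SLURM_JOB_NUM_NODES` is set; the helper name `total_ranks` and the fallback of 1 node are only illustrative. It prints the launch line rather than running it, so the numbers can be checked first:

```shell
#!/bin/bash
# Sketch only: take the node count from the Slurm allocation rather than
# hardcoding it. SLURM_JOB_NUM_NODES is set by Slurm inside a job; the
# fallback of 1 is an assumption for running this outside a job.
nodes=${SLURM_JOB_NUM_NODES:-1}
cores=16

# Hypothetical helper: $1 = number of nodes, $2 = processes per node.
total_ranks() {
    echo $(( $1 * $2 ))
}

# Print the launch command instead of executing it.
echo "mpiexec -npernode $cores -n $(total_ranks "$nodes" "$cores") /home/nwchem-6.1.1-src/bin/LINUX64/nwchem h2o.nw"
```

With `#SBATCH -N 5` this would print a command with `-n 80`, which is the configuration that fails for me.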