NWCHEM fails to run on multiple nodes


Clicked A Few Times
Hi,
I’m running NWCHEM on a large cluster using Qlogic IB; my install script is below. The install completes successfully and runs fine on 1 node using 16 processors (NWCHEM input file and run script are shown below). It even works fine across 4 nodes using all the processors (4 nodes*16 processors = 64 total processors); however, when I go to 5 or more nodes it fails with the following in the output file:
 argument  1 = h2o.nw
112:armci_ListenSockAll: listen failed: 0
32:armci_ListenSockAll: listen failed: 0
(rank:112 hostname:node31 pid:117359):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_ListenSockAll():614 cond:0
(rank:32 hostname:node26 pid:47485):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_ListenSockAll():614 cond:0
0:Terminate signal was sent, status=: 15
(rank:0 hostname:node24 pid:17912):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0


Could the problem be with the NWCHEM install, with memory, or with the Qlogic IB? Any help would be much appreciated.

Install script:
#!/bin/bash 
export NWCHEM_TOP=/home/nwchem-6.1.1-src
export NWCHEM_TARGET=LINUX64
 
export ARMCI_NETWORK=SPAWN
export IB_HOME=/usr
export IB_INCLUDE=/usr/include/infiniband
export IB_LIB=/usr/lib64
export IB_LIB_NAME="-lrt -lnsl -lutil -lrdmacm -libumad -libverbs -ldl -lpthread -lm -Wl,--export-dynamic -lrt -lnsl -lutil"
export MSG_COMMS=MPI
export TCGRSH=/usr/bin/ssh
 
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/opt/openmpi-1.6-intel
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lpthread -lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
 
export NWCHEM_MODULES="all"
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export LIB_DEFINES=-DDFLT_TOT_MEM=259738112
export MKLROOT=/opt/intel-12.1/mkl
export MKLPATH=$MKLROOT/lib/intel64
 
export BLASOPT="-Wl,--start-group  $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -Wl,--end-group -lpthread -lm"
 
export FC=ifort
export CC=icc

make nwchem_config
make FC=$FC CC=$CC


NWCHEM input file:
start h2o-test
title "test"
 
geometry units angstrom
 O                 -0.23615400    0.02856700    3.59321400
 H                 -0.46943300   -0.85241100    3.91668900
 H                 -1.02284300    0.57807900    3.71903800
end
 
basis
* library 6-31G
end
 
driver
maxiter 2000
end
 
dft
xc b3lyp
end
 
scratch_dir /scratch/NWCHEM
ecce_print ecce.out
 
task dft optimize


NWCHEM run script:
#!/bin/bash
#SBATCH --time=4:00:00
#SBATCH -N 4
#SBATCH --job-name h2o
 
nodes=4
cores=16
 
mpiexec -npernode $cores -n $(($cores*$nodes)) /home/nwchem-6.1.1-src/bin/LINUX64/nwchem h2o.nw > h2o.out

Clicked A Few Times
Quote: Mef362 Jan 7th 9:59 am

I could be wrong, but I think the issue in your case is that your test system is just too small (a water molecule) to run across 5 nodes with 16 processors each! Try a larger system, 20 heavy atoms or so.

Clicked A Few Times
I have the same problem with PCBM (phenyl-C61-butyric acid methyl ester).

Clicked A Few Times
I found that I can run on more than 4 nodes if I only use 1 or 2 of the processors on each node (e.g. 24 nodes x 1 processor per node). Also, if I set ARMCI_DEFAULT_SHMMAX (e.g. to 4084) or install with a different DDFLT_TOT_MEM value, the number of nodes/processors I can use changes, but I can't seem to eliminate the problem altogether.

If I set DDFLT_TOT_MEM to 16777216, I can use 2 nodes at best. If I compile with DDFLT_TOT_MEM=259738112 (I also tried larger values), the value I obtained from running the getmem.nwchem script, I can use ~4 nodes with all the processors. The best I have been able to do is 6 nodes x 16 processors per node. I see the same problem when running DFT, HF, or MP2.

I guess it is some memory allocation problem, but I have no idea how to fix it... I want to run very large jobs on 50+ nodes.

Suggestions are welcome,
Thanks

Forum Vet
How memory allocation works in NWChem
First of all, using the DDFLT_TOT_MEM environment variable and recompiling to set memory usage does not make sense. NWChem has a "memory" input keyword that allows you to define the amount of memory used by each processor during a simulation.

The shared memory in the input for the memory keyword is the global memory, and that is associated with the setting of ARMCI_DEFAULT_SHMMAX, which should be about the same size as, or a little larger than, the amount of shared memory being used on each node.

By default, if no memory is allocated per processor in the input, NWChem will use the precompiled default. Looking at the 259738112 from Mef362 (this value is in doubles), Mef362 has 2 Gbyte available per core. By default, this is split into 25% heap, 25% stack and 50% global. So, 1 Gbyte of global (shared) memory per processor.
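
Working those numbers out explicitly:

259738112 doubles * 8 bytes/double = 2,077,904,896 bytes, i.e. about 2 Gbyte per core
25% heap   -> ~0.5 Gbyte
25% stack  -> ~0.5 Gbyte
50% global -> ~1 Gbyte  (this is the shared-memory part)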

Now let's get back to ARMCI_DEFAULT_SHMMAX. If you have X cores running on a node, and for each core you specify the shared memory to be Y (this is the global memory in the input, which is per core), your ARMCI_DEFAULT_SHMMAX should be set to X*Y, and this number should be smaller than the shmmax set in the kernel. In the current released version there is a preset maximum allowed for ARMCI_DEFAULT_SHMMAX, which is 8 Gbyte. We will address this in the next release, or if you know how to code in C I can provide you with the code to change (bert.dejong@pnl.gov).
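
For example, with 16 cores per node and 500 mb of global memory per core (the values recommended below):

16 cores * 500 Mbyte = 8000 Mbyte of shared memory per node

so ARMCI_DEFAULT_SHMMAX (which is given in Mbyte) would need to be at least 8000, and kernel.shmmax at least 8000*1024*1024 bytes.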

So, for Mef362 I would recommend:

1. In the input use:
   
memory heap 100 mb stack 500 mb global 500 mb

2. Set ARMCI_DEFAULT_SHMMAX to 8192
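
With Open MPI (which the run script above uses), one way to do this is to export the variable in the run script and have mpiexec forward it to all nodes with -x, e.g.:

export ARMCI_DEFAULT_SHMMAX=8192
mpiexec -x ARMCI_DEFAULT_SHMMAX -npernode $cores -n $(($cores*$nodes)) /home/nwchem-6.1.1-src/bin/LINUX64/nwchem h2o.nw > h2o.out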

From a hardware point of view, you also need to make sure that the system parameters allow shared memory segments that are 8 Gbyte in size.

ARMCI_DEFAULT_SHMMAX has to be less than or equal to kernel.shmmax.

For example, if the value of kernel.shmmax is 4294967296 as in the example below,
ARMCI_DEFAULT_SHMMAX can be at most 4096 (4294967296=4096*1024*1024)

$ sysctl kernel.shmmax
kernel.shmmax = 4294967296

Hence, make sure that your kernel.shmmax is at least 8192*1024*1024.
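
For example, to raise it to 8192*1024*1024 = 8589934592 bytes, your admin could run the following as root (and add the same setting to /etc/sysctl.conf to make it persistent):

sysctl -w kernel.shmmax=8589934592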

Bert

Clicked A Few Times
Thanks for the help, I really appreciate it. However, I still cannot run on more than 4 nodes with all the processors. I have set ARMCI_DEFAULT_SHMMAX to 8192 in my run script and added memory heap 100 mb stack 500 mb global 500 mb to my NWCHEM input file.

Some system info:
OS - RHEL 6
8 cores/socket (2 sockets per node)
32 GB memory per node
Infiniband 4X QDR, Fat Tree, Qlogic chipset

kernel.shmmax = 20971520000
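
For reference, 20971520000 bytes = 20000*1024*1024, so the kernel already allows shared memory segments of up to 20000 Mbyte, well above the ARMCI_DEFAULT_SHMMAX value I am setting.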

The error again:
 argument  1 = PCBM.nw
48:armci_ListenSockAll: listen failed: 0
(rank:48 hostname:node31 pid:44518):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_ListenSockAll():614 cond:0
0:Terminate signal was sent, status=: 15
(rank:0 hostname:node28 pid:106662):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0

Clicked A Few Times
Well, I just installed NWCHEM on one of our other clusters (not as powerful) using basically the same install script as shown above, and it seems to be working great! The main difference is that I used OPENIB instead of SPAWN because of the hardware/software.

Should this line be different for SPAWN:
export IB_LIB_NAME="-lrt -lnsl -lutil -lrdmacm -libumad -libverbs -ldl -lpthread -lm -Wl,--export-dynamic -lrt -lnsl -lutil"

Forum Vet
MPI-SPAWN
Just realized this: you are using SPAWN and not MPI-SPAWN. You should try MPI-SPAWN. I'm surprised it compiled with SPAWN, and it's unclear to me what kind of network it actually compiled with.

For MPI-SPAWN the IB_LIB_NAME etc. are not needed; those are only used when the OPENIB network is defined.
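
So in the install script above the change amounts to something like this (the IB_HOME/IB_INCLUDE/IB_LIB/IB_LIB_NAME exports can simply be dropped):

export ARMCI_NETWORK=MPI-SPAWN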

Bert

Clicked A Few Times
Thanks Bert. I must have messed that up after changing it a few times. However, when I recompile with MPI-SPAWN and try to run a calculation, I get an error stating that the network is down:

node46.5433can't open /dev/ipath, network down (err=26)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning. 
 
Error: Could not detect network connectivity


I guess I will have to ask my admin about this...

Thanks again

Forum Vet
When ARMCI_NETWORK is set to MPI-SPAWN, NWChem's GA/ARMCI parallel infrastructure uses MPI-2's dynamic process management to spawn additional processes for communication (communication co-processes).

Because of this you need to make sure you compile NWChem using an MPI implementation that supports dynamic process management (i.e., MPI_Comm_spawn_multiple).

Bert

Clicked A Few Times
Is there a way to determine if the version of openmpi I'm using supports dynamic process management?

Forum Vet
This would require the MPI library to have MPI_Comm_spawn_multiple. So, you could check whether this function exists in the library files (see the example below).

http://www.emsl.pnl.gov/docs/global/support.shtml#spawn
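
For example, with the Open MPI installation from your build script you could list the dynamic symbols of the MPI library, e.g.:

nm -D /opt/openmpi-1.6-intel/lib/libmpi.so | grep -i comm_spawn

If MPI_Comm_spawn and MPI_Comm_spawn_multiple show up, the library provides the dynamic process management interface.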

Clicked A Few Times
Hi Bert,

My admin asked me to check whether NWCHEM is supported on RedHat 6.2; I searched but could not find anything.

Thanks

