Running with OpenMPI on multiple nodes failing


Clicked A Few Times
Hi;

I've built NWChem with OpenMPI 1.6.5 support according to the docs. I run using mpirun. The cluster has 16-core nodes. However, once PBS schedules the jobs, all the processes (try to) run on the first node, rather than 16 per node. So even though I have, say, two nodes assigned by the scheduler, all 32 NWChem instances run on the first node. Or try to; the job runs out of memory. AFAIK I compiled NWChem properly, and have installed it in a network-accessible location.

What, likely simple, thing have I overlooked? Thanks,

Steve

Excerpt from PBS submit file:

NPROCS=`wc -l < $PBS_NODEFILE`
module load nwchem
module load openmpi
mpirun --hostfile $PBS_NODEFILE -np $NPROCS nwchem input.dat > output.dat
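
As a quick placement check (a sketch using the same hostfile and modules; hostname simply stands in for nwchem), the rank-to-node map can be printed with:

# diagnostic only: --display-map shows which node each rank is assigned to
mpirun --hostfile $PBS_NODEFILE -np $NPROCS --display-map hostname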


The script used to build NWChem is:

#!/bin/csh
setenv NWCHEM_MODULES all

setenv NWCHEM_TOP /home/admin/root/src/nwchem-6.3.revision2-src.2013-10-17
setenv NWCHEM_TARGET LINUX64

setenv LARGE_FILES TRUE
setenv LIB_DEFINES -DDFLT_TOT_MEM=134217728
setenv USE_NOFSCHECK TRUE
setenv TCGRSH /usr/bin/ssh
setenv FC ifort
setenv CC icc

setenv USE_MPI y
setenv USE_MPIF y
setenv USE_MPIF4 y
#setenv LIBMPI "-L/usr/lib64 -lmca_common_sm -lmpi_f77 -lmpi -lopen-pal -lopen-trace-format -lvt-hyb -lvt-mpi-unify -lvt -lmpi_cxx -lmpi_f90 -lompitrace -lopen-rte -lotfaux -lvt-mpi -lvt-mt"
setenv LIBMPI "-L/usr/lib64 -lmca_common_sm -lmpi_f77 -lmpi -lmpi_cxx -lmpi_f90 -lompitrace -lopen-rte -lotfaux -ldl -Wl,--export-dynamic -lnsl -lutil"

setenv MPI_BASEDIR /usr/local/openmpi/openmpi-1.6.5/intel-14.0.1
setenv MPI_INCLUDE $MPI_BASEDIR/include
setenv MPI_LIB $MPI_BASEDIR/lib

setenv IB_HOME "/usr"
setenv IB_INCLUDE "$IB_HOME/include"
setenv IB_LIB "$IB_HOME/lib64"
setenv IB_LIB_NAME "-libumad -lpthread"
setenv ARMCI_NETWORK OPENIB

module load intel/14.0.1
module load openmpi/1.6.5/intel/14.0.1

#setenv BLASOPT "-L/zhome/Apps/intel/composerxe/mkl/lib/intel64/ -lmkl_blas95_ilp64 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -lbmkl_lapack95_ilp64"

echo "build time, here we go."
printenv

cd $NWCHEM_TOP/src
#make realclean
make >& make.log2
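
After the build, a quick sanity check that the binary picked up the intended OpenMPI (the path below assumes the standard NWChem build layout under $NWCHEM_TOP):

# confirm the freshly built binary resolves the OpenMPI 1.6.5 libraries
ldd $NWCHEM_TOP/bin/LINUX64/nwchem | grep -i libmpi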

Forum Vet
Could you send us the following for a given run (ideally with the commands executed inside the PBS script)?
1) The output of the command
mpirun -V
2) The output of the command
ldd nwchem
3) The content of the file $PBS_NODEFILE
4) The output of the command
grep hostname output.dat
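
For example, all four can be collected from inside the job script along these lines (the `which nwchem` just resolves whatever the nwchem module puts on PATH):

mpirun -V                       # 1) OpenMPI version
ldd `which nwchem`              # 2) libraries the nwchem binary resolves
cat $PBS_NODEFILE               # 3) nodes PBS handed to the job
mpirun --hostfile $PBS_NODEFILE -np $NPROCS nwchem input.dat > output.dat
grep hostname output.dat        # 4) which hosts the ranks actually ran on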

Clicked A Few Times
Sure, thank you. This is copied directly from the resulting out/err files:

1
mpirun (Open MPI) 1.6.5

Report bugs to http://www.open-mpi.org/community/help/

2
linux-vdso.so.1 => (0x00007fff2ebff000)
libmca_common_sm.so.3 => /usr/local/openmpi/openmpi-1.6.5/intel-14.0.1/lib/libmca_common_sm.so.3 (0x00007f9ad8085000)
libmpi_f77.so.1 => /usr/local/openmpi/openmpi-1.6.5/intel-14.0.1/lib/libmpi_f77.so.1 (0x00007f9ad7e4e000)
libmpi.so.1 => /usr/local/openmpi/openmpi-1.6.5/intel-14.0.1/lib/libmpi.so.1 (0x00007f9ad7a4e000)
libmpi_cxx.so.1 => /usr/local/openmpi/openmpi-1.6.5/intel-14.0.1/lib/libmpi_cxx.so.1 (0x00007f9ad7832000)
libmpi_f90.so.1 => /usr/local/openmpi/openmpi-1.6.5/intel-14.0.1/lib/libmpi_f90.so.1 (0x00007f9ad762f000)
libompitrace.so.0 => /usr/local/openmpi/openmpi-1.6.5/intel-14.0.1/lib/libompitrace.so.0 (0x00007f9ad742b000)
libopen-rte.so.4 => /usr/local/openmpi/openmpi-1.6.5/intel-14.0.1/lib/libopen-rte.so.4 (0x00007f9ad70f1000)
libotfaux.so.0 => /usr/local/openmpi/openmpi-1.6.5/intel-14.0.1/lib/libotfaux.so.0 (0x00007f9ad6ee4000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003fc6c00000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003fd6c00000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003fd7c00000)
libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x0000003976200000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003fc6800000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000003975e00000)
libm.so.6 => /lib64/libm.so.6 (0x0000003fc7000000)
libc.so.6 => /lib64/libc.so.6 (0x0000003fc6400000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003fcd000000)
librt.so.1 => /lib64/librt.so.1 (0x0000003fc7400000)
libimf.so => /zhome/Apps/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libimf.so (0x00007f9ad69fc000)
libsvml.so => /zhome/Apps/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libsvml.so (0x00007f9ad5e05000)
libirng.so => /zhome/Apps/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libirng.so (0x00007f9ad5bfe000)
libintlc.so.5 => /zhome/Apps/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libintlc.so.5 (0x00007f9ad59a7000)
libcilkrts.so.5 => /zhome/Apps/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libcilkrts.so.5 (0x00007f9ad5769000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003fcd800000)
libifport.so.5 => /zhome/Apps/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libifport.so.5 (0x00007f9ad5539000)
libifcore.so.5 => /zhome/Apps/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libifcore.so.5 (0x00007f9ad51f8000)
libifcoremt.so.5 => /zhome/Apps/intel/composer_xe_2013_sp1.1.106/compiler/lib/intel64/libifcoremt.so.5 (0x00007f9ad4e8a000)
/lib64/ld-linux-x86-64.so.2 (0x0000003fc6000000)

3
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar5
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4
hbar4

4) Unfortunately I broke the previous install with a subsequent build using a new value of LIBMPI, so I am starting a new one .... I have no output.dat file anymore.

Clicked A Few Times
So I now have a clue as to what's going on: a lack of InfiniBand memory. Here are the relevant error messages, although I am unsure how to resolve the problem:

1) This is from the PBS standard output:

(rank:0 hostname:hbar11 pid:14250):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))
(rank:16 hostname:hbar9 pid:1254):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))


2) This is from the PBS standard error. Note that my attempt to increase locked memory using "ulimit -l 800000", which works in .bashrc in an interactive session, failed from within the PBS job:

/home/lusol/.bashrc: line 14: ulimit: max locked memory: cannot modify limit: Operation not permitted
mpirun (Open MPI) 1.6.5

Report bugs to http://www.open-mpi.org/community/help/
/var/spool/pbs/mom_priv/jobs/1823.hbar1.cc.lehigh.edu.SC: line 42: ulimit: max locked memory: cannot modify limit: Operation not permitted
Last System Error Message from Task 0:: Cannot allocate memory
Last System Error Message from Task 16:: Cannot allocate memory


MPI_ABORT was invoked on rank 16 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 1.

...
...
...

Stack trace terminated abnormally.
[hbar11:14249] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[hbar11:14249] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


3) Finally, this is from the NWChem output file:

more nwOutput.dat 
argument 1 = nwInput.dat
(rank:0 hostname:hbar11 pid:14250):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))
(rank:16 hostname:hbar9 pid:1254):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))
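
A quick way to see what limit the ranks actually inherit (a sketch; the echo goes in the submit script just before the mpirun line):

echo "memlock limit inside the job: `ulimit -l`"    # prints a number in kB, or "unlimited"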


Thanks for any insight.

Clicked A Few Times
Based on further searching, this is the cluster's kernel.shmmax:

sysctl kernel.shmmax
kernel.shmmax = 68719476736


Adding this to the PBS submit file and my .bashrc resulted in no change:

export ARMCI_DEFAULT_SHMMAX=6553
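
One note, in case it applies here: with OpenMPI, a variable set only in the submit script or .bashrc is not necessarily forwarded to the ranks on the other nodes; it can be passed explicitly with -x, roughly:

# same value as above; -x tells mpirun to export the variable to every remote rank
export ARMCI_DEFAULT_SHMMAX=6553
mpirun -x ARMCI_DEFAULT_SHMMAX --hostfile $PBS_NODEFILE -np $NPROCS nwchem input.dat > output.dat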

Forum Vet
It is likely that you are hitting a limitation on the maximum amount of registered memory.
The link below, from the OpenMPI FAQ, shows how to address this problem.

http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

Clicked A Few Times
Son of a gun! After days of thrashing around, it turned out to be the stupid default PBS startup script, which has this in it:

[xyzzy]# grep ulimit /etc/init.d/*
/etc/init.d/pbs: ulimit -l 262144


This is a private cluster, so I just set it to unlimited, pushed the new startup file to all the nodes, and NWChem is now running fine on two 16-core nodes.
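
Roughly how the change was rolled out (a sketch; the node-list file is site-specific, and pbs_mom has to be restarted so that newly started jobs inherit the raised limit):

sed -i 's/ulimit -l 262144/ulimit -l unlimited/' /etc/init.d/pbs
for node in `cat /root/compute_nodes`; do    # hypothetical file: one compute-node hostname per line
    scp /etc/init.d/pbs $node:/etc/init.d/pbs
    ssh $node /etc/init.d/pbs restart        # or: service pbs restart
done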

Is this a standard PBS value for locked memory, or did the vendor who shipped this cluster set that?

Many thanks for your help,
Steve

Forum Vet
Quote:Pabugeater

Is this a standard PBS value for locked memory, or did the vendor who shipped this cluster set that?



The following item on the OpenMPI FAQ page might be of help:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-user
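
One common system-wide fix (a sketch, assuming Red Hat-style nodes where PAM applies limits.conf to ssh sessions) is to raise the memlock limit for all users; jobs started through a resource-manager daemon instead inherit the daemon's own limit, which is exactly what the ulimit line in /etc/init.d/pbs above was capping:

# on each compute node, as root: allow unlimited locked memory for all users
cat >> /etc/security/limits.conf << 'EOF'
*    soft    memlock    unlimited
*    hard    memlock    unlimited
EOF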

Clicked A Few Times
Why is the CPU utilization more than 100% while NWChem is running?
Hi,
I am a new user of NWChem.
I compiled NWChem 6.3 with the Intel compiler, the MKL math library (composer_xe_2013_sp1.2.144), and openmpi-1.6.5.
But while NWChem is running, the CPU utilization is more than 100% (some processes reach 300%), which nearly brings the node down.
Why?
Please help me!
Thank you!

gvtheen

