6.1.1 MPI build runs great, but only on 1 node


Just Got Here

Here's what I did to build it:

export NWCHEM_TOP=$PWD
export NWCHEM_TARGET=LINUX64
export INSTALL_PREFIX=/opt/nwchem/6.1.1
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
export MPI_LIB=/opt/openmpi/1.4.3/lib
export MPI_INCLUDE=/opt/openmpi/1.4.3/include
export FC=gfortran
export CC=gcc
cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES=all
make
mkdir -p $INSTALL_PREFIX
mkdir -p $INSTALL_PREFIX/bin
mkdir -p $INSTALL_PREFIX/data
cp $NWCHEM_TOP/bin/${NWCHEM_TARGET}/nwchem $INSTALL_PREFIX/bin
chmod 755 $INSTALL_PREFIX/bin/nwchem
cp -r $NWCHEM_TOP/src/basis/libraries $INSTALL_PREFIX/data
cp -r $NWCHEM_TOP/src/data $INSTALL_PREFIX
cp -r $NWCHEM_TOP/src/nwpw/libraryps $INSTALL_PREFIX/data

Here's how I run it (using PBS Professional 11.2):

  1. !/bin/bash
  2. PBS -N nwchem
  3. PBS -l select=2:ncpus=8:mpiprocs=8:mem=8gb,walltime=00:30:00
  4. PBS -j oe
mpiexec -n 16 nwchem formaldehyde.scf.nwchem > formaldehyde.scf.out

But all 16 processes appear on only one of the 2 nodes allocated to this job. If I run on just 1 node, everything looks great, but requesting more than 1 node causes all of the processes to double up on the "master" node.
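One hypothetical sanity check (not something from the original post): confirm that PBS really handed both hosts to the job by inspecting $PBS_NODEFILE. With select=2:ncpus=8:mpiprocs=8 it should list 16 slots spread over 2 unique hosts.

```shell
# Sketch: count slots and unique hosts in the PBS nodefile.
# (Falls back to nodefile.txt when run outside a PBS job.)
NODEFILE=${PBS_NODEFILE:-nodefile.txt}
echo "total slots: $(grep -c '' "$NODEFILE")"
echo "unique hosts: $(sort -u "$NODEFILE" | grep -c '')"
```

If the file shows both hosts, the allocation is fine and the problem is in how mpiexec maps ranks to hosts.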

Any ideas/comments/suggestions?

Thanks a lot!

Just Got Here
Oops! My PBS job script was autoformatted when I submitted it. It should look like this:

#!/bin/bash
#PBS -N nwchem
#PBS -l select=2:ncpus=8:mpiprocs=8:mem=8gb,walltime=00:30:00
#PBS -j oe

mpiexec -n 16 nwchem formaldehyde.scf.nwchem > formaldehyde.scf.out

Just Got Here
I posted this in the "Compiling NWChem" section because I suspect that this problem is associated with the way I built my executable.

Forum Vet
This is not a build issue as far as I can see. It is the mpiexec command that starts all 16 nwchem processes on one node; NWChem itself has nothing to do with that. You may want to look at the mpiexec manual. For example, adding "-npernode 8" might give you what you need. Alternatively, you may want to use mpirun.
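As a sketch of that suggestion (flag name per the Open MPI orterun(1) man page; the input file name is taken from the post above), the command line would be built like this:

```shell
# Sketch of the suggested fix: cap ranks per node so Open MPI
# spreads 16 processes across the 2 allocated hosts, 8 each.
NP=16
PER_NODE=8
CMD="mpiexec -n $NP -npernode $PER_NODE nwchem formaldehyde.scf.nwchem"
echo "$CMD"
```

With "-npernode 8", Open MPI places at most 8 ranks on each host before moving to the next one in the host list.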

Bert


Just Got Here
Thanks Bert. Yeah, I'm getting the impression that I did build NWChem successfully, and that I'm just having some trouble with OpenMPI (I'm more accustomed to MPICH2).


I added "-hostfile" and "-npernode" to my command (mpiexec is just a synonym for mpirun; both are symbolic links to orterun):

mpiexec -n 16 -hostfile $PBS_NODEFILE -npernode 8 nwchem n2.mp2.ccsd.nwchem > n2.mp2.ccsd.out

Sorry for posting in the "Compiling" section. Perhaps this thread should be moved to the "Running" section, if that's possible.

Thanks so much for your help!

Gets Around
Hi,

It seems this thread needs reopening.
NWChem 6.1.1 does not run across the nodes on my system either; NWChem 6.0 runs fine.
Version 6.1.1 (and also the initial 6.1 release), when run across the nodes, crashes with:

argument 1 = ../nwchem.nw
-10000:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10000 hostname:d071.dcsc.fysik.dtu.dk pid:20939):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_AcceptSockAll():673 cond:0
0:armci_rcv_data: read failed: -1
(rank:0 hostname:d071.dcsc.fysik.dtu.dk pid:20936):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/dataserv.c:armci_ReadFromDirect():439 cond:0
-10002:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10002 hostname:d031.dcsc.fysik.dtu.dk pid:22561):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_AcceptSockAll():673 cond:0
2:Child process terminated prematurely, status=: 256
(rank:2 hostname:d031.dcsc.fysik.dtu.dk pid:22558):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigChldHandler():178 cond:0

The http://nwchemgit.github.io/images/Nwchem-6.1.1-src.2012-06-27.tar.gz was built against OpenMPI 1.3.3 with Torque support, with the following script (irrelevant parts of the filesystem paths are replaced by ...) on CentOS 5, x86_64:

export NWCHEM_TOP=/.../nwchem-6.1.1-src
export NWCHEM_TARGET=LINUX64
export CC=gcc
export FC=gfortran
export LD_LIBRARY_PATH=/.../lib64
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPIEXEC=/.../bin/mpiexec
export MPI_LIB=/.../lib64
export MPI_INCLUDE=/.../include/
export LIBMPI='-L/.../lib64 -lmpi -lmpi_f90 -lmpi_f77'
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export TCGRSH=ssh
export PYTHONHOME=/usr
export PYTHONVERSION=2.4
export PYTHONLIBTYPE=a
export USE_PYTHON64=y
export HAS_BLAS=yes
export BLASOPT="-L/usr/lib64 -lblas -llapack"
make nwchem_config NWCHEM_MODULES="all python" 2>&1 | tee make_nwchem_config.log
make 64_to_32 2>&1 | tee make_64_to_32.log
make USE_64TO32=y 2>&1 | tee make.log

I run the following example (with mpiexec `which nwchem` nwchem.nw):

geometry noautoz noautosym
O 0.0 0.0 1.245956
O 0.0 0.0 0.0
end
basis spherical
* library cc-pvdz
end

dft
mult 3
xc xpbe96 cpbe96
smear 0.0
direct
noio
end

task dft energy

I have also tried specifying the PBS_NODEFILE explicitly with --hostfile ${PBS_NODEFILE}.
On the nodes, I see just one nwchem process per node running at 100% CPU; the other instances sit at 0% CPU load.

Forum Vet
Marcindulak
Could you please post the following files:
$NWCHEM_TOP/src/tools/build/config/makefile.h
$NWCHEM_TOP/src/tools/build/armci/config/makefile.h

Please send the output of the following command, too
mpiexec -V

It would be useful to see the full error/output file from NWChem,
with -v option passed to mpiexec

Gets Around
This time I ran with:
mpiexec -wdir `pwd` --tmpdir `pwd` --debug-daemons --verbose `which nwchem` nwchem.nw

The resulting files are available:
http://dcwww.camd.dtu.dk/~dulak/nwchem-6.1.1-build_config.log
http://dcwww.camd.dtu.dk/~dulak/nwchem-6.1.1-build_armci_config.log
http://dcwww.camd.dtu.dk/~dulak/nwchem-6.1.1.err
http://dcwww.camd.dtu.dk/~dulak/nwchem-6.1.1.out
http://dcwww.camd.dtu.dk/~dulak/ompi_info

I would also like to see the forum rules clearly describe which characters can be used when posting, and where: see
http://nwchemgit.github.io/Special_AWCforum/st/id338/I_can%27t_post_in_the_compili...
http://nwchemgit.github.io/Special_AWCforum/st/id493/%22The_specified_URL_cannot_b...
The forum should also be self-contained, so that one does not need to create external links in order to provide the requested files.

Forum Vet
What Linux Distribution?
Marcindulak
What Linux distribution & version are you using?

Forum Vet
BLAS size
Marcindulak
The only problem I have spotted so far (and it should not explain the inter-node problem) is that you are using BLAS (and maybe LAPACK)
from /usr/lib64. My guess is that that library uses 32-bit integers. If this is indeed the case, you need to tell the tools
configuration by setting the following environment variables:
BLAS_SIZE=4
LAPACK_SIZE=4
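For illustration, assuming the /usr/lib64 libraries are indeed stock 32-bit-integer builds, the relevant part of the build environment might look like this (a sketch combining the advice above with the BLASOPT line from the build script, not a verified recipe):

```shell
# Sketch: declare 32-bit (INTEGER*4) external BLAS/LAPACK to the
# tools configure step; BLASOPT carries the actual link line.
export BLAS_SIZE=4
export LAPACK_SIZE=4
export BLASOPT="-L/usr/lib64 -lblas -llapack"
```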

Forum Vet
Marcindulak
I have managed to reproduce your problem.
However, I do not see any difference with 6.0 ... can you confirm that 6.0 works fine when using ARMCI_NETWORK=SOCKETS and using OpenMPI?

Cheers, Edo

Gets Around
I have compiled 6.1.1 with {BLAS,LAPACK}_SIZE=4 without solving the MPI problem; the only visible change was getting --with-blas4="-L/usr/lib64 -lblas -llapack" in the make stages. As a side comment, shouldn't the LAPACK_LIB variable be set too, and not only BLASOPT?
I see the LAPACK_LIB variable is not mentioned at http://nwchemgit.github.io/index.php/Compiling_NWChem
Setting only BLASOPT makes the output look like:
--without-lapack --with-blas8=-L/usr/lib64 -lblas -llapack

The 6.0 version I use is this one:
http://download.opensuse.org/repositories/home:/marcindulak/CentOS_CentOS-5/
with the log available:
https://build.opensuse.org/package/live_build_log?arch=x86_64&package=nwchem&proje...
It does not look like NWChem 6.0 prints anything about ARMCI_NETWORK, and I haven't set it.

My impression is that the crashes across the nodes started around the time I had to set
USE_MPIF4=y in order to compile NWChem.

Forum Vet
Marcindulak,
Could you please send me the full stderr/stdout of a successful multinode run with 6.0?
Could you please add the following options to mpiexec/mpirun/orterun
--mca btl_base_verbose 50 --mca btl_openib_verbose 1
Thanks, Edo

Forum Vet
Please ignore previous post
Marcindulak,
Please ignore the previous post since I have managed to reproduce your findings using the 6.0 and 6.1.1 binaries from your RPMs
(it took me a while to figure out the right OpenMPI orterun option to get things working, however ...).
More later, Edo

Forum Vet
How to revert 6.1 back to the 6.0 behavior for the tools directory
Marcindulak,
The following recipe might work to fix your 6.1 issues (it worked for me).
It allows you to link with the same parallel tools used in 6.0.

cd $NWCHEM_TOP/src/tools
make FC=gfortran GA_DIR=ga-4-3 OLD_GA=y clean
make FC=gfortran GA_DIR=ga-4-3 OLD_GA=y
cd ..
make FC=gfortran link

Cheers, Edo

