MPI-parallel plane wave build


Hello, all.

I've been trying for some time to get a functional build of the plane wave solvers that will run in parallel via MPI, but so far I've come up empty. I've tried building on RedHat 6.0, RedHat 6.3, CentOS 6.3, and CentOS 5.8, using MPICH2, OpenMPI, GNU Fortran, and Intel compilers, yet I always see the same behavior: when run in parallel with a plane wave input, the nwchem processes simply hang. Attaching strace to the processes shows them all polling indefinitely:

epoll_wait(4, {}, 32, 0)                = 0
epoll_wait(4, {}, 32, 0)                = 0
epoll_wait(4, {}, 32, 0)                = 0
epoll_wait(4, {}, 32, 0)                = 0
epoll_wait(4, {}, 32, 0)                = 0
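
For reference, that strace output comes from attaching to one of the running nwchem ranks, roughly like this (the pgrep filter here is just illustrative):

strace -p "$(pgrep -o nwchem)"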


NWChem run serially with the same input completes without issue.

To eliminate as many variables as possible, I'm currently building on a stock CentOS 6.3 system with GNU compilers and OpenMPI installed (OpenMPI being version 1.5.4 provided with CentOS/RedHat). The system has no Infiniband, to further simplify things.
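
For reference, the toolchain can be confirmed with the usual version checks (this assumes the wrappers from the stock openmpi-x86_64 package are on the PATH, e.g. via 'module load openmpi-x86_64'):

gfortran --version
mpif90 --showme     # shows which compiler and flags the OpenMPI Fortran wrapper would use
mpirun --version    # should report the Open MPI 1.5.4 mentioned above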

This is the process I am using to perform the build/install:

cd nwchem-src-2012-12-01

export NWCHEM_TOP=$PWD
export NWCHEM_TARGET=LINUX64
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=yes
export MPI_LOC=/usr/lib64/openmpi
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=/usr/include/openmpi-x86_64
export LIBMPI="-pthread -m64 -lmpi_f90 -lmpi_f77 -lmpi -ldl"
export LARGE_FILE=TRUE
export NWCHEM_MODULES=all
export FC=gfortran

cd $NWCHEM_TOP/src

make realclean

make nwchem_config 2>&1 | tee make.nwchem_config.out.$(date +%Y%m%d%H%M)

make FC=gfortran 2>&1 | tee make.out.$(date +%Y%m%d%H%M)

NWCHEM_INSTALL_DIR=$HOME/nwchem/ompi/20121201

mkdir -p $NWCHEM_INSTALL_DIR/bin
mkdir -p $NWCHEM_INSTALL_DIR/data

cp $NWCHEM_TOP/bin/${NWCHEM_TARGET}/nwchem $NWCHEM_INSTALL_DIR/bin
chmod 755 $NWCHEM_INSTALL_DIR/bin/nwchem
cp -r $NWCHEM_TOP/src/basis/libraries $NWCHEM_INSTALL_DIR/data/
cp -r $NWCHEM_TOP/src/data $NWCHEM_INSTALL_DIR/
cp -r $NWCHEM_TOP/src/nwpw/libraryps $NWCHEM_INSTALL_DIR/data/

cat << _EOF_ > $NWCHEM_INSTALL_DIR/data/nwchemrc
nwchem_basis_library $NWCHEM_INSTALL_DIR/data/libraries/
nwchem_nwpw_library $NWCHEM_INSTALL_DIR/data/libraryps/
ffield amber
amber_1 $NWCHEM_INSTALL_DIR/data/amber_s/
amber_2 $NWCHEM_INSTALL_DIR/data/amber_q/
amber_3 $NWCHEM_INSTALL_DIR/data/amber_x/
amber_4 $NWCHEM_INSTALL_DIR/data/amber_u/
spce    $NWCHEM_INSTALL_DIR/data/solvents/spce.rst
charmm_s $NWCHEM_INSTALL_DIR/data/charmm_s/
charmm_x $NWCHEM_INSTALL_DIR/data/charmm_x/
_EOF_

ln -s $NWCHEM_INSTALL_DIR/data/nwchemrc $HOME/.nwchemrc

export LD_LIBRARY_PATH=${MPI_LIB}:${LD_LIBRARY_PATH}
export NWCHEM=$HOME/nwchem/ompi/20121201/bin/nwchem
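
To double-check that plain mpirun is usable outside of NWChem, a trivial smoke test (no NWChem involved) looks like:

mpirun -np 2 hostname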


My directory structure for the run is like so:
.
./scratch
./output
./output/S2
./output/S2/known
./output/S2/known/S2-example1.nwout
./output/S2/np1
./output/S2/np1/S2-example1.out
./output/S2/np2
./output/S2/np2/S2-example1.out
./perm
./input
./input/S2-example1.nw


I am using a simple plane wave case from the tutorial in the NWChem wiki:
echo
title "total energy of s2-dimer LDA/30Ry with PSPW method"
scratch_dir   ./scratch
permanent_dir ./perm
start s2-pspw-energy
geometry
S 0.0 0.0 0.0
S 0.0 0.0 1.88
end
nwpw
  simulation_cell
    SC 20.0
  end
  cutoff 15.0

  mult 3
  xc lda
  lmbfgs
end
task pspw energy


NWChem is executed like so:
gabe@centos6.3 [~/nwchem/pw-examples] % mpirun -np 2 $NWCHEM input/S2-example1.nw 2>&1 | tee output/S2/np2/S2-example1.out
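
For comparison, the serial run mentioned above (which completes) is just the same binary invoked without mpirun:

gabe@centos6.3 [~/nwchem/pw-examples] % $NWCHEM input/S2-example1.nw 2>&1 | tee output/S2/np1/S2-example1.out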


ldd of the nwchem binary:
gabe@centos6.3 [~/nwchem/pw-examples] % ldd $NWCHEM
        linux-vdso.so.1 =>  (0x00007fff56c7c000)
        libmpi_f90.so.1 => /usr/lib64/openmpi/lib/libmpi_f90.so.1 (0x00007fa22b3fd000)
        libmpi_f77.so.1 => /usr/lib64/openmpi/lib/libmpi_f77.so.1 (0x00007fa22b1c9000)
        libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x0000003f4da00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003f4c200000)
        libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00007fa22aebd000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003f4be00000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003f55e00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003f4c600000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003f4ba00000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003f5c600000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003f5ae00000)
        libltdl.so.7 => /usr/lib64/libltdl.so.7 (0x0000003f5a200000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003f4b600000)


Full disclosure: I am not a chemist, but an HPC administrator trying to get this working on behalf of one of my users, so I apologize in advance for my ignorance regarding the science in play.

I guess my ultimate questions are:

1) Should I even expect the plane wave solvers to work in parallel?

2) Has anyone gotten NWChem 6.x pspw/nwpw working in parallel via MPI recently?

3) If so, how?

Any help would be greatly appreciated. Thanks in advance,

Gabe