MPI-parallel plane wave build


Clicked A Few Times
Hello, all.

I've been trying for some time to get a functional build of the plane wave solvers that works in parallel via MPI, but I've so far come up empty. I've tried building on RedHat 6.0, RedHat 6.3, CentOS 6.3, and CentOS 5.8, using MPICH2, OpenMPI, GNU Fortran, and Intel compilers, yet I always see the same behavior: when run in parallel with a plane wave input, the nwchem processes just seem to 'hang'. stracing the processes shows them all polling indefinitely:

epoll_wait(4, {}, 32, 0)                = 0
epoll_wait(4, {}, 32, 0)                = 0
epoll_wait(4, {}, 32, 0)                = 0
epoll_wait(4, {}, 32, 0)                = 0
epoll_wait(4, {}, 32, 0)                = 0


NWChem run in serial with the same input runs to completion.

To eliminate as many variables as possible, I'm currently building on a stock CentOS 6.3 system with GNU compilers and OpenMPI installed (OpenMPI being the 1.5.4 build shipped with CentOS/RedHat). The system has no InfiniBand, to further simplify things.

This is the process I am using to perform the build/install:

cd nwchem-src-2012-12-01

export NWCHEM_TOP=$PWD
export NWCHEM_TARGET=LINUX64
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=yes
export MPI_LOC=/usr/lib64/openmpi
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=/usr/include/openmpi-x86_64
export LIBMPI="-pthread -m64 -lmpi_f90 -lmpi_f77 -lmpi -ldl"
export LARGE_FILE=TRUE
export NWCHEM_MODULES=all
export FC=gfortran

cd $NWCHEM_TOP/src

make realclean

make nwchem_config 2>&1 | tee make.nwchem_config.out.$(date +%Y%m%d%H%M)

make FC=gfortran 2>&1 | tee make.out.$(date +%Y%m%d%H%M)

NWCHEM_INSTALL_DIR=$HOME/nwchem/ompi/20121201

mkdir -p $NWCHEM_INSTALL_DIR/bin
mkdir -p $NWCHEM_INSTALL_DIR/data

cp $NWCHEM_TOP/bin/${NWCHEM_TARGET}/nwchem $NWCHEM_INSTALL_DIR/bin
chmod 755 $NWCHEM_INSTALL_DIR/bin/nwchem
cp -r $NWCHEM_TOP/src/basis/libraries $NWCHEM_INSTALL_DIR/data/
cp -r $NWCHEM_TOP/src/data $NWCHEM_INSTALL_DIR/
cp -r $NWCHEM_TOP/src/nwpw/libraryps $NWCHEM_INSTALL_DIR/data/

cat << _EOF_ > $NWCHEM_INSTALL_DIR/data/nwchemrc
nwchem_basis_library $NWCHEM_INSTALL_DIR/data/libraries/
nwchem_nwpw_library $NWCHEM_INSTALL_DIR/data/libraryps/
ffield amber
amber_1 $NWCHEM_INSTALL_DIR/data/amber_s/
amber_2 $NWCHEM_INSTALL_DIR/data/amber_q/
amber_3 $NWCHEM_INSTALL_DIR/data/amber_x/
amber_4 $NWCHEM_INSTALL_DIR/data/amber_u/
spce    $NWCHEM_INSTALL_DIR/data/solvents/spce.rst
charmm_s $NWCHEM_INSTALL_DIR/data/charmm_s/
charmm_x $NWCHEM_INSTALL_DIR/data/charmm_x/
_EOF_

ln -s $NWCHEM_INSTALL_DIR/data/nwchemrc $HOME/.nwchemrc

export LD_LIBRARY_PATH=${MPI_LIB}:${LD_LIBRARY_PATH}
export NWCHEM=$HOME/nwchem/ompi/20121201/bin/nwchem


My directory structure for the run is like so:
.
./scratch
./output
./output/S2
./output/S2/known
./output/S2/known/S2-example1.nwout
./output/S2/np1
./output/S2/np1/S2-example1.out
./output/S2/np2
./output/S2/np2/S2-example1.out
./perm
./input
./input/S2-example1.nw


I am using a simple plane wave case from the tutorial in the NWChem wiki:
echo
title "total energy of s2-dimer LDA/30Ry with PSPW method"
scratch_dir   ./scratch
permanent_dir ./perm
start s2-pspw-energy
geometry
S 0.0 0.0 0.0
S 0.0 0.0 1.88
end
nwpw
  simulation_cell
    SC 20.0
  end
  cutoff 15.0

  mult 3
  xc lda
  lmbfgs
end
task pspw energy


NWChem is executed like so:
gabe@centos6.3 [~/nwchem/pw-examples] % mpirun -np 2 $NWCHEM input/S2-example1.nw 2>&1 | tee output/S2/np2/S2-example1.out


ldd of the nwchem binary:
gabe@centos6.3 [~/nwchem/pw-examples] % ldd $NWCHEM
        linux-vdso.so.1 =>  (0x00007fff56c7c000)
        libmpi_f90.so.1 => /usr/lib64/openmpi/lib/libmpi_f90.so.1 (0x00007fa22b3fd000)
        libmpi_f77.so.1 => /usr/lib64/openmpi/lib/libmpi_f77.so.1 (0x00007fa22b1c9000)
        libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x0000003f4da00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003f4c200000)
        libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00007fa22aebd000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003f4be00000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003f55e00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003f4c600000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003f4ba00000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003f5c600000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003f5ae00000)
        libltdl.so.7 => /usr/lib64/libltdl.so.7 (0x0000003f5a200000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003f4b600000)


Full disclosure: I am not a chemist, but an HPC administrator trying to get this working on behalf of one of my users, so I apologize in advance for my ignorance regarding the science in play.

I guess my ultimate questions are:

1) Should I even expect the plane wave solvers to work in parallel?

2) Has anyone gotten NWChem 6.x pspw/nwpw working in parallel via MPI recently?

3) If the answer to 2) is yes, how?

Any help would be greatly appreciated. Thanks in advance,

Gabe

Forum Vet
Answers
1) yes, all parts of NWChem are parallel and scalable.

2) yes, NWChem plane wave runs in parallel on many platforms.

3) The information is incomplete, but let me try:

a) Looks like you're running 64-bit. You may want to try and compile without export USE_MPIF4=yes

b) You say it hangs, but where does it hang? Having some output that tells us where it hangs would be helpful. Does it hang at startup or somewhere during the calculation?
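If it gets past startup, one quick way to narrow this down is to attach a debugger to one of the spinning nwchem processes and grab a backtrace. A rough sketch (the PID 12345 below is just a placeholder for whatever ps reports for a hung rank, and symbols will be limited unless the build keeps debug info):

# list the nwchem ranks and pick one PID (12345 here is a placeholder)
ps -C nwchem -o pid,args
# attach, dump backtraces from every thread, then exit/detach automatically
gdb -batch -p 12345 -ex "thread apply all bt"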

Bert

Clicked A Few Times
Quote:Bert Jan 7th 8:49 pm
1) yes, all parts of NWChem are parallel and scalable.

2) yes, NWChem plane wave runs in parallel on many platforms.

3) The information is incomplete, but let me try:

a) Looks like you're running 64-bit. You may want to try and compile without export USE_MPIF4=yes

b) You say it hangs, but where does it hang? Having some output that tells us where it hangs would be helpful. Does it hang at startup or somewhere during the calculation?

Bert


I appreciate the reply, Bert. I will try it without USE_MPIF4=yes. The run hangs after generating S.vpp, though it does produce a number of files first:

gabe@centos6.3 [~/nwchem/pw-examples] % find scratch perm -ls
3426406    4 drwx------   2 gabe     gabe         4096 Jan  7 16:09 scratch
3426421    4 -rw-------   1 gabe     gabe           72 Jan  7 16:09 scratch/s2-pspw-energy.b^-1
3426419    4 -rw-------   1 gabe     gabe           72 Jan  7 16:09 scratch/s2-pspw-energy.b
3426420    4 -rw-------   1 gabe     gabe           32 Jan  7 16:09 scratch/s2-pspw-energy.p
3426415    4 -rw-------   1 gabe     gabe           32 Jan  7 16:09 scratch/s2-pspw-energy.c
3426418    4 -rw-------   1 gabe     gabe           32 Jan  7 16:09 scratch/s2-pspw-energy.zmat
3426407    4 drwx------   2 gabe     gabe         4096 Jan  7 16:09 perm
3426414   92 -rw-------   1 gabe     gabe        91601 Jan  7 16:09 perm/s2-pspw-energy.db
3426424  156 -rw-------   1 gabe     gabe       156209 Jan  7 16:09 perm/S.psp
3426422 2540 -rw-------   1 gabe     gabe      2600426 Jan  7 16:09 perm/S.vpp


Also perhaps worth mentioning is that a 512MB shared memory segment is created, which I notice does not happen when nwchem is run in serial with this input:
gabe@centos6.3 [~/nwchem/pw-examples/output] % ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00000000 131072     gabe       600        393216     2          dest         
0x00000000 163841     gabe       600        393216     2          dest         
0x00000000 196610     gabe       600        393216     2          dest         
0x00000000 229379     gabe       600        393216     2          dest         
0x00000000 425988     gabe       600        393216     2          dest         
0x00000000 884741     gabe       600        393216     2          dest         
0x00000000 917510     gabe       600        393216     2          dest         
0x00000000 1114119    gabe       600        536870912  2          dest
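As an aside for anyone hitting the same thing: if the ranks of a hung run are killed and ipcs still lists their segments afterwards, those may need removing by hand; a sketch, using the shmid from the listing above:

# remove a leftover shared memory segment by its shmid (1114119 taken from the ipcs output above)
ipcrm -m 1114119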

 
Here's the complete output:

gabe@centos6.3 [~/nwchem/pw-examples] % mpirun -np 2 $NWCHEM input/S2-example1.nw 2>&1 | tee output/S2/np2/S2-example1.out
 argument  1 = input/S2-example1.nw



============================== echo of input deck ==============================
echo
title "total energy of s2-dimer LDA/30Ry with PSPW method"
scratch_dir   ./scratch
permanent_dir ./perm
start s2-pspw-energy
geometry
S 0.0 0.0 0.0
S 0.0 0.0 1.88
end
nwpw
  simulation_cell
    SC 20.0
  end
  cutoff 15.0

  mult 3
  xc lda
  lmbfgs
end
task pspw energy
================================================================================


                                         
                                         


             Northwest Computational Chemistry Package (NWChem) 6.1.1
             --------------------------------------------------------


                    Environmental Molecular Sciences Laboratory
                       Pacific Northwest National Laboratory
                                Richland, WA 99352

                              Copyright (c) 1994-2012
                       Pacific Northwest National Laboratory
                            Battelle Memorial Institute

             NWChem is an open-source computational chemistry package
                        distributed under the terms of the
                      Educational Community License (ECL) 2.0
             A copy of the license is included with this distribution
                              in the LICENSE.TXT file

                                  ACKNOWLEDGMENT
                                  --------------

            This software and its documentation were developed at the
            EMSL at Pacific Northwest National Laboratory, a multiprogram
            national laboratory, operated for the U.S. Department of Energy
            by Battelle under Contract Number DE-AC05-76RL01830. Support
            for this work was provided by the Department of Energy Office
            of Biological and Environmental Research, Office of Basic
            Energy Sciences, and the Office of Advanced Scientific Computing.


           Job information
           ---------------

    hostname        = centos6.3
    program         = /home/gabe/nwchem/ompi/20121201/bin/nwchem
    date            = Mon Jan  7 16:09:46 2013

    compiled        = Mon_Jan_07_15:53:25_2013
    source          = /home/gabe/nwchem/ompi/build/nwchem-src-2012-12-01
    nwchem branch   = Development
    nwchem revision = 23203
    ga revision     = 10141
    input           = input/S2-example1.nw
    prefix          = s2-pspw-energy.
    data base       = ./perm/s2-pspw-energy.db
    status          = startup
    nproc           =        2
    time left       =     -1s



           Memory information
           ------------------

    heap     =   13107201 doubles =    100.0 Mbytes
    stack    =   13107201 doubles =    100.0 Mbytes
    global   =   26214400 doubles =    200.0 Mbytes (distinct from heap & stack)
    total    =   52428802 doubles =    400.0 Mbytes
    verify   = yes
    hardfail = no 


           Directory information
           ---------------------

  0 permanent = ./perm
  0 scratch   = ./scratch




                                NWChem Input Module
                                -------------------


                total energy of s2-dimer LDA/30Ry with PSPW method
                --------------------------------------------------

 Scaling coordinates for geometry "geometry" by  1.889725989
 (inverse scale =  0.529177249)

 ORDER OF PRIMARY AXIS IS BEING SET TO 4
 D4H symmetry detected

          ------
          auto-z
          ------


                             Geometry "geometry" -> ""
                             -------------------------

 Output coordinates in angstroms (scale by  1.889725989 to convert to a.u.)

  No.       Tag          Charge          X              Y              Z
 ---- ---------------- ---------- -------------- -------------- --------------
    1 S                   16.0000     0.00000000     0.00000000    -0.94000000
    2 S                   16.0000     0.00000000     0.00000000     0.94000000

      Atomic Mass 
      ----------- 

      S                 31.972070


 Effective nuclear repulsion energy (a.u.)      72.0581785872

            Nuclear Dipole moment (a.u.) 
            ----------------------------
        X                 Y               Z
 ---------------- ---------------- ----------------
     0.0000000000     0.0000000000     0.0000000000

      Symmetry information
      --------------------

 Group name             D4h       
 Group number             28
 Group order              16
 No. of unique centers     1

      Symmetry unique atoms

     1



                                Z-matrix (autoz)
                                -------- 

 Units are Angstrom for bonds and degrees for angles

      Type          Name      I     J     K     L     M      Value
      ----------- --------  ----- ----- ----- ----- ----- ----------
    1 Stretch                  1     2                       1.88000


            XYZ format geometry
            -------------------
     2
 geometry
 S                     0.00000000     0.00000000    -0.94000000
 S                     0.00000000     0.00000000     0.94000000

 ==============================================================================
                                internuclear distances
 ------------------------------------------------------------------------------
       center one      |      center two      | atomic units |  angstroms
 ------------------------------------------------------------------------------
    2 S                |   1 S                |     3.55268  |     1.88000
 ------------------------------------------------------------------------------
                         number of included internuclear distances:          1
 ==============================================================================



          ****************************************************
          *                                                  *
          *               NWPW PSPW Calculation              *
          *                                                  *
          *  [ (Grassman/Stiefel manifold implementation) ]  *
          *                                                  *
          *      [ NorthWest Chemistry implementation ]      *
          *                                                  *
          *            version #5.10   06/12/02              *
          *                                                  *
          *    This code was developed by Eric J. Bylaska,   *
          *    and was based upon algorithms and code        *
          *    developed by the group of Prof. John H. Weare *
          *                                                  *
          ****************************************************
     >>>  JOB STARTED       AT Mon Jan  7 16:09:46 2013  <<<
          ================ input data ========================
  library name resolved from: compiled reference
  NWCHEM_NWPW_LIBRARY set to: </home/gabe/nwchem/ompi/build/nwchem-src-2012-12-01/src/nwpw/libraryps/>
 Generating 1d pseudopotential for S   

 Generated formatted_filename: ./perm/S.vpp

Clicked A Few Times
I unset USE_MPIF4 and rebuilt. I still see the same behavior.

Gabe

Gets Around
Try setting

setenv USE_MPIF4 y

instead of
setenv USE_MPIF4 yes

Clicked A Few Times
Quote:Bylaska Jan 8th 5:04 pm
Try setting

setenv USE_MPIF4 y

instead of
setenv USE_MPIF4 yes


I was actually setting USE_MPIF4=y; the 'yes' was a typo in my original post. I appreciate the reply, however.

Fortunately, after picking the brain of a colleague at another site and poring over yet more forum posts, I came up with a winning combination late yesterday by adding the following to the build environment:
ARMCI_NETWORK=SPAWN
MSG_COMMS=MPI
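
For anyone following along, the rebuild was essentially the recipe from my first post with those two exports added before configuring; a condensed sketch (same paths and environment as above):

export ARMCI_NETWORK=SPAWN
export MSG_COMMS=MPI
cd $NWCHEM_TOP/src
make realclean
make nwchem_config 2>&1 | tee make.nwchem_config.out.$(date +%Y%m%d%H%M)
make FC=gfortran 2>&1 | tee make.out.$(date +%Y%m%d%H%M)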


I've tested my plane wave examples up to 24 processes with success and will now turn things over to my users for verification.

Thanks for the help!

Forum Vet
SPAWN
To use the MPI spawn approach you would have to set ARMCI_NETWORK=MPI-SPAWN.

You got it to compile, but it built something other than what you intended. It actually built with the tcgmsg/MPI network.
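
In other words, something like this before redoing the realclean/config/build steps (a sketch only; I have not run it against your exact setup):

export ARMCI_NETWORK=MPI-SPAWN   # instead of ARMCI_NETWORK=SPAWN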

Bert

