NWChem 6.1.1 CCSD(T) parallel running


Just Got Here
Hi, I am trying to run NWChem 6.1.1 on a cluster. I compiled NWChem in my local user directory; here are the environment variables I used to compile:

export NWCHEM_TOP="/home/diego/Software/NWchem/nwchem-6.1.1"
export TARGET=LINUX64
export LARGE_FILES=TRUE
export ENABLE_COMPONENT=yes
export TCGRSH=/usr/bin/ssh
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all python"
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"

export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y

export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include/infiniband
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-libumad -libverbs -lpthread -lrt"
export ARMCI_NETWORK=OPENIB

export MKLROOT="/opt/intel/mkl"
export MKL_INCLUDE=$MKLROOT/include/intel64/ilp64

export BLAS_LIB="-L$MKLROOT/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm"
export BLASOPT="$BLAS_LIB"
export BLAS_SIZE=8
export SCALAPACK_SIZE=8
export SCALAPACK="-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export SCALAPACK_LIB="$SCALAPACK"
export USE_SCALAPACK=y

export MPI_HOME=/opt/intel/impi/4.0.3.008
export MPI_LOC=$MPI_HOME
export MPI_LIB=$MPI_LOC/lib64
export MPI_INCLUDE=$MPI_LOC/include64
export LIBMPI="-lmpigf -lmpigi -lmpi_ilp64 -lmpi"

export CXX=/opt/intel/bin/icpc
export CC=/opt/intel/bin/icc
export FC=/opt/intel/bin/ifort

export PYTHONPATH="/usr"
export PYTHONHOME="/usr"
export PYTHONVERSION="2.6"
export USE_PYTHON64=y
export PYTHONLIBTYPE=so

export MPICXX=$MPI_LOC/bin/mpiicpc
export MPICC=$MPI_LOC/bin/mpiicc
export MPIF77=$MPI_LOC/bin/mpiifort
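
For reference, with these variables exported the binary would then be built with the usual NWChem make sequence (a sketch only; the exact build commands were not part of the original post):

cd $NWCHEM_TOP/src
make nwchem_config       # picks up NWCHEM_MODULES="all python"
make FC=ifort CC=icc     # compile with the Intel compilers exported above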


input file :

start
memory global 1000 mb heap 100 mb stack 600 mb 
title "ZrB10 CCSD(T) single point"
echo 
scratch_dir /scratch/users
charge -1
geometry units angstrom
Zr          0.00001        -0.00002         0.12043
B           2.46109         0.44546        -0.10200
B           2.25583        -1.07189        -0.09994
B           1.19305        -2.20969        -0.10354
B          -0.32926        -2.46629        -0.09796
B          -1.72755        -1.82109        -0.10493
B          -2.46111        -0.44543        -0.10198
B          -2.25583         1.07193        -0.09983
B          -1.19306         2.20972        -0.10337
B           0.32924         2.46632        -0.09779
B           1.72753         1.82112        -0.10485
end
scf 
DOUBLET; UHF
THRESH 1.0e-10
TOL2E 1.0e-8
maxiter 200
end 
tce
 ccsd(t)
 maxiter 200
 freeze atomic
end 
basis
Zr library def2-tzvp 
B library def2-tzvp
end
ecp
Zr library def2-ecp
end
task tce energy


pbs submit file:

#!/bin/bash
#PBS -N ZrB10_UHF
#PBS -l nodes=10:ppn=16
#PBS -q CA
BIN=/home/diego/Software/NWchem/nwchem-6.1.1/bin/LINUX64
source /opt/intel/impi/4.0.3.008/bin/mpivars.sh
source /home/diego/Software/NWchem/vars
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/opt/intel/impi/4.0.3/intel64/lib

#ulimit -s unlimited
#ulimit -d unlimited
#ulimit -l unlimited
#ulimit -n 32767    
export ARMCI_DEFAULT_SHMMAX=8000
#export MA_USE_ARMCI_MEM=TRUE
cd $PBS_O_WORKDIR

NP=$(wc -l < $PBS_NODEFILE)

cat $PBS_NODEFILE |sort|uniq> mpd.hosts
time mpirun -f mpd.hosts -np $NP $BIN/nwchem ZrB10.nw > ZrB10.log
exit 0


Memory per processor is 2 GB of RAM (16 processors and 32 GB of RAM per node, across 10 nodes).
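
As a rough cross-check (my own arithmetic, not from the original post), the memory directive in the input file is per process, so the 16 processes on a node together request:

   # heap + stack + global per process, times 16 processes per node
   echo "$(( (100 + 600 + 1000) * 16 )) MB"    # prints 27200 MB, below the 32 GB of RAM per node
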
A few other tweaks:

kernel.shmmax = 68719476736
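
For reference (an assumption about how this setting was applied; it is not stated in the original post), such a kernel parameter is typically set by the administrators with sysctl:

   # requires root; persist it with an equivalent line in /etc/sysctl.conf
   sysctl -w kernel.shmmax=68719476736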


The error in the output file is:

Last System Error Message from Task 32:: Cannot allocate memory


(rank:32 hostname:node32 pid:27391):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))


Varying stack, heap, or global memory and ARMCI_DEFAULT_SHMMAX does not really change anything (if I set them low, a different error occurs). Setting MA_USE_ARMCI_MEM=y/n has no effect either.

ldd /home/diego/Software/NWchem/nwchem-6.1.1/bin/LINUX64/nwchem :
        linux-vdso.so.1 =>  (0x00007ffff7ffe000)
        libpython2.6.so.1.0 => /usr/lib64/libpython2.6.so.1.0 (0x0000003f3aa00000)
        libmkl_scalapack_ilp64.so => not found
        libmkl_intel_ilp64.so => not found
        libmkl_sequential.so => not found
        libmkl_core.so => not found
        libmkl_blacs_intelmpi_ilp64.so => not found
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003f39200000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003f38600000)
        libmpigf.so.4 => not found
        libmpi_ilp64.so.4 => not found
        libmpi.so.4 => not found
        libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x000000308aa00000)
        libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000000308a600000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003f39a00000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003f3c200000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003f38e00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003f38a00000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ffff7dce000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003f38200000)

So what could be the reason for the failure? Any help would be appreciated.

Diego

Forum Vet
Diego,
I have managed to get this input working on an Infiniband cluster using NWChem 6.3.
Here are some details of what I did for a run using 224 processors (16 processors on each of 14 nodes):

1) Increased the global memory in the memory input line to 1.6 GB:
memory global 1600 mb heap 100 mb stack 600 mb

2) Set ARMCI_DEFAULT_SHMMAX=8192

3) You need to have the system administrators modify some of the kernel driver options for your Infiniband hardware.
Here are some webpages related to this very topic:
http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
http://community.mellanox.com/docs/DOC-1120

In my case, the cluster I am using has the following parameters for the mlx4_core driver (older hardware might require different settings, as mentioned in the two webpages above); a sketch of how such options are typically applied follows the parameters below.
log_num_mtt=20
log_mtts_per_seg=4
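
One common way to apply these options (a sketch, assuming the driver options are set through modprobe configuration; the file path and reload procedure may differ on your distribution) is a file such as /etc/modprobe.d/mlx4_core.conf on each compute node:

   # /etc/modprobe.d/mlx4_core.conf -- requires root; reload the mlx4_core driver (or reboot the node) afterwards
   options mlx4_core log_num_mtt=20 log_mtts_per_seg=4

The Open MPI FAQ page linked above also discusses how to estimate the amount of registerable memory from these two parameters, so you can check that it covers the node's RAM.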

Cheers, Edo

