ARMCI DASSERT fails when running NWCHEM on 512 cores


Dear all,

Some runs of NWCHEM (versions 6.0 and 6.1.1) end in a failed assertion on our Intel Nehalem cluster (roughly 2.8 GB of
memory is available per core) when running on 512 cores.
The message is as follows:

"0:Terminate signal was sent, status=: 15 (rank:0 hostname:jj20c79 pid:16917):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0"

We checked the memory consumption of the run, and memory use appears to grow from one iteration step to the next without ever being released. Because our system does not allow swapping, the run crashes after a while. The user who brought this behavior to our attention suspects that the ARMCI driver does not free memory in the way the Global Arrays toolkit expects. Could this be the case?
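For anyone who wants to reproduce the measurement, the following is a minimal sketch of how the per-process memory growth can be sampled on a compute node. It assumes a Linux /proc filesystem and reads the VmRSS field; the function names rss_kb and watch_rss and the 60-second default interval are only illustrative, not part of NWCHEM or ARMCI.

```shell
# Hypothetical helper: print the resident set size in kB from a
# /proc/<pid>/status-style file passed as the first argument.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "$1"
}

# Hypothetical helper: log "<unix time> <RSS in kB>" for one process
# every INTERVAL seconds until the process exits.
watch_rss() {
    pid=$1
    interval=${2:-60}
    while [ -r "/proc/$pid/status" ]; do
        printf '%s %s\n' "$(date +%s)" "$(rss_kb "/proc/$pid/status")"
        sleep "$interval"
    done
}
```

Running something like "watch_rss <pid of one nwchem rank> 60 > rss.log" on a node during the job gives a time series, which shows whether the resident size keeps rising across iteration steps.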

The code is compiled and linked with the Intel compiler 12.1.4 (OS: SUSE Linux Enterprise Server 11 (x86_64), Kernel: 2.6.32.59-0.3-default).

The following settings were used:

export PYTHONHOME="/usr/bin/python"
export PYTHONVERSION="2.6"

export MSG_COMMS="MPI"

export LARGE_FILES="TRUE"

export ARMCI_NETWORK=OPENIB
export NWCHEM_MODULES="all"
export NWCHEM_TOP="/some_directory/nwchem-6.1.1"

export NWCHEM_TARGET="LINUX64"

export USE_MPI="y"
export USE_MPIF="y"
export USE_MPIF4="y"

export MPI_HOME=/usr/local/parastation/mpi2-intel-5.0.26-1/
export MPI_LIB=$MPI_HOME/lib
export MPI_INCLUDE=$MPI_HOME/include
export LIBMPI="-lmpich"
export PATH=$PATH:$MPI_HOME/bin

cd $NWCHEM_TOP/src

make FDEBUG="-g" CDEBUG="-g" FC="ifort" CC="icc" nwchem_config
make FDEBUG="-g" CDEBUG="-g" FC="ifort" CC="icc" >& make.log

cd $NWCHEM_TOP/src/util

make FDEBUG="-g" CDEBUG="-g" FC="ifort" CC="icc" version
make FDEBUG="-g" CDEBUG="-g" FC="ifort" CC="icc"
cd $NWCHEM_TOP/src
make FDEBUG="-g" CDEBUG="-g" FC="ifort" CC="icc" link


Using the GCC compiler suite instead does not solve the problem either.
We also tried the following environment variables, without success:

ARMCI_DEFAULT_SHMMAX=4096 (or 1024 or 2048)
MA_USE_ARMCI_MEM=1
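
One thing that may be worth ruling out is whether the requested shared-memory size fits within the kernel limit on the compute nodes. The sketch below assumes that ARMCI_DEFAULT_SHMMAX is interpreted in megabytes (our reading, which may be wrong) and compares it with the limit that Linux reports in bytes in /proc/sys/kernel/shmmax. The function name shmmax_fits is purely illustrative.

```shell
# Hypothetical helper: succeed if REQUESTED_MB (assumed to be in megabytes)
# does not exceed KERNEL_BYTES (the kernel shmmax limit, in bytes).
shmmax_fits() {
    requested_mb=$1
    kernel_bytes=$2
    [ $((requested_mb * 1024 * 1024)) -le "$kernel_bytes" ]
}

# Example use on a compute node (left as a comment so nothing runs on load):
#   if shmmax_fits "${ARMCI_DEFAULT_SHMMAX:-0}" "$(cat /proc/sys/kernel/shmmax)"; then
#       echo "requested size fits within kernel.shmmax"
#   else
#       echo "requested size exceeds kernel.shmmax"
#   fi
```

If the requested size exceeded the kernel limit, that would point to a node configuration issue rather than to a leak, so this check only narrows down the cause.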

Has anyone else encountered the same problem?
Any help with this is highly appreciated.

Best regards

Alexander