7:56:19 AM PDT - Mon, Aug 20th 2012
Hi,
it seems this thread needs to be reopened.
NWChem 6.1.1 does not run across nodes on my system either, while NWChem 6.0 runs fine.
When run across nodes, 6.1.1 (and also the initial 6.1 release) crashes with:
argument 1 = ../nwchem.nw
-10000:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10000 hostname:d071.dcsc.fysik.dtu.dk pid:20939):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_AcceptSockAll():673 cond:0
0:armci_rcv_data: read failed: -1
(rank:0 hostname:d071.dcsc.fysik.dtu.dk pid:20936):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/dataserv.c:armci_ReadFromDirect():439 cond:0
-10002:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10002 hostname:d031.dcsc.fysik.dtu.dk pid:22561):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_AcceptSockAll():673 cond:0
2:Child process terminated prematurely, status=: 256
(rank:2 hostname:d031.dcsc.fysik.dtu.dk pid:22558):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigChldHandler():178 cond:0
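In case it helps anyone reading a longer log, here is a minimal sketch that condenses the ARMCI DASSERT lines to one line per failure (host, rank, source location). The function name `summarize_armci_failures` and the log file argument are my own illustration, and the sketch assumes the log lines have exactly the shape shown above.

```shell
# Hypothetical helper: condense ARMCI DASSERT lines from a saved job log.
# Assumes each failure line looks like:
#   (rank:N hostname:H pid:P):ARMCI DASSERT fail. PATH/FILE:FUNC():LINE cond:0
summarize_armci_failures() {
    awk '/ARMCI DASSERT fail/ {
        rank = $1; sub(/^\(rank:/, "", rank)    # strip the "(rank:" prefix
        host = $2; sub(/^hostname:/, "", host)  # strip the "hostname:" prefix
        loc  = $6; sub(/^.*\//, "", loc)        # keep only FILE:FUNC():LINE
        print host, "rank=" rank, loc
    }' "$1"
}
```

Calling `summarize_armci_failures job.log` (with `job.log` being wherever the output was saved) prints lines such as `d071.dcsc.fysik.dtu.dk rank=-10000 sockets.c:armci_AcceptSockAll():673`, which makes it easier to see which nodes report which failure.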
I built http://nwchemgit.github.io/images/Nwchem-6.1.1-src.2012-06-27.tar.gz against Open MPI 1.3.3 with Torque support on CentOS 5, x86_64, using the following script (irrelevant parts of the filesystem paths are replaced by ...):
export NWCHEM_TOP=/.../nwchem-6.1.1-src
export NWCHEM_TARGET=LINUX64
export CC=gcc
export FC=gfortran
export LD_LIBRARY_PATH=/.../lib64
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPIEXEC=/.../bin/mpiexec
export MPI_LIB=/.../lib64
export MPI_INCLUDE=/.../include/
export LIBMPI='-L/.../lib64 -lmpi -lmpi_f90 -lmpi_f77'
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export TCGRSH=ssh
export PYTHONHOME=/usr
export PYTHONVERSION=2.4
export PYTHONLIBTYPE=a
export USE_PYTHON64=y
export HAS_BLAS=yes
export BLASOPT="-L/usr/lib64 -lblas -llapack"
make nwchem_config NWCHEM_MODULES="all python" 2>&1 | tee make_nwchem_config.log
make 64_to_32 2>&1 | tee make_64_to_32.log
make USE_64TO32=y 2>&1 | tee make.log
I run the following example (with mpiexec `which nwchem` nwchem.nw):
geometry noautoz noautosym
O 0.0 0.0 1.245956
O 0.0 0.0 0.0
end
basis spherical
* library cc-pvdz
end
dft
mult 3
xc xpbe96 cpbe96
smear 0.0
direct
noio
end
task dft energy
I have also tried specifying the node file explicitly by passing --hostfile ${PBS_NODEFILE} to mpiexec.
On the nodes, I see just one nwchem process per node running at 100% CPU; the other instances show 0% CPU load.
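To compare the observed processes against what the scheduler handed out, a small sketch like the following counts the slots per host in the node file. The function name `count_slots_per_host` is hypothetical, and it assumes the usual Torque layout in which ${PBS_NODEFILE} lists one hostname per line for each allocated slot.

```shell
# Hypothetical helper: count allocated slots per host in a Torque node file.
# Assumes one hostname per line, repeated once per slot.
count_slots_per_host() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'   # print "host count"
}
```

Inside the job script this would be called as `count_slots_per_host "${PBS_NODEFILE}"`; the counts should match the number of nwchem processes expected on each node.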