5:59:12 PM PDT - Wed, Oct 2nd 2013
Hi, I have recently been installing NWChem 6.3 on a 720-core/1440-GB (60 nodes, each with two six-core CPUs and 24 GB of RAM) 64-bit Xeon cluster (from Dell) with an InfiniBand network fabric. I used the following configuration to compile the code.
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/home/tpirojsi/nwchem-test-6.3
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=OPENIB
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-lrdmacm -libumad -libverbs -lpthread -lrt"
export MSG_COMMS=MPI
export CC=gcc
export FC=gfortran
export NWCHEM_MODULES="all"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/home/tpirojsi/MPI
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
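With this environment set, the build itself was just the usual NWChem sequence (shown roughly from memory):
cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES=all
make FC=gfortran CC=gcc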
The code compiled and installed successfully. On 1 node (12 cores) it runs without any problem. However, when I run it on more than 1 node through an SGE-based submission script, I get the following errors most of the time.
qrsh_starter: cannot change to directory /home/tpirojsi/test: No such file or directory
qrsh_starter: cannot change to directory /home/tpirojsi/test: No such file or directory
A daemon (pid 25141) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
OR
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 11 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 15.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
mpirun noticed that process rank 0 with PID 24769 on node compute-0-37.local exited on signal 9 (Killed).
These are the two errors I see most often; only a few times have I managed to run a job successfully on more than one node. I know the second error usually points to a communication problem between the nodes, but I have no idea why it happens so frequently. It always occurs right at the beginning, just after the job has been launched. Could this be something with the cluster itself, or with the way I compiled the code?
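As a sanity check for the second error, I am thinking of launching a trivial command through the same Open MPI across two nodes and checking the InfiniBand port state, something along these lines (the host names are just examples):
# run a trivial cross-node job with the same mpirun that launches nwchem
$MPI_LOC/bin/mpirun -np 2 -host compute-0-0,compute-0-1 hostname
# check that the InfiniBand port is active on each node
ibv_devinfo | grep -i state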
For the first error, I have searched for answers online but still have not found a solution. The working directory did exist while the job was running, so I don't understand why the error claims otherwise.
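For reference, my submission script handles the working directory and library path roughly along these lines (the parallel environment name, slot count, and input file shown here are placeholders, not my exact values):
#!/bin/bash
#$ -S /bin/bash
#$ -cwd                     # run the job from the directory it was submitted from
#$ -pe orte 24              # placeholder parallel environment and slot count

# make sure the remote nodes can find the Open MPI libraries
export LD_LIBRARY_PATH=/home/tpirojsi/MPI/lib:$LD_LIBRARY_PATH

cd $SGE_O_WORKDIR           # change to the submit directory explicitly
mpirun -np $NSLOTS /home/tpirojsi/nwchem-test-6.3/bin/LINUX64/nwchem test.nw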
Also, is it possible that the code aborts when there is not enough memory/swap, for example because child processes from a previous job were never killed? I have seen that the swap space on many nodes is completely filled (last column below) even when the node is not in use.
HOSTNAME        ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO   SWAPUS
global          -           -     -     -       -       -        -
compute-0-0     lx26-amd64  12    0.04  23.6G   415.4M  1000.0M  0.0
compute-0-1     lx26-amd64  12    0.00  23.6G   412.6M  1000.0M  0.0
compute-0-10    lx26-amd64  12    0.00  23.6G   431.3M  996.2M   996.2M
compute-0-11    lx26-amd64  12    0.00  23.6G   538.2M  996.2M   0.0
compute-0-12    lx26-amd64  12    0.01  23.6G   524.1M  996.2M   0.0
compute-0-13    lx26-amd64  12    0.03  23.6G   416.2M  996.2M   0.0
compute-0-14    lx26-amd64  12    0.00  23.6G   426.4M  996.2M   996.2M
compute-0-15    lx26-amd64  12    0.00  23.6G   430.4M  996.2M   996.1M
compute-0-16    lx26-amd64  12    0.00  23.6G   494.1M  996.2M   0.0
compute-0-17    lx26-amd64  12    0.00  23.6G   425.0M  996.2M   0.0
compute-0-18    lx26-amd64  12    0.00  23.6G   477.4M  996.2M   995.9M
compute-0-19    lx26-amd64  12    0.00  23.6G   474.2M  996.2M   0.0
compute-0-2     lx26-amd64  12    0.01  23.6G   485.3M  996.2M   995.9M
If that is the case, is there a way to clear the memory/swap or to reboot the nodes before starting a new job?
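For example, would something along these lines, run on each affected node before a new job starts (just a sketch, and it needs root), be a reasonable way to do that?
# kill any nwchem processes left over from a previous job
pkill -9 nwchem
# push pages back out of swap by disabling and re-enabling it
swapoff -a && swapon -a
# confirm that swap usage has dropped back to zero
free -m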
Any advice is greatly appreciated.