Errors Running NWChem 6.3


Clicked A Few Times
Hi, I have recently been installing NWChem 6.3 on a 720-core/1440-GB, 64-bit Xeon cluster from Dell (60 nodes, each with two six-core CPUs and 24 GB of memory) with an InfiniBand network fabric. I used the following configuration to compile the code.

export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/home/tpirojsi/nwchem-test-6.3
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=OPENIB
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-lrdmacm -libumad -libverbs -lpthread -lrt"
export MSG_COMMS=MPI
export CC=gcc
export FC=gfortran
export NWCHEM_MODULES="all"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/home/tpirojsi/MPI
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
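
For reference, the SGE submission script I use to launch the multi-node runs is roughly of the following form (the job name, the parallel-environment name "orte", the slot count, and the input/output file names are only placeholders; the queue and the path to the nwchem binary match my setup):

#!/bin/bash
# SGE directives: job name, run in the submission directory,
# request the batch queue and 24 MPI slots from the "orte" parallel environment.
#$ -N nwchem-job
#$ -cwd
#$ -q batch.q
#$ -pe orte 24
# NSLOTS is set by SGE to the number of slots actually granted.
mpirun -np $NSLOTS /home/tpirojsi/nwchem-test-6.3/bin/LINUX64/nwchem input.nw > output.out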


The code compiled and installed successfully. On a single node (12 CPUs) there is no problem running it. However, most of the time when I run it on more than one node through an SGE-based script of the general form shown above, I get the following errors.

qrsh_starter: cannot change to directory /home/tpirojsi/test: No such file or directory
qrsh_starter: cannot change to directory /home/tpirojsi/test: No such file or directory


A daemon (pid 25141) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.


OR
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 11 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 15.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.




mpirun noticed that process rank 0 with PID 24769 on node compute-0-37.local exited on signal 9 (Killed).


These are the two errors that occur most often. Only a few times have I managed to run a job successfully on more than one node. I know the second error usually has to do with communication between nodes, but I have no idea why it happens so frequently. It always appears at the very beginning, right after the job has been launched. Does it have something to do with the cluster itself, or with the way I compiled the code?

For the first error, even after searching online for answers, I still have not found a solution. The working directory did exist while the code was running, so I don't know why the error claims otherwise.

Also, is it possible that the code gets aborted when there is not enough memory/swap, for example because child processes from a previous job were never killed? I have seen that the swap space on many nodes is completely filled (last column below) even though the nodes are not in use.

HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
global - - - - - - -
compute-0-0 lx26-amd64 12 0.04 23.6G 415.4M 1000.0M 0.0
compute-0-1 lx26-amd64 12 0.00 23.6G 412.6M 1000.0M 0.0
compute-0-10 lx26-amd64 12 0.00 23.6G 431.3M 996.2M 996.2M
compute-0-11 lx26-amd64 12 0.00 23.6G 538.2M 996.2M 0.0
compute-0-12 lx26-amd64 12 0.01 23.6G 524.1M 996.2M 0.0
compute-0-13 lx26-amd64 12 0.03 23.6G 416.2M 996.2M 0.0
compute-0-14 lx26-amd64 12 0.00 23.6G 426.4M 996.2M 996.2M
compute-0-15 lx26-amd64 12 0.00 23.6G 430.4M 996.2M 996.1M
compute-0-16 lx26-amd64 12 0.00 23.6G 494.1M 996.2M 0.0
compute-0-17 lx26-amd64 12 0.00 23.6G 425.0M 996.2M 0.0
compute-0-18 lx26-amd64 12 0.00 23.6G 477.4M 996.2M 995.9M
compute-0-19 lx26-amd64 12 0.00 23.6G 474.2M 996.2M 0.0
compute-0-2 lx26-amd64 12 0.01 23.6G 485.3M 996.2M 995.9M


If that is the case, is there any way to clear the memory/swap or to reboot the nodes before starting a new job?

Any advice is greatly appreciated.

Forum Vet
Shared memory segments
Tee,
Your memory might be taken by the shared memory segments allocated by the failed NWChem runs.
You can check their existence with the command

ipcs -a

There is a script that can clean up all of these leftover segments. It is shipped with every NWChem source tree as

$NWCHEM_TOP/src/tools/global/testing/ipcreset
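
If the script is not at hand, a minimal manual cleanup along these lines (run on each compute node as the owner of the segments) should do the same job:

# remove the current user's leftover shared memory segments
for id in $(ipcs -m | awk -v u="$USER" '$3 == u {print $2}'); do ipcrm -m "$id"; done
# remove the current user's leftover semaphore arrays
for id in $(ipcs -s | awk -v u="$USER" '$3 == u {print $2}'); do ipcrm -s "$id"; done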

Cheers, Edo

Clicked A Few Times
Quote:Edoapra Oct 7th 10:20 pm
Tee,
Your memory might be taken by the shared memory segments allocated by the failed NWChem runs.
You can check their existence with the command

ipcs -a

There is a script that can clean up all of these leftover segments. It is shipped with every NWChem source tree as

$NWCHEM_TOP/src/tools/global/testing/ipcreset

Cheers, Edo


Hi Edo.

Thank you for your reply. So far the only problem I see is the qrsh_starter one, and I still don't know how it arises; it occurs at random. I've tried your suggestion, and this is what I got:

------ Shared Memory Segments --------
key shmid owner perms bytes nattch status


------ Semaphore Arrays --------
key semid owner perms nsems


------ Message Queues --------
key msqid owner perms used-bytes messages


So it doesn't seem that the memory is used by failed NWChem runs, right?

Forum Vet
Have you checked if the NWChem processes have been properly terminated?

Clicked A Few Times
Quote:Edoapra Oct 8th 12:22 am
Have you checked if the NWChem processes have been properly terminated?



Edo, running 'qstat' I didn't see any processes still listed, so I think the job has been terminated. Is that what you meant? However, even while a job was running, the 'ipcs -a' command showed nothing either. Does that seem strange to you?

job-ID prior name user state submit/start at queue slots ja-task-ID
 72312 0.50500 neb-0 tpirojsi r 10/07/2013 18:06:23 batch.q@compute-0-41.local 12
[tpirojsi@ccom-boom test]$ ipcs -a


------ Shared Memory Segments --------
key shmid owner perms bytes nattch status


------ Semaphore Arrays --------
key semid owner perms nsems


------ Message Queues --------
key msqid owner perms used-bytes messages

Forum Vet
Set the number of threads to 1
Did you set the env. variables

OMP_NUM_THREADS

and

GOTO_NUM_THREADS


to 1?
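
For example, adding these two lines to the job script before the mpirun call should be enough:

# run each MPI process single-threaded (both the OpenMP layer and GotoBLAS)
export OMP_NUM_THREADS=1
export GOTO_NUM_THREADS=1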

http://www.tacc.utexas.edu/tacc-projects/gotoblas2/faq

Forum Vet
Tee
qstat does not check the processes running on the compute nodes.
To check the status of running processes (and to get the ipcs output as well), you need to log in to the compute nodes (using ssh, for example).
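
For instance, something along these lines run from the head node will show what is still alive on a given set of nodes (the node names are just examples):

# list the user's processes and the IPC status on each compute node
for node in compute-0-37 compute-0-41; do
  echo "=== $node ==="
  ssh "$node" "ps -u $USER -o pid,etime,comm; ipcs -a"
done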

Clicked A Few Times
Quote:Edoapra Oct 8th 4:46 pm
Tee
qstat does not check the processes running on the compute nodes.
To check the status of running processes (and to get the ipcs output as well), you need to log in to the compute nodes (using ssh, for example).



Oh! Thank you for shedding some light on this for me. You are absolutely correct. I logged in to the compute nodes, ran the 'ipcs -a' command, and did indeed see some shared memory segments. I used the cleanup tool provided with the NWChem package to remove them and saw the swap space return to normal. It seems to have solved the qrsh_starter problem too!

I really appreciate your help.

Tee

Clicked A Few Times
Quote:Tpirojsi Oct 8th 6:28 pm
Quote:Edoapra Oct 8th 4:46 pm
Tee
qstat does not check the processes running on the compute nodes.
To check the status of running processes (and to get the ipcs output as well), you need to log in to the compute nodes (using ssh, for example).



Oh! Thank you for shedding some light on this for me. You are absolutely correct. I logged in to the compute nodes, ran the 'ipcs -a' command, and did indeed see some shared memory segments. I used the cleanup tool provided with the NWChem package to remove them and saw the swap space return to normal. It seems to have solved the qrsh_starter problem too!

I really appreciate your help.

Tee



Actually, I do still see the occasional qrsh_starter error, but only rarely. Do you have any idea what it might be related to?


Forum >> NWChem's corner >> Running NWChem