Errors Running Nwchem 6.3

Clicked A Few Times

5:59:12 PM PDT - Wed, Oct 2nd 2013
Hi, recently I have been installing the nwchem 6.3 on a 720-core/1440-GB (60 dual-cpu/six-core/24GB nodes) 64-bit Xeon Cluster (from Dell), with InfiniBand network fabric. I have used the following configuration to compile the code. export LARGE_FILES=TRUE export USE_NOFSCHECK=TRUE export LIB_DEFINES="-DDFLT_TOT_MEM=16777216" export TCGRSH=/usr/bin/ssh export NWCHEM_TOP=/home/tpirojsi/nwchem-test-6.3 export NWCHEM_TARGET=LINUX64 export ARMCI_NETWORK=OPENIB export IB_HOME=/usr export IB_INCLUDE=$IB_HOME/include export IB_LIB=$IB_HOME/lib64 export IB_LIB_NAME="-lrdmacm -libumad -libverbs -lpthread -lrt" export MSG_COMMS=MPI export CC=gcc export FC=gfortran export NWCHEM_MODULES="all" export USE_MPI=y export USE_MPIF=y export USE_MPIF4=y export MPI_LOC=/home/tpirojsi/MPI export MPI_LIB=$MPI_LOC/lib export MPI_INCLUDE=$MPI_LOC/include export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil" The code was successfully compiled and installed. For 1 node (12 cpus), there is no problem running the code. However, most of the time when I run the code using sge-base script for more than 1 node, I got the following errors. qrsh_starter: cannot change to directory /home/tpirojsi/test: No such file or directory qrsh_starter: cannot change to directory /home/tpirojsi/test: No such file or directory A daemon (pid 25141) died unexpectedly with status 1 while attempting to launch so we are aborting. There may be more information reported by the environment (see above). This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes. OR -------------------------------------------------------------------------- MPI_ABORT was invoked on rank 11 in communicator MPI COMMUNICATOR 4 DUP FROM 0 with errorcode 15. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. mpirun noticed that process rank 0 with PID 24769 on node compute-0-37.local exited on signal 9 (Killed). These two errors most commonly occur. There were only few times I could successfully run the job with more than one node. I know the second error is commonly dealing with the communication between nodes but I have no idea why this happens frequently. This error always happens at the beginning after the job has been launched. Is there something to do with the cluster itself or with the way I compiled the code? For the first error, even trying to search for some answers online, I still haven't got any solutions. Indeed, The working directory did exist when running the code but I don't know why the error showed such a thing. Also, Is it possible that the code is aborted if there is not enough memory/swap, as a result from unterminated child process in the previous job not being killed? I have seen the swap memory for many nodes was filled (last column) even though the node has not been used. HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS global - - - - - - - compute-0-0 lx26-amd64 12 0.04 23.6G 415.4M 1000.0M 0.0 compute-0-1 lx26-amd64 12 0.00 23.6G 412.6M 1000.0M 0.0 compute-0-10 lx26-amd64 12 0.00 23.6G 431.3M 996.2M 996.2M compute-0-11 lx26-amd64 12 0.00 23.6G 538.2M 996.2M 0.0 compute-0-12 lx26-amd64 12 0.01 23.6G 524.1M 996.2M 0.0 compute-0-13 lx26-amd64 12 0.03 23.6G 416.2M 996.2M 0.0 compute-0-14 lx26-amd64 12 0.00 23.6G 426.4M 996.2M 996.2M compute-0-15 lx26-amd64 12 0.00 23.6G 430.4M 996.2M 996.1M compute-0-16 lx26-amd64 12 0.00 23.6G 494.1M 996.2M 0.0 compute-0-17 lx26-amd64 12 0.00 23.6G 425.0M 996.2M 0.0 compute-0-18 lx26-amd64 12 0.00 23.6G 477.4M 996.2M 995.9M compute-0-19 lx26-amd64 12 0.00 23.6G 474.2M 996.2M 0.0 compute-0-2 lx26-amd64 12 0.01 23.6G 485.3M 996.2M 995.9M If that is the case, are there any ways we could clear up the memory/swap or reboot the nodes before starting a new job? Any advice is greatly appreciated.

Forum Vet

3:20:16 PM PDT - Mon, Oct 7th 2013
Shared memory segments
Tee, Your memory might be taken by the shared memory segments allocated by the failed NWChem runs. You can check their existence with the command `ipcs -a` There is a script that can clean all of these leftover segments. The scripts is shipped on any NWChem source as `$NWCHEM_TOP/src/tools/global/testing/ipcreset` HCeers, Edo

Clicked A Few Times

4:23:42 PM PDT - Mon, Oct 7th 2013
Quote:Edoapra Oct 7th 10:20 pm Tee, Your memory might be taken by the shared memory segments allocated by the failed NWChem runs. You can check their existence with the command `ipcs -a` There is a script that can clean all of these leftover segments. The scripts is shipped on any NWChem source as `$NWCHEM_TOP/src/tools/global/testing/ipcreset` HCeers, Edo Hi Edo. Thank you for your reply. So far the problem I only see is the one with qrsh_starter which I still don't know how this problem arises. It randomly occurs though. I've tried your suggestion and this is what I got: ------ Shared Memory Segments -------- key shmid owner perms bytes nattch status Semaphore Arrays -------- key semid owner perms nsems Message Queues -------- key msqid owner perms used-bytes messages So it doesn't seem that the memory is used by failed NWChem runs, right?

Forum Vet

5:22:42 PM PDT - Mon, Oct 7th 2013
Have you checked if the NWChem processes have been properly terminated?

Clicked A Few Times

6:09:39 PM PDT - Mon, Oct 7th 2013
Quote:Edoapra Oct 8th 12:22 am Have you checked if the NWChem processes have been properly terminated? Edo, by running 'qstat' I didn't see any processes running, so I think the job has been terminated. Is that what you meant? However, even when the job was running, when I tried 'ipcs -a' command, it showed up nothing too. Did it sound weird to you? job-ID prior name user state submit/start at queue slots ja-task-ID 72312 0.50500 neb-0 tpirojsi r 10/07/2013 18:06:23 batch.q@compute-0-41.local 12 [tpirojsi@ccom-boom test]$ ipcs -a Shared Memory Segments -------- key shmid owner perms bytes nattch status Semaphore Arrays -------- key semid owner perms nsems Message Queues -------- key msqid owner perms used-bytes messages

Forum Vet

9:44:50 AM PDT - Tue, Oct 8th 2013

set number of threads to 1

Did you set the env. variables


OMP_NUM_THREADS

and


GOTO_NUM_THREADS

to 1?

http://www.tacc.utexas.edu/tacc-projects/gotoblas2/faq

Forum Vet

9:46:11 AM PDT - Tue, Oct 8th 2013
Tee qstat is not checking the running processes on the compute nodes. In order to check the status of running processes (and for the ipcs output as well), you need to login to the compute nodes (using ssh, for example)

Clicked A Few Times

11:28:05 AM PDT - Tue, Oct 8th 2013
Quote:Edoapra Oct 8th 4:46 pm Tee qstat is not checking the running processes on the compute nodes. In order to check the status of running processes (and for the ipcs output as well), you need to login to the compute nodes (using ssh, for example) Oh! Thank you for shedding some light on this for me. You are absolutely correct. I logged in to computer nodes, and punched in the 'ipcs -a' command and did see some shared memory segments showed up. I used the tools provided in nwchem package to clean them up and saw the swap space came back to normal. It seems to solve qrsh_starter problem too! I really appreciate your help. Tee

Clicked A Few Times

1:48:59 PM PDT - Tue, Oct 8th 2013
Quote:Tpirojsi Oct 8th 6:28 pm Quote:Edoapra Oct 8th 4:46 pm Tee qstat is not checking the running processes on the compute nodes. In order to check the status of running processes (and for the ipcs output as well), you need to login to the compute nodes (using ssh, for example) Oh! Thank you for shedding some light on this for me. You are absolutely correct. I logged in to computer nodes, and punched in the 'ipcs -a' command and did see some shared memory segments showed up. I used the tools provided in nwchem package to clean them up and saw the swap space came back to normal. It seems to solve qrsh_starter problem too! I really appreciate your help. Tee Indeed, I still see some qrsh_starter errors but very few. Do you have any idea what is it related to?

Forum >> NWChem's corner >> Running NWChem