Error running any job on a cluster running Ubuntu 12.04


Clicked A Few Times
Hi,
I am having trouble running any job on a 12-core cluster running Ubuntu 12.04; it has 6 identical nodes, each with 2 GB of physical memory. I run NWChem on top of Open MPI, and the job fails whenever I try to run it on more than one node.
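For completeness, I launch the job roughly like this (the hostfile name and process count here are only illustrative, not my exact command):

mpirun -np 4 --hostfile hosts.txt nwchem water.nw > water.out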

This is the error message.
 argument  1 = water.nw
0:Terminate signal was sent, status=: 15
(rank:0 hostname:cm07 pid:4375):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0
4 total processes killed (some possibly by mpirun during cleanup)
 argument  1 = water.nw
2:attach error:id=98304 off=33553984 seg=32768
3:attach error:id=98304 off=33553984 seg=32768
******************* ARMCI INFO ************************
******************* ARMCI INFO ************************
The application attempted to allocate a shared memory segment of 33554432 bytes in size. This might be in addition to segments that were allocated succesfully previously. The current system configuration does not allow enough shared memory to be allocated to the application.
This is most often caused by:
1) system parameter SHMMAX (largest shared memory segment) being too small or
2) insufficient swap space.
Please ask your system administrator to verify if SHMMAX matches the amount of memory needed by your application and the system has sufficient amount of swap space. Most UNIX systems can be easily reconfigured to allow larger shared memory segments,
see http://www.emsl.pnl.gov/docs/global/support.html
In some cases, the problem might be caused by insufficient swap space.
*******************************************************
2:Attach_Shared_Region:failed to attach to segment id=: 98304
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 4 DUP FROM 0 
with errorcode 98304.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
(rank:2 hostname:cm07 pid:3749):ARMCI DASSERT fail. ../../ga-5-1/armci/src/memory/shmem.c:Attach_Shared_Region():1050 cond:0
Last System Error Message from Task 2:: Invalid argument
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 3749 on
node cm07.02 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).


The error message is repeated a number of times depending on how many processors I run the job on.

This is my input file
start h2o_freq

charge 1
geometry units angstroms
 O       0.0  0.0  0.0
 H       0.0  0.0  1.0
 H       0.0  1.0  0.0
end
basis
 H library sto-3g
 O library sto-3g
end
scf
 uhf; doublet
 print low
end
title "H2O+ : STO-3G UHF geometry optimization"
task scf optimize


The nodes are connected to a switch with Cat 5e cables. I cannot run any job at all across nodes; NWChem does not even start writing its output.

I have tried changing the shmmax value, but it makes no difference even when I set it to the physical memory of one node or to the total across all nodes.

Thank you for your attention.

Forum Regular
Hi,
SHMMAX is a kernel parameter that specifies how much shared memory a process can allocate. As far as I can see, Ubuntu seems to set this to 33554432 bytes (about 33 MB), but NWChem will by default try to allocate about 200 MB, so the calculation fails right at the beginning.

Setting SHMMAX as an environment variable is not going to solve this, because it is a kernel parameter: you can set SHMMAX to anything you want in your environment, but if the kernel will not let you have that much shared memory it won't help. What you need to do is change the kernel parameter itself. The following page shows how: http://www.linuxforums.org/forum/red-hat-fedora-linux/17025-how-can-i-change-shmmax.html. It refers to a different Linux distribution, but these basic settings do not differ much between distributions. You will probably need root permissions, so you may have to ask a system administrator to do it.
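As a rough sketch (the 512 MB value below is only an example; it just needs to be at least as large as what NWChem asks for, and the change has to be made on every node, as root):

# temporary change, takes effect immediately but is lost on reboot
sysctl -w kernel.shmmax=536870912

# permanent change: add the setting to /etc/sysctl.conf and reload it
echo "kernel.shmmax = 536870912" >> /etc/sysctl.conf
sysctl -p

# verify the value the kernel is actually using
cat /proc/sys/kernel/shmmax

On some systems kernel.shmall (the total amount of shared memory, counted in pages) may also need to be raised.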
I hope this helps,
Huub

Clicked A Few Times
Quote: Huub Nov 28th 9:41 am

Hi,
I have already set shmmax through /etc/sysctl.conf, on all nodes, to a value larger than the one stated in the error message, but the problem persists. However, if I use mpirun across two cores locally, NWChem runs fine. Is there any other likely cause for this problem?
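For what it is worth, this is roughly how I check the active value on every node (hosts.txt is just a placeholder for my machine file):

# print the live kernel.shmmax on each node listed in the machine file
for host in $(cat hosts.txt); do
    echo -n "$host: "
    ssh "$host" cat /proc/sys/kernel/shmmax
done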
Thank you.

Forum Vet
Single node jobs
Do you get the same failure when you run on a single node using two cores?
Thanks, Edo

Clicked A Few Times
No, the jobs run fine when using two cores on a single node.

