Hi,
I cannot run any job on a 12-core cluster running Ubuntu 12.04. The cluster has 6 identical nodes, each with 2 GB of physical memory. I run NWChem with Open MPI, and this is the error I get when I try to run on more than one node:
argument 1 = water.nw
0:Terminate signal was sent, status=: 15
(rank:0 hostname:cm07 pid:4375):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0
4 total processes killed (some possibly by mpirun during cleanup)
argument 1 = water.nw
2:attach error:id=98304 off=33553984 seg=32768
3:attach error:id=98304 off=33553984 seg=32768
******************* ARMCI INFO ************************
******************* ARMCI INFO ************************
The application attempted to allocate a shared memory segment of 33554432 bytes in size. This might be in addition to segments that were allocated succesfully previously. The current system configuration does not allow enough shared memory to be allocated to the application.
This is most often caused by:
1) system parameter SHMMAX (largest shared memory segment) being too small or
2) insufficient swap space.
Please ask your system administrator to verify if SHMMAX matches the amount of memory needed by your application and the system has sufficient amount of swap space. Most UNIX systems can be easily reconfigured to allow larger shared memory segments,
see http://www.emsl.pnl.gov/docs/global/support.html
In some cases, the problem might be caused by insufficient swap space.
*******************************************************
2:Attach_Shared_Region:failed to attach to segment id=: 98304
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 98304.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
(rank:2 hostname:cm07 pid:3749):ARMCI DASSERT fail. ../../ga-5-1/armci/src/memory/shmem.c:Attach_Shared_Region():1050 cond:0
Last System Error Message from Task 2:: Invalid argument
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 3749 on
node cm07.02 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
The message is repeated, and the number of repetitions depends on the number of processors I run the job on.
This is my input file:
start h2o_freq
charge 1
geometry units angstroms
O 0.0 0.0 0.0
H 0.0 0.0 1.0
H 0.0 1.0 0.0
end
basis
H library sto-3g
O library sto-3g
end
scf
uhf; doublet
print low
end
title "H2O+ : STO-3G UHF geometry optimization"
task scf optimize
My nodes are connected to a switch with Cat 5e cables. I cannot run any job at all; the run fails before NWChem even starts writing its output.
I have tried changing the shmmax value, but it does not help, even when I set it to the physical memory of one node or to the total memory of all nodes.
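For reference, here is a minimal sketch of one way to confirm which shmmax value is actually in effect on a node, and whether the 33554432-byte segment from the ARMCI message fits within it. It assumes a Linux node that exposes the limit at /proc/sys/kernel/shmmax. The helper names `read_limit` and `segment_fits` are made up for this illustration and are not part of NWChem or ARMCI.

```python
from pathlib import Path

# Segment size ARMCI reported it could not allocate (taken from the log above).
REQUESTED_BYTES = 33554432


def read_limit(name, proc_dir="/proc/sys/kernel"):
    """Return a kernel limit such as 'shmmax' as an integer.

    proc_dir is a parameter only so the function can be pointed at a
    different directory for testing; on a real node the default applies.
    """
    return int(Path(proc_dir, name).read_text().split()[0])


def segment_fits(requested_bytes, shmmax_bytes):
    """True if one segment of requested_bytes does not exceed SHMMAX."""
    return requested_bytes <= shmmax_bytes
```

Running `print(read_limit("shmmax"), segment_fits(REQUESTED_BYTES, read_limit("shmmax")))` on every node, not only on the one where mpirun is started, would show whether the changed value is really applied everywhere. This only checks the per-segment limit; it says nothing about swap space or the total shared memory already in use.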
Thank you for your attention.