7:57:15 AM PDT - Mon, May 21st 2012
Someone in my lab has been trying to run some calculations with NWChem, and I've been helping him get it running with MPI. It has been a bumpy ride, however. At first, we were getting errors about not being able to allocate a shared block of memory. SHMMAX was already plenty high (as large as all of the physical memory), so I created a swap file, restarted the calculations, and waited.
Now they are crashing again, but this time with a very different error message. From what I can tell, it is another shared-memory allocation problem, but it looks like NWChem is passing an invalid (i.e., negative) size to the allocation function.
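For anyone wanting to repeat the SHMMAX comparison, here is a minimal sketch of how it could be checked, assuming a Linux system that exposes the limit at /proc/sys/kernel/shmmax. The function names are illustrative only and are not part of NWChem, GA, or ARMCI.

```python
import os


def read_shmmax(path="/proc/sys/kernel/shmmax"):
    """Return the kernel's maximum shared memory segment size, in bytes."""
    with open(path) as f:
        return int(f.read().strip())


def physical_memory_bytes():
    """Return total physical RAM in bytes (uses Linux sysconf names)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def shmmax_covers_ram(shmmax, ram):
    """True if a single shared segment may be as large as physical RAM."""
    return shmmax >= ram
```

Calling `shmmax_covers_ram(read_shmmax(), physical_memory_bytes())` on the compute node shows whether the limit is at least as large as physical memory, which is the situation described above.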
Here is the relevant error message:
0:CreateSharedRegion:kr_malloc failed KB=: -772361
(rank:0 hostname:vivaldi.chem.utk.edu pid:3128):ARMCI DASSERT fail. ../../ga-5-1/armci/src/memory/shmem.c:Create_Shared_Region():1188 cond:0
Last System Error Message from Task 0:: Numerical result out of range
application called MPI_Abort(comm=0x84000007, -772361) - process 0
rank 0 in job 2 vivaldi.chem.utk.edu_35229 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
And just in case, here is a link to the full output file:
http://web.eecs.utk.edu/~dbauer3/nwchem/macrofe_full631f.out
Running under Ubuntu 11.04 with MPICH2.
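One possible reading of the negative KB value, offered only as a guess: if a byte count of a little over 3 GiB were stored in a signed 32-bit integer somewhere along the way, it would wrap around to a negative number. The sketch below just does that arithmetic. The request size in it is hypothetical, chosen because it reproduces the reported number, and nothing here confirms that ARMCI or NWChem actually uses a 32-bit integer at this point.

```python
def to_int32(n):
    """Interpret the low 32 bits of n as a signed two's-complement integer."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


# Hypothetical request size in bytes (about 3.26 GiB); an assumption,
# not a value taken from the NWChem output.
requested_bytes = 3_504_069_632

wrapped_bytes = to_int32(requested_bytes)  # -790897664
wrapped_kb = wrapped_bytes // 1024         # -772361

print(wrapped_bytes, wrapped_kb)
```

If something like this is what is happening, the memory settings in the NWChem input would be the first place to look, since a per-process request below 2 GiB could not wrap this way.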