Running with OpenMPI on multiple nodes failing


So I now have a clue as to what's going on: a lack of InfiniBand memory. Here are the relevant error messages, although I am unsure how to resolve the problem:

1) This is from the PBS standard output:

(rank:0 hostname:hbar11 pid:14250):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))
(rank:16 hostname:hbar9 pid:1254):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))


2) This is from the PBS standard error. Note that my attempt to increase locked memory using "ulimit -l 800000", which works in .bashrc from an interactive session, failed inside the PBS job:

/home/lusol/.bashrc: line 14: ulimit: max locked memory: cannot modify limit: Operation not permitted
mpirun (Open MPI) 1.6.5

Report bugs to http://www.open-mpi.org/community/help/
/var/spool/pbs/mom_priv/jobs/1823.hbar1.cc.lehigh.edu.SC: line 42: ulimit: max locked memory: cannot modify limit: Operation not permitted
Last System Error Message from Task 0:: Cannot allocate memory
Last System Error Message from Task 16:: Cannot allocate memory


MPI_ABORT was invoked on rank 16 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 1.

...
...
...

Stack trace terminated abnormally.
[hbar11:14249] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[hbar11:14249] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
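
Since "ulimit -l 800000" succeeds interactively but is refused inside the job, it may help to print the limits the job shell actually inherits. The following is only a minimal sketch, assuming the job script runs under bash; the function name report_memlock is made up for illustration. An unprivileged process can raise its soft limit only up to the hard limit, so a hard value below 800000 would be consistent with the "Operation not permitted" message above.

```shell
#!/bin/bash
# Hypothetical helper: print the soft and hard locked-memory limits
# seen by the current shell (values are in kB, or "unlimited").
report_memlock() {
    echo "host=$(uname -n) soft=$(ulimit -S -l) hard=$(ulimit -H -l)"
}

# Call it near the top of the PBS job script, before mpirun,
# and compare the result with the same call in an interactive shell.
report_memlock
```

If the hard limit reported in the job is lower than in the interactive session, the limit is being set before the job script starts, and it cannot be raised from .bashrc or from the script itself.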


3) Finally, this is from the NWChem output file:

more nwOutput.dat 
argument 1 = nwInput.dat
(rank:0 hostname:hbar11 pid:14250):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))
(rank:16 hostname:hbar9 pid:1254):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))


Thanks for any insight.