Memory problem in parallel run: "ARMCI DASSERT fail"


Edo,

Thanks for your advice, but this still does not work.

When I run "setenv ARMCI_DEFAULT_SHMMAX 36000", I get this warning:
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=36000

and the output error is:
 2-e (intermediate) file size =       279803475
 2-e (intermediate) file name = ./tmp/fh2.v2i
14: WARNING:armci_set_mem_offset: offset changed 0 to 12124160
13: WARNING:armci_set_mem_offset: offset changed 0 to 12124160
18: WARNING:armci_set_mem_offset: offset changed 0 to 12124160
1: WARNING:armci_set_mem_offset: offset changed 0 to 12128256
2: WARNING:armci_set_mem_offset: offset changed 0 to 12124160
6: WARNING:armci_set_mem_offset: offset changed 0 to 12124160
25: WARNING:armci_set_mem_offset: offset changed 0 to 11993088
31: WARNING:armci_set_mem_offset: offset changed 67596288 to 79589376
(rank:24 hostname:compute-11-3.local pid:8753):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_server_register_region():1124 cond:(memhdl->memhndl!=((void *)0))
Last System Error Message from Task 24:: Cannot allocate memory
(rank:12 hostname:compute-11-4.local pid:4225):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_server_register_region():1124 cond:(memhdl->memhndl!=((void *)0))
Last System Error Message from Task 12:: Cannot allocate memory
(rank:0 hostname:compute-11-32.local pid:22892):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_server_register_region():1124 cond:(memhdl->memhndl!=((void *)0))
Last System Error Message from Task 0:: Cannot allocate memory
application called MPI_Abort(comm=0x84000003, 1) - process 24
application called MPI_Abort(comm=0x84000003, 1) - process 12
application called MPI_Abort(comm=0x84000003, 1) - process 0
rank 24 in job 11  i11-32_41520   caused collective abort of all ranks
  exit status of rank 24: killed by signal 9 



But if I set ARMCI_DEFAULT_SHMMAX to 8192 or less, it does not run!

As a result, I still haven't found the bottleneck.
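
The two constraints named in the warning quoted above (a value between 1 and 8192 MB that is also a power of two) can be checked before submitting a job. The following is only a minimal sketch: the function name is_valid_armci_shmmax is hypothetical, and the limits are taken from the warning text, not from the ARMCI source.

```shell
# Hypothetical helper: checks a candidate value against the constraints
# quoted in the ARMCI warning (1 <= value <= 8192 MB, and a power of two).
is_valid_armci_shmmax() {
    v=$1
    # Reject anything that is not a plain non-negative integer.
    case "$v" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Range check taken from the warning text.
    [ "$v" -ge 1 ] && [ "$v" -le 8192 ] || return 1
    # A power of two has exactly one bit set, so v & (v - 1) is zero.
    [ $(( v & (v - 1) )) -eq 0 ]
}
```

With this sketch, "is_valid_armci_shmmax 36000" returns a non-zero status (36000 is above 8192 and not a power of two), while "is_valid_armci_shmmax 2048" returns zero.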



Quote:Edoapra Nov 2nd 11:17 am
Psd,
Your calculations are likely crashing while creating shared memory segments.
If you set the environment variable ARMCI_DEFAULT_SHMMAX to a value of 2048 (or larger),
you should be able to overcome this problem.
Please keep in mind that ARMCI_DEFAULT_SHMMAX (given in megabytes) has to be less than or
equal to the kernel parameter kernel.shmmax (given in bytes).
(Only root can change kernel.shmmax, so you might have to ask the system
administrator to do it.)
For example, if the value of kernel.shmmax is 4294967296, as in the example below,
ARMCI_DEFAULT_SHMMAX can be at most 4096 (4294967296 = 4096*1024*1024).

$ sysctl kernel.shmmax
kernel.shmmax = 4294967296
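
The arithmetic in the example above can be sketched as a small shell function that converts a kernel.shmmax value in bytes to the corresponding upper bound in megabytes. The function name shmmax_bytes_to_mb is hypothetical and only illustrates the division by 1024*1024.

```shell
# Hypothetical helper: convert a kernel.shmmax value (bytes) to whole megabytes.
shmmax_bytes_to_mb() {
    echo $(( $1 / 1024 / 1024 ))
}

# Value taken from the sysctl output above.
shmmax_bytes_to_mb 4294967296    # prints 4096
```

On a live system the argument could come from "sysctl -n kernel.shmmax", which prints only the value, assuming that value fits in the shell's integer range. In csh the result would then be applied with "setenv ARMCI_DEFAULT_SHMMAX 4096".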

Cheers, Edo