5:26:04 AM PST - Tue, Dec 18th 2012

Hi,
I'm running coupled cluster calculations on a large cluster machine with 12 cores and 22 GB of usable memory per node. Occasionally these runs die with ARMCI DASSERT fail errors, so I suspect something is wrong with my shared memory settings.
Based on previous discussions on the forum, I always set the following in my job scripts:
ulimit -s unlimited                # remove the per-process stack size limit
export ARMCI_DEFAULT_SHMMAX=2096   # ARMCI shared-memory segment limit, in MB
unset MA_USE_ARMCI_MEM             # don't allocate local MA memory out of ARMCI shared memory
But I'm unsure how to choose the ARMCI_DEFAULT_SHMMAX value properly and how it relates to the memory per node, how many of the node's cores I actually use (the degree of underpopulation), the amount of shared (global) memory I request in my input file, and the kernel shmmax (found using cat /proc/sys/kernel/shmmax), which is 68719476736 bytes in my case. For example, I found in a previous post somewhere that ARMCI_DEFAULT_SHMMAX should be no larger than the kernel shmmax, which, since ARMCI_DEFAULT_SHMMAX is given in MB, would suggest I could set it as high as 65536 (which in itself feels very large).
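In case it helps, here is the arithmetic I'm currently assuming, written out as a shell sketch. The 1000 MB of global memory per process is just a placeholder for whatever my input file actually requests, and the capping rule is my own guess rather than anything I've seen documented:

# Sketch of my current understanding; numbers are for my 12-core / 22 GB nodes.
KERNEL_SHMMAX_BYTES=$(cat /proc/sys/kernel/shmmax)          # 68719476736 here
KERNEL_SHMMAX_MB=$(( KERNEL_SHMMAX_BYTES / 1024 / 1024 ))   # = 65536 MB

NODE_MEM_MB=$(( 22 * 1024 ))    # usable memory per node
PROCS_PER_NODE=12               # fewer when I underpopulate the node

GLOBAL_MB_PER_PROC=1000         # placeholder: the "global" value from my input file

# my guess: the shared segment should hold every process's global memory,
# but must not exceed the kernel limit or the physical memory on the node
WANT_MB=$(( GLOBAL_MB_PER_PROC * PROCS_PER_NODE ))
[ "$WANT_MB" -gt "$KERNEL_SHMMAX_MB" ] && WANT_MB=$KERNEL_SHMMAX_MB
[ "$WANT_MB" -gt "$NODE_MEM_MB" ] && WANT_MB=$NODE_MEM_MB

export ARMCI_DEFAULT_SHMMAX=$WANT_MB

Is this roughly the right way to think about it, or does the degree of underpopulation enter the picture differently?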
It would be great if someone could tell me how to set ARMCI_DEFAULT_SHMMAX properly and how it relates to the above-mentioned quantities.
Thanks in advance,
Martijn