Setting ARMCI DEFAULT SHMMAX properly


Clicked A Few Times
Hi,

I'm running coupled-cluster calculations on a large cluster machine that has 12 cores and 22 GB of usable memory per node. Occasionally I run into ARMCI DASSERT fail type errors, and I think there might be something wrong with my shared memory settings.

Based on previous discussions on the forum I always set the following in my script files:

ulimit -s unlimited

export ARMCI_DEFAULT_SHMMAX=2096

unset MA_USE_ARMCI_MEM

But I'm unsure how to set the ARMCI_DEFAULT_SHMMAX value properly and how it relates to the memory per node, how many cores on the node I actually use (the degree of underpopulation), the amount of shared memory I request in my input file, and the value of the kernel shmmax (found using cat /proc/sys/kernel/shmmax), which is 68719476736 in my case. For example, I found in a previous post somewhere that ARMCI_DEFAULT_SHMMAX should be larger than shmmax, which would suggest I need to set ARMCI_DEFAULT_SHMMAX to 65536 (which in itself feels very large).

It would be great if someone could tell me how to set ARMCI_DEFAULT_SHMMAX properly and how it relates to the properties mentioned above.

Thanks in advance,

Martijn

Forum Vet
Size
ARMCI_DEFAULT_SHMMAX should be set to about the same size as, or a little larger than, the amount of shared memory that is being used.

If you have X cores running on a node, and for each core you specify the shared memory to be Y (this is the global memory in the input, which is per core), your ARMCI_DEFAULT_SHMMAX should be set to X*Y, and this number should be smaller than the shmmax set in the kernel. This should not be an issue as you have only 22 Gbyte per node and the shmmax is way larger than that.

I would say that for your system you have between 1.5 and 1.6 GByte available per core. If you do not explicitly specify heap, stack, and global, but only a total, half of that total goes to global shared memory.
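The sizing rule above can be sketched in a few lines of a job script. This is only a sketch using the numbers discussed in this thread (12 cores, 900 MB of global memory per core); the variable names are my own, and the only real assumptions are that ARMCI_DEFAULT_SHMMAX is given in megabytes while the kernel shmmax is reported in bytes:

```shell
# Cores actually in use on each node, and the "global" memory per core
# from the NWChem MEMORY directive (both example values, in MB).
CORES_PER_NODE=12
GLOBAL_MB_PER_CORE=900

# Kernel per-segment shared-memory limit, in bytes (fallback: the value
# reported earlier in this thread).
KERNEL_SHMMAX_BYTES=$(cat /proc/sys/kernel/shmmax 2>/dev/null || echo 68719476736)

# Rule of thumb from this thread: cores per node times global MB per core.
ARMCI_DEFAULT_SHMMAX=$(( CORES_PER_NODE * GLOBAL_MB_PER_CORE ))

# Sanity check: the per-node shared-memory request (converted to bytes)
# should stay below the kernel limit.
if [ $(( ARMCI_DEFAULT_SHMMAX * 1024 * 1024 )) -gt "$KERNEL_SHMMAX_BYTES" ]; then
    echo "warning: requested shared memory exceeds kernel shmmax" >&2
fi

export ARMCI_DEFAULT_SHMMAX
echo "$ARMCI_DEFAULT_SHMMAX"
```

With the example numbers this prints 10800, i.e. the 12*900 value used later in the thread.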

Bert

Clicked A Few Times
Thank you very much Bert.

I've tried this for an EOM-CCSDT calculation with a def2-SV(P) basis set, but after a significant number of iterations the job dies with the following error message:

Iteration 18 using 54 trial vectors
72: error ival=4
(rank:72 hostname:red0501 pid:11964):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
0:Terminate signal was sent, status=: 15
(rank:0 hostname:red0050 pid:9814):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0

I run the calculation on 10 nodes with 12 cores and 22 GB of memory each (i.e. > 1800 MB per core). I set the memory in the input file as "MEMORY stack 800 mb heap 100 mb global 900 mb" and ARMCI_DEFAULT_SHMMAX in the script via "export ARMCI_DEFAULT_SHMMAX=10800". The SHMMAX value is, as you suggested, equal to the number of cores per node times the global memory per core (12*900).

So if we have declared enough memory for the global memory using ARMCI_DEFAULT_SHMMAX, why am I still running into an ARMCI error? Does it simply mean the amount of memory is not sufficient, or would I get a different error message in that case?

Thanks for your (further) help (in advance),

Martijn

Clicked A Few Times
Further to the above, I run into exactly the same problem for a calculation on a slightly larger system. Here I use 20 nodes with 6 of the 12 cores per node in use and the other 6 idle (i.e. I underpopulate the nodes). I set the memory in the input file to "MEMORY stack 1000 mb heap 100 mb global 2550 mb" and ARMCI_DEFAULT_SHMMAX in the run file to 15300. This time the EOM-CCSDT calculation stops in the ground-state CCSDT calculation, but with the same error message:

CCSDT iterations
--------------------------------------------------------
Iter Residuum Correlation Cpu Wall
--------------------------------------------------------
1 1.6034927654134 -1.1164830885139 290.1 287.2
84: error ival=4
(rank:84 hostname:red0627 pid:21926):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
0: error ival=10
(rank:0 hostname:red0003 pid:28395):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)

Ironically, a calculation with ARMCI_DEFAULT_SHMMAX set to 4096 (i.e. with much less memory, less than the number of cores times the requested global memory per core) continues well beyond this point (although I'm fairly sure it will eventually crash).

Are ARMCI_DEFAULT_SHMMAX values restricted to certain special values? I noticed that all the examples given on the forum tend to be integer multiples of 1024.

Thanks,

Martijn

Clicked A Few Times
You are simply running out of memory. Including triples in your coupled-cluster calculations takes up huge amounts of RAM.

Clicked A Few Times
EOM-CCSDT is indeed very expensive, but I strongly believe this particular error message does not indicate that the calculation is "simply" running out of memory; it suggests another problem.

I think my hypothesis is best supported by the second set of calculations discussed above. There, the calculation with an ARMCI_DEFAULT_SHMMAX value of global memory per core times cores (15300) crashes during the first step of the ground-state CCSDT calculation, while the calculation with a much smaller ARMCI_DEFAULT_SHMMAX value (4096) successfully finishes the ground-state CCSDT calculation and 15 EOM-CCSDT steps before crashing, this time with another error message, probably because ARMCI_DEFAULT_SHMMAX was smaller than global memory times cores.


Forum >> NWChem's corner >> Running NWChem