NWCHEM fails to run on multiple nodes


How memory allocation works in NWChem
First of all, using the DDFLT_TOT_MEM environment variable and recompiling to set memory usage does not make sense. NWChem has a "memory" input keyword that allows you to define the amount of memory used by each processor during a simulation.

The shared memory is what the memory keyword in the input calls global memory. It is tied to the setting for ARMCI_DEFAULT_SHMMAX, which should be about the size of, or a little larger than, the total amount of shared memory used by all the cores on a node.

If no memory is specified per processor in the input, NWChem uses the precompiled default. Looking at the 259738112 reported by Mef362 (this is in doubles, 8 bytes each), Mef362 has about 2 Gbyte available per core. By default, this is split into 25% heap, 25% stack and 50% global, so about 1 Gbyte of global (shared) memory per processor.
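To make the arithmetic above concrete, here is a minimal Python sketch that converts a total given in doubles into Mbyte and applies the 25% / 25% / 50% default split. The function name `default_memory_split` is made up for illustration and is not part of NWChem; the sketch assumes 8 bytes per double and 1 Mbyte = 1024*1024 bytes.

```python
BYTES_PER_DOUBLE = 8       # assumption: 64-bit double precision
MB = 1024 * 1024           # assumption: binary megabytes


def default_memory_split(total_doubles):
    """Return total, heap, stack and global sizes in Mbyte for a total in doubles."""
    total_mb = total_doubles * BYTES_PER_DOUBLE / MB
    return {
        "total": total_mb,
        "heap": 0.25 * total_mb,    # 25% heap
        "stack": 0.25 * total_mb,   # 25% stack
        "global": 0.50 * total_mb,  # 50% global (shared)
    }


# Value reported in the thread, in doubles.
split = default_memory_split(259738112)
for name, size_mb in split.items():
    print(f"{name:>6}: {size_mb:8.1f} Mbyte")
```

The total comes out just under 2 Gbyte, and the global share just under 1 Gbyte, which matches the rounded figures quoted above.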

Now let's get back to ARMCI_DEFAULT_SHMMAX. If you have X cores running on a node, and for each core you specify the shared memory to be Y (this is the global memory in the input, which is per core), ARMCI_DEFAULT_SHMMAX should be set to X*Y, and this number should not exceed the shmmax set in the kernel. In the current released version there is a preset maximum allowed for ARMCI_DEFAULT_SHMMAX, which is 8 Gbyte. We will address this in the next release, or if you know how to code in C I can provide you with the code to change (bert.dejong@pnl.gov).
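As a rough illustration of the X*Y rule, the following Python sketch computes the ARMCI_DEFAULT_SHMMAX value (in Mbyte) for a node and checks it against the 8 Gbyte preset maximum. The helper name and the core counts are hypothetical, and the sketch assumes that 8 Gbyte means 8*1024 Mbyte.

```python
# Assumption: the 8 Gbyte preset maximum corresponds to 8 * 1024 Mbyte.
ARMCI_PRESET_MAX_MB = 8 * 1024


def armci_shmmax_for_node(cores_per_node, global_mb_per_core):
    """Return (X*Y in Mbyte, whether it is within the preset maximum)."""
    needed_mb = cores_per_node * global_mb_per_core
    return needed_mb, needed_mb <= ARMCI_PRESET_MAX_MB


# Hypothetical core counts, each with 500 Mbyte of global memory per core.
for cores in (8, 16, 24):
    needed_mb, within_cap = armci_shmmax_for_node(cores, 500)
    status = "ok" if within_cap else "exceeds the 8 Gbyte preset maximum"
    print(f"{cores:2d} cores x 500 Mbyte = {needed_mb} Mbyte ({status})")
```

With 500 Mbyte of global memory per core, 16 cores still fit under the cap, while 24 cores would need either less global memory per core or the code change mentioned above.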

So, for Mef362 I would recommend:

1. In the input use:
   
memory heap 100 mb stack 500 mb global 500 mb

2. Set ARMCI_DEFAULT_SHMMAX to 8092

From a hardware point of view, you also need to make sure that the system parameters have been set to use shared memory segments that are 8 Gbyte in size.

ARMCI_DEFAULT_SHMMAX has to be less than or equal to kernel.shmmax.

For example, if the value of kernel.shmmax is 4294967296 (in bytes) as in the example below,
ARMCI_DEFAULT_SHMMAX (in Mbyte) can be at most 4096, since 4294967296 = 4096*1024*1024.

$ sysctl kernel.shmmax
kernel.shmmax = 4294967296

Hence, make sure that your kernel.shmmax is at least 8092*1024*1024.
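The unit conversion between kernel.shmmax (bytes) and ARMCI_DEFAULT_SHMMAX (Mbyte) can be sketched in Python as follows. The function names are invented for this example; reading /proc/sys/kernel/shmmax is an assumption that holds on Linux, where it reports the same value as `sysctl kernel.shmmax`.

```python
import os

MB = 1024 * 1024
SHMMAX_PATH = "/proc/sys/kernel/shmmax"  # assumption: Linux procfs is available


def max_armci_shmmax_mb(kernel_shmmax_bytes):
    """Largest ARMCI_DEFAULT_SHMMAX (Mbyte) allowed by a kernel.shmmax in bytes."""
    return kernel_shmmax_bytes // MB


def min_kernel_shmmax_bytes(armci_shmmax_mb):
    """Smallest kernel.shmmax (bytes) that accommodates a given ARMCI_DEFAULT_SHMMAX."""
    return armci_shmmax_mb * MB


def fits(armci_shmmax_mb, kernel_shmmax_bytes):
    """True if ARMCI_DEFAULT_SHMMAX is less than or equal to kernel.shmmax."""
    return armci_shmmax_mb <= max_armci_shmmax_mb(kernel_shmmax_bytes)


print(max_armci_shmmax_mb(4294967296))   # 4096, as in the sysctl example
print(min_kernel_shmmax_bytes(8092))     # 8485076992 bytes needed for 8092

# Optionally compare against the running kernel's setting.
if os.path.exists(SHMMAX_PATH):
    with open(SHMMAX_PATH) as handle:
        current = int(handle.read().strip())
    print("ARMCI_DEFAULT_SHMMAX=8092 fits:", fits(8092, current))
```

If the last check reports False, kernel.shmmax has to be raised by the system administrator before the recommended setting will work.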

Bert