Let's start at the beginning:
A. The memory keyword in the input specifies the memory per process (generally one process per processor), NOT per job.
Hence, if you specify "memory total 22 gb" with 8 processors on one node, you are asking for 8 x 22 = 176 Gbyte on that one node to make this job run.
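To make the arithmetic in point A concrete, here is a small sketch. The function name node_memory_gb is just something I made up for illustration; it is not part of NWChem.

```python
def node_memory_gb(per_process_gb, procs_on_node):
    """Physical memory one node must supply when every process
    running on it asks for per_process_gb."""
    return per_process_gb * procs_on_node

# "memory total 22 gb" with 8 processes on one node
print(node_memory_gb(22, 8))  # 176
```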
B. When you specify "memory total xxx mb", the amount xxx gets split up into 25% heap, 25% stack, and 50% global.
Heap: For most applications the heap is not important and can be a much smaller block of memory. When we set it explicitly, we generally use 100 mb at most.
Stack: Effectively the local memory each processor uses for its calculations.
Global: Memory used to store arrays that are globally accessible. Every process on a node contributes its share, so a node effectively holds a block of <size global> times <# of processors used on node>, which can get very big.
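As a rough illustration of the split in point B, the sketch below applies the 25/25/50 rule to a total and shows how big the combined global block on a node becomes. The helper split_total and the 4000 mb figure are hypothetical, and the exact rounding the code applies internally may differ from this.

```python
def split_total(total_mb, procs_on_node):
    """Apply the 25% heap / 25% stack / 50% global split described above
    to a per-process total, and size the global block for the whole node."""
    heap_mb = total_mb // 4
    stack_mb = total_mb // 4
    global_mb = total_mb - heap_mb - stack_mb  # the remaining half
    return {
        "heap_mb": heap_mb,
        "stack_mb": stack_mb,
        "global_mb": global_mb,
        # every process on the node contributes its global share
        "global_block_on_node_mb": global_mb * procs_on_node,
    }

# Hypothetical input: "memory total 4000 mb" with 8 processes on the node
print(split_total(4000, 8))
```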
C. When specifying memory explicitly, I recommend you use the format:
memory heap 100 mb stack 1000 mb global 2400 mb
The example here makes 3500 mb (3.5 Gbyte) available per processor and requires 3.5 Gbyte times the # of processors running on the node to be physically available. You cannot use virtual memory. You also need to leave space for the OS, so we use the above example when we have 8 processors and 32 Gbyte of memory per node: 8 x 3.5 = 28 Gbyte for the calculation, leaving about 4 Gbyte for the OS.
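If you want to check a setting against a node before submitting, something like the following sketch works. The function node_headroom_mb is hypothetical, and I am assuming 1 Gbyte = 1024 mb for the node size; adjust if your system reports memory differently.

```python
def node_headroom_mb(heap_mb, stack_mb, global_mb, procs_on_node, node_mb):
    """Memory left on the node for the OS after every process has taken
    its heap + stack + global allocation. A negative result means the
    job does not fit in physical memory."""
    per_process_mb = heap_mb + stack_mb + global_mb
    return node_mb - per_process_mb * procs_on_node

# memory heap 100 mb stack 1000 mb global 2400 mb, 8 processes, 32 Gbyte node
headroom = node_headroom_mb(100, 1000, 2400, procs_on_node=8, node_mb=32 * 1024)
print(headroom)  # 4768
```

How much headroom the OS really needs depends on the system, so treat a small positive number with some suspicion.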
D. How much memory does the calculation need? The amount and distribution of stack and global memory needed depend strongly on the application. Generally an equal distribution works fine to start with. The code will indicate whether it ran out of local or global memory, and you can redistribute accordingly. For coupled cluster (TCE) calculations you will generally need more global than stack memory (the above example is a TCE-style input). Tiling is important for TCE, to reduce local memory requirements.
E. What about those pesky "ARMCI DASSERT fail" errors and ARMCI_DEFAULT_SHMMAX? On certain architectures ARMCI_DEFAULT_SHMMAX needs to be set so that one big block of global memory is generated per node (i.e. the global memory pieces of all processors on a node are combined into one big block) for faster access. Generally ARMCI_DEFAULT_SHMMAX should be set to <amount of global memory per process> times <# of processors used by calculation on node>. By the latter I mean the number of processors you are actually using. If you only use 4 on a node, the multiplier is only 4.
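The rule in point E can be sketched as below. The function shmmax_for_node is a made-up name, and the result is in the same units as the global setting you feed in (mb here). I believe the environment variable itself is given in megabytes, but treat that as an assumption and check the documentation for your ARMCI build.

```python
def shmmax_for_node(global_mb_per_process, procs_used_on_node):
    """Global memory per process times the number of processes
    actually running on the node (not the number of cores it has)."""
    return global_mb_per_process * procs_used_on_node

# global 2400 mb per process
print(shmmax_for_node(2400, 8))  # 19200, all 8 processors in use
print(shmmax_for_node(2400, 4))  # 9600, only 4 of them in use
```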
Hope this helps,
Bert