Is there a systematic way of finding out how much memory is needed?


Clicked A Few Times
Dear nwchem users,
I'm using NWChem on an InfiniBand cluster and struggling with memory problems when running TDDFT. The input is:

Title "dye2Nex"
Start dye2Nex
set fock:replicated logical .false.

permanent_dir /Data/Users/syesylevsky/QM/dye2/N
memory total 400 mb

echo
charge 0

geometry noautosym units angstrom
C     0.00000     0.00000     0.00000
C 1.36800 0.00000 0.00000
C -0.774000 1.26900 0.00000
C 0.0560000 2.48300 0.00300000
O 2.11800 1.15900 -0.00500000
O -0.652000 -1.20200 0.00400000
C 2.28500 3.49300 0.00900000
C 1.70000 4.74800 0.0160000
C 0.309000 4.88600 0.0130000
C -0.507000 3.76700 0.00500000
O -1.99700 1.27200 0.00200000
C 1.45200 2.36000 0.00300000
H -1.58400 -1.02300 0.0550000
H 3.37500 3.38100 0.00500000
H 2.33400 5.64100 0.0240000
H -0.135000 5.88700 0.0160000
H -1.59900 3.87300 -0.00100000
C 2.22300 -1.17800 -0.00300000
C 4.14100 -2.24800 0.313000
O 3.55400 -0.999000 0.414000
C 5.46200 -2.57000 0.622000
C 5.82700 -3.89700 0.443000
C 4.91700 -4.85600 -0.0240000
C 3.60400 -4.52700 -0.330000
C 1.97000 -2.48200 -0.356000
C 3.20900 -3.20300 -0.158000
H 6.16900 -1.81700 0.984000
H 5.25600 -5.89000 -0.149000
H 2.89600 -5.27800 -0.693000
H 1.03900 -2.91400 -0.717000
H 6.85300 -4.20500 0.672000
end

ecce_print ecce.out

basis "ao basis" spherical print
H library "3-21G"
O library "3-21G"
C library "3-21G"
END

dft
 mult 1
XC b3lyp
iterations 5000
mulliken
direct
end

driver
 default
maxiter 2000
end

tddft
 nroots 3
target 1
end

task tddft optimize


When I run this, I get the following error:

2: error ival=5
(rank:2 hostname:mesocomte87 pid:9679):ARMCI DASSERT fail.
../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193
cond:(pdscr->status==IBV_WC_SUCCESS)
1: error ival=10
(rank:1 hostname:mesocomte65 pid:18582):ARMCI DASSERT fail.
../../ga-5-1/armci/src/devices/openib/openib.c:armci_send_complete():459
cond:(pdscr->status==IBV_WC_SUCCESS)
5: error ival=10
(rank:5 hostname:mesocomte19 pid:20956):ARMCI DASSERT fail.
../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193
cond:(pdscr->status==IBV_WC_SUCCESS)
0:Terminate signal was sent, status=: 15
(rank:0 hostname:mesocomte21 pid:30562):ARMCI DASSERT fail.
../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0


As advised on this forum, I set
export ARMCI_DEFAULT_SHMMAX=2048
but this did not help. I spent a lot of time trying different memory values and finally got it working with

memory stack 150 mb heap 50 mb global 200 mb

but this was blind guesswork, which I really don't want to repeat for every new system or basis set.

EDIT: it crashed after a few hours. I still can't get it running.

Is there a good systematic way of finding out how much memory a particular job needs to run normally in a parallel environment? Which diagnostic messages should I use for this?

Thank you very much in advance!

Semen

Clicked A Few Times
I had some related problems recently. How much memory do you have on your system? Try increasing total memory drastically?

For an 8 processor job, I use:

memory total 22 gb

Clicked A Few Times
In principle I can ask for up to 12 GB per process, but then the job will stay in the queue forever (it will saturate the nodes completely and get very low priority). My objective is to allocate just enough to get it running while keeping the waiting time reasonable. My molecule is rather small and on 1 CPU it runs in under 1 GB of memory, but I can't understand how to estimate memory consumption in parallel mode.

Forum Vet
How memory allocation works in NWChem
Let's start at the beginning:

A. The memory keyword in the input specifies the memory per process, generally per processor and NOT per job.

Hence, if you specify "memory total 22 gb" with 8 processors on one node, you are asking for 176 gbyte on that node to make this job run.

B. When you specify "memory total xxx mb", the amount xxx gets split up into 25% heap, 25% stack, and 50% global.

 Heap: For most applications heap is not important and could be a much smaller block of memory. Generally we set this to 100 mb at most if we specify explicitly.

 Stack: Effectively your local memory for each processor to use for the calculations.

 Global: Memory used to store arrays that are globally accessible. Effectively each node holds a block of <size global> times <# of processors used on node>, which can get very big.
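
As a rough illustration of point B, the default split can be worked out by hand or with a few lines of Python. This is only a sketch: the function name is made up for this example, and the 400 mb and 4000 mb totals are the values used elsewhere in this thread. It prints the explicit directive that would correspond to a given total under the 25/25/50 rule described above.

```python
def split_total(total_mb):
    """Apply the 25% heap / 25% stack / 50% global split described in point B.

    split_total is a hypothetical helper for this example, not part of NWChem.
    """
    return {
        "heap": total_mb * 25 // 100,
        "stack": total_mb * 25 // 100,
        "global": total_mb * 50 // 100,
    }


for total in (400, 4000):
    parts = split_total(total)
    # Same layout as the explicit directive recommended in point C.
    print("memory total %d mb  ->  memory heap %d mb stack %d mb global %d mb"
          % (total, parts["heap"], parts["stack"], parts["global"]))
```

Writing the split out explicitly makes it easier to see which of the three pools to grow when the code reports that local or global memory ran out.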

C. Specifying memory explicitly, I recommend you use the format:

   memory heap 100 mb stack 1000 mb global 2400 mb

The example here makes 3500 mb (3.5 Gbyte) available per processor and requires 3.5 Gbyte times the # of processors running on the node to be physically available. You cannot use virtual memory. You also need to leave space for the OS, so we use the above example when we have 8 processors and 32 gbyte of memory per node.
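
The per-node arithmetic behind points A and C can be checked with a short script. This is a sketch under stated assumptions: the helper names are invented for the example, and the node size is converted with 1 gbyte = 1024 mb, which may not match how a given cluster reports its memory.

```python
def per_process_mb(heap_mb, stack_mb, global_mb):
    """Memory one process asks for: the three pools added together."""
    return heap_mb + stack_mb + global_mb


def per_node_mb(per_proc_mb, procs_on_node):
    """Physical memory the node must provide (the memory keyword is per process)."""
    return per_proc_mb * procs_on_node


# Point C: memory heap 100 mb stack 1000 mb global 2400 mb, 8 processes per node.
per_proc = per_process_mb(100, 1000, 2400)
needed = per_node_mb(per_proc, 8)
node_ram_mb = 32 * 1024          # assumption: 1 gbyte = 1024 mb
headroom = node_ram_mb - needed  # what is left for the OS and everything else

print("per process:", per_proc, "mb")
print("per node:   ", needed, "mb")
print("headroom:   ", headroom, "mb")

# Point A: "memory total 22 gb" with 8 processes on one node, in gbyte.
print("22 gb x 8 processes =", per_node_mb(22, 8), "gbyte")
```

If the headroom comes out negative or close to zero, the request does not fit on the node, since virtual memory cannot be used.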

D. How much memory does the calculation need? The amount and distribution of stack and global needed is strongly dependent on the application. Generally an equal distribution works fine to start with. The code will indicate if it runs out of local or global memory, and you can redistribute. For coupled cluster (TCE) calculations you will generally need more global than stack memory (above example is a TCE style input). Tiling is important for TCE, to reduce local memory requirements.

E. What about those pesky "ARMCI DASSERT fail" errors and ARMCI_DEFAULT_SHMMAX? On certain architectures ARMCI_DEFAULT_SHMMAX needs to be set to generate one big block of global memory per node (i.e. combine all the global memory pieces of each processor on a node into one big block) for faster access. Generally ARMCI_DEFAULT_SHMMAX should be set to <amount of global memory per process> times <# of processors used by calculation on node>. By the latter I mean the number of processors you are actually using. If you only use 4 on a node, the multiplier is only 4.
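
The rule in point E is a single multiplication, sketched below. The helper name is hypothetical; the 200 mb global value is the one from the "memory stack 150 mb heap 50 mb global 200 mb" line earlier in the thread, and 4 processes per node is the illustrative count from point E.

```python
def shmmax_from_rule(global_mb_per_process, procs_used_on_node):
    """Point E: global memory per process times processes actually used on the node."""
    return global_mb_per_process * procs_used_on_node


value_mb = shmmax_from_rule(200, 4)
print("rule of thumb gives ARMCI_DEFAULT_SHMMAX =", value_mb, "mb")
```

Note that the warning quoted later in this thread says the value should lie within <1,8192> mb and be a power of two, so the raw product from this rule may not be accepted as is.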

Hope this helps,

Bert

Clicked A Few Times
Thanks for correcting my mistake, Bert.

Is there a reason why you break up memory this way (heap 100 mb/stack 1000 mb/global 2400 mb), instead of the 25-25-50% by default?

Clicked A Few Times
Hi Bert!

Thanks for your post, but I still have a question about the ARMCI_DEFAULT_SHMMAX.

Suppose I use 2 nodes with 16 cores per node and 4 GB of memory per core, and I specify:
memory heap 100 mb stack 400 mb global 3200 mb

That is to say, I use 3200 MB * 16 = 51200 MB of global memory on each node.
If I set
setenv ARMCI_DEFAULT_SHMMAX 51200

this warning comes out:
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=51200
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=51200

Do you know what the problem is?

Thank you!



Forum Vet
Simply because most codes do not use that much stack memory, so it would be wasted.

Bert

Quote:Andrew.yeung Nov 14th 11:27 pm
Thanks for correcting my mistake, Bert.

Is there a reason why you break up memory this way (heap 100 mb/stack 1000 mb/global 2400 mb), instead of the 25-25-50% by default?

Forum Vet
Yes, the code right now has some internal limits. Hence, you cannot set it to more than 8000 mb, mainly because this limit was chosen when nodes had fewer cores. I'll look at having this updated and tested.

I would suggest you do not set the stack that small if you want to run coupled cluster calculations; they will be more expensive, as you are forced to use smaller blocks.

Bert
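
The two conditions stated in the warning text (a value within <1,8192> mb, and a power of two) can be checked before submitting a job. The sketch below only mirrors the wording of that warning; the function names are hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... using the usual bit trick."""
    return n > 0 and (n & (n - 1)) == 0


def check_shmmax(value_mb, upper_mb=8192):
    """Return a list of problems, following the two conditions in the warning.

    upper_mb=8192 is taken from the warning text; the inclusive range is assumed.
    """
    problems = []
    if not 1 <= value_mb <= upper_mb:
        problems.append("outside 1..%d mb" % upper_mb)
    if not is_power_of_two(value_mb):
        problems.append("not a power of two")
    return problems


for candidate in (51200, 2048, 8192):
    print(candidate, check_shmmax(candidate) or "ok")
```

For the 51200 value from the question both checks fail, while 2048 (the value used at the start of the thread) passes both.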


Clicked A Few Times
I understand things in theory, but in practice I still can't get it working. Currently I have

memory total 4000 mb

It runs for a few hours and then fails. The end of the log is the following:


           Memory Information
------------------
Available GA space size is 524244319 doubles
Available MA space size is 65513497 doubles
Length of a trial vector is 9864
Algorithm : Incore multiple tensor contraction
Estimated peak GA usage is 182779852 doubles
Estimated peak MA usage is 6600 doubles

3 smallest eigenvalue differences (eV)


 No. Spin  Occ  Vir  Irrep   E(Vir)    E(Occ)   E(Diff)


   1    1   72   73 a        -0.071    -0.208     3.744
2 1 71 73 a -0.071 -0.239 4.578
3 1 70 73 a -0.071 -0.245 4.747



 Entering Davidson iterations
Restricted singlet excited states

 Iter   NTrls   NConv    DeltaV     DeltaE      Time   
---- ------ ------ --------- --------- ---------
0: error ival=-1
(rank:0 hostname:mesocomte68 pid:30430):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_server_rdma_strided_to_contig():3239 cond:(rc==0)


As far as I can see from the Memory Information block I have a lot of free memory, but it still fails. Could you please tell me what's wrong? I wonder what armci_server_rdma_strided_to_contig() is...

Forum Vet
Not clear; it seems to be related to the system. I would try to reduce the memory footprint. The output suggests you do not need that much memory in the first place.

From the numbers and info it looks like you are running on 2 processor cores, and each core is on a different node connected by IB? How many cores and how much memory do you have per node? You may be able to run this on a single node.

Bert
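
One way to reach a similar estimate from the Memory Information block is to convert the reported doubles into mb and compare them with the input. This is a sketch under several assumptions: a double is 8 bytes, "mb" means 1024*1024 bytes, the reported GA space is the sum over all processes, and "memory total 4000 mb" was split with the default 50% global share. The variable and function names are made up for the example.

```python
BYTES_PER_DOUBLE = 8
BYTES_PER_MB = 1024 * 1024  # assumption: "mb" means 1024*1024 bytes


def doubles_to_mb(n_doubles):
    """Convert a count of double-precision words into mb."""
    return n_doubles * BYTES_PER_DOUBLE / BYTES_PER_MB


# Values copied from the log in the post above.
ga_available = 524244319
ga_peak = 182779852
ma_available = 65513497
ma_peak = 6600

# "memory total 4000 mb" with the default 50% global share (assumed).
global_per_process_mb = 4000 * 0.50

ga_available_mb = doubles_to_mb(ga_available)
implied_processes = ga_available_mb / global_per_process_mb
ga_peak_fraction = ga_peak / ga_available

print("GA available:      %.0f mb" % ga_available_mb)
print("GA estimated peak: %.0f mb" % doubles_to_mb(ga_peak))
print("MA available:      %.0f mb" % doubles_to_mb(ma_available))
print("MA estimated peak: %.2f mb" % doubles_to_mb(ma_peak))
print("implied processes: %.2f" % implied_processes)
print("peak GA / available GA: %.0f%%" % (100 * ga_peak_fraction))
```

Under these assumptions the available GA space corresponds to about two processes' worth of global memory, and the estimated peak GA usage is roughly a third of what is available, which is consistent with the remark that the job does not need that much memory.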


Clicked A Few Times
It runs over IB, one core per node. Each node has at least 12 GB of RAM. I'll try to put it on a single node, but this is not what we normally want to do.

