Memory usage from EOM-CCSD log


Clicked A Few Times
Hi All,

I'm running EOM-CCSD on benzene, mostly to understand the memory and performance scaling systematically. Using the default TCE IO setting (in-memory, I believe) and 2 ranks (1 per physical node), I can run the job to completion on nodes with 256 GB, but not on nodes with 64 GB. The GA and local heap/stack statistics at the end of the job suggest about 56.6 GB in these 3 categories, which should fit alongside the OS in 64 GB, but I gather the in-memory files are not included in that figure.

So my question is, should I just look at these lines

 2-e file size   =       4002717440
...
 x1 file size   =             6316
...
 x2 file size   =         28284240


to get the virtual file memory usage? If so, what are these units?
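
For anyone wanting to tally these lines, here is a minimal Python sketch that pulls the "file size" values out of a log excerpt and converts them under two candidate interpretations. The units are not stated anywhere in this thread, so both readings (plain bytes, or 8-byte double-precision words) are assumptions, and the function names are made up for illustration.

```python
import re

# Excerpt copied from the log lines quoted above.
LOG_EXCERPT = """\
 2-e file size   =       4002717440
 x1 file size   =             6316
 x2 file size   =         28284240
"""

# Matches lines of the form "<name> file size = <integer>".
FILE_SIZE_RE = re.compile(
    r"^[ \t]*(\S+) file size[ \t]*=[ \t]*(\d+)[ \t]*$", re.MULTILINE
)


def parse_file_sizes(log_text):
    """Return {file name: reported size} for every 'file size' line."""
    return {name: int(size) for name, size in FILE_SIZE_RE.findall(log_text)}


def to_gib(count, bytes_per_unit):
    """Convert a reported count to GiB, given an ASSUMED unit width in bytes."""
    return count * bytes_per_unit / 2**30


sizes = parse_file_sizes(LOG_EXCERPT)
for name, count in sizes.items():
    # Both columns are hypothetical readings; the log does not state the unit.
    print(
        f"{name:>4}: {to_gib(count, 1):9.4f} GiB if bytes, "
        f"{to_gib(count, 8):9.4f} GiB if 8-byte words"
    )
```

Comparing the two columns against the observed resident memory of a running job would be one way to tell which reading is plausible.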

Also, are there other major consumers of memory that need to be accounted for when assessing NWChem TCE memory usage?

Thanks! - Chris

Essentials: NWChem 6.8 master branch source, Intel 2017.0.5 with MKL, CentOS 7.3 on x86-64, Mellanox IB (ARMCI_NETWORK=OPENIB).

Clicked A Few Times
I'm still stuck here, now at 16 ranks over 8 nodes. Each node has 64 GB, so I've set memory to

memory total 32000 mb global 30000 mb stack 500 mb


based on 8-rank usage with 1 rank/node, which was

heap 803 MB, stack 414 MB, global 13.2 GB, files 13.2 GB -> 27.6 GB/rank
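
The arithmetic behind that estimate can be written out as a short Python sketch. It assumes the per-rank figures measured at 8 ranks carry over unchanged to 16 ranks with 2 ranks per node, and that the memory directive applies per rank; neither is confirmed at this point in the thread. The variable and function names are made up for illustration.

```python
# Per-rank usage observed in the 8-rank run (1 rank/node), in GB.
per_rank_gb = {"heap": 0.803, "stack": 0.414, "global": 13.2, "files": 13.2}

NODE_MEMORY_GB = 64.0       # physical memory per node
RANKS_PER_NODE = 2          # 16 ranks over 8 nodes
DIRECTIVE_TOTAL_MB = 32000  # from "memory total 32000 mb ..."


def node_footprint_gb(per_rank, ranks_per_node):
    """Sum the per-rank categories and scale by the ranks sharing a node."""
    return sum(per_rank.values()) * ranks_per_node


per_rank_total = sum(per_rank_gb.values())
measured_node_total = node_footprint_gb(per_rank_gb, RANKS_PER_NODE)

# ASSUMPTION: the directive is a per-rank limit, so it scales with ranks/node.
# Whether a 64 GB node means 64000 MB or more is also left open here.
directive_node_total_mb = DIRECTIVE_TOTAL_MB * RANKS_PER_NODE

print(f"per rank:           {per_rank_total:.1f} GB")
print(f"per node (2 ranks): {measured_node_total:.1f} GB of {NODE_MEMORY_GB:.0f} GB")
print(f"headroom per node:  {NODE_MEMORY_GB - measured_node_total:.1f} GB")
print(f"directive per node: {directive_node_total_mb} MB")
```

Under these assumptions the measured footprint leaves some headroom on a 64 GB node, while the directive itself would allow two ranks to claim roughly the whole node.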

I also set ARMCI_DEFAULT_SHMMAX to 2048. However, I'm still getting:
0: error ival=4
(rank:0 hostname:n2149 pid:29429):ARMCI DASSERT fail. ../../ga-5.6.5/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
2: error ival=4
(rank:2 hostname:n2146 pid:19600):ARMCI DASSERT fail. ../../ga-5.6.5/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
...


FWIW, I don't understand this statement from the documentation:
Quote:

MEMORY
This is a start-up directive that allows the user to specify the amount of memory PER PROCESSOR CORE that NWChem can use for the job.

What if I'm using only 2 cores and no threads--then is it memory/rank, or still just total memory segment size divided by core count? Does threading affect this?

Any suggestions? Thanks

Clicked A Few Times
OK, all clear. Just had to jack ARMCI_DEFAULT_SHMMAX up, and set memory per-rank.

FYI, release version 6.8 left behind a 4 GB shared memory segment after the first part of a compound job that knackered the second part of the job. I could see and clear it by splitting the job into two pieces. Since ARMCI_DEFAULT_SHMMAX was set to 4096, I'm assuming it's probably GA-related. Still, something to know I suppose.

