vrange nnbf - Out of memory events?


Hey there,

I recently ran into a problem with some MP2 calculations that use several nodes on our HPC cluster. The calculation runs for a while and then terminates without any real error code, just like this:




   NWChem MP2 Semi-direct Energy/Gradient Module



BCN_sAz_TS12_lv3 IDA


Basis functions = 682
Molecular orbitals = 675
Frozen core = 11
Frozen virtuals = 0
Active alpha occupied = 30
Active beta occupied = 30
Active alpha virtual = 634
Active beta virtual = 634
Use MO symmetry = F
Use skeleton AO sym = F

AO/Fock/Back tols = 1.0D-09 1.0D-09 1.0D-09

GA uses MA = F GA memory limited = T

Available:
local mem= 2.01D+08
global mem= 2.01D+08
local disk= 1.67D+11
mp2_memr nvloc 8
nvloc new 8
1 passes of 30: 2914906 710274 98697852.

Semi-direct pass number 1 of 1 for RHF alpha+beta at 3341.8s
vrange nnbf 170380
srun: error: n15-006: task 2: Killed
srun: Terminating job step 4189881.0
srun: error: n15-007: task 17: Killed
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: n15-028: task 65: Killed
srun: error: n15-015: task 62: Killed
srun: error: n15-013: tasks 32-47: Killed
srun: error: n15-028: tasks 64,66-79: Killed
srun: error: n15-007: tasks 16,18-31: Killed
srun: error: n15-015: tasks 48-61,63: Killed
slurmstepd: error: *** STEP 4189881.0 ON n15-006 CANCELLED AT 2016-09-08T16:41:28 ***
srun: error: n15-006: tasks 0-1,3-15: Killed




What does " vrange nnbf 170380" mean?

I get out-of-memory notifications for those jobs, like this:



 jobid: 4188259
n14-049: kernel Killed process 28251, UID 71897, (nwchem) total-vm 6362392kB, anon-rss 4213752kB, file-rss 137892kB
jobid: 4188258
n07-080: kernel Killed process 31758, UID 71897, (nwchem) total-vm 6571668kB, anon-rss 4089672kB, file-rss 138132kB
n07-080: kernel Killed process 31747, UID 71897, (nwchem) total-vm 6570100kB, anon-rss 4396084kB, file-rss 137876kB
n07-082: kernel Killed process 2139, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055784kB, file-rss 138008kB
n07-082: kernel Killed process 2177, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055788kB, file-rss 137676kB
n07-082: kernel Killed process 2149, UID 71897, (nwchem) total-vm 6509952kB, anon-rss 4349100kB, file-rss 137732kB



This is where my problem is. I'm no IT guy, so I don't fully understand what these messages mean. I know they mean that nwchem tried to use more memory than was available. The messages report about 6.5 GB in total (total-vm) per process, and I guess that is the problem? I run the job on 5 nodes with 16 cores each and 16 tasks per node (so no multithreading; nwchem also reports nproc = 80, which is, I guess, the number of processors). Each node has 64 GB of RAM, and I use the "memory 3 GB" setting in nwchem, so it shouldn't use more than 48 GB per node in total? Does it run a separate process per core? The messages above list 3 processes on node n07-082, each taking up about 6.5 GB, which it shouldn't do.
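To make the numbers concrete, here is a rough sketch of the arithmetic in Python. It parses the kernel messages pasted above and compares the memory I planned for (16 tasks times 3 GB) with a per-node estimate. It rests on assumptions: that all 16 tasks on a node have a resident size (anon-rss plus file-rss) similar to the killed processes, and that GB means 1024^2 kB. The names in it are only for illustration.

```python
import re

# Kernel OOM lines copied from the notifications above.
OOM_LINES = [
    "n14-049: kernel Killed process 28251, UID 71897, (nwchem) total-vm 6362392kB, anon-rss 4213752kB, file-rss 137892kB",
    "n07-080: kernel Killed process 31758, UID 71897, (nwchem) total-vm 6571668kB, anon-rss 4089672kB, file-rss 138132kB",
    "n07-080: kernel Killed process 31747, UID 71897, (nwchem) total-vm 6570100kB, anon-rss 4396084kB, file-rss 137876kB",
    "n07-082: kernel Killed process 2139, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055784kB, file-rss 138008kB",
    "n07-082: kernel Killed process 2177, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055788kB, file-rss 137676kB",
    "n07-082: kernel Killed process 2149, UID 71897, (nwchem) total-vm 6509952kB, anon-rss 4349100kB, file-rss 137732kB",
]

FIELDS = re.compile(r"total-vm (\d+)kB, anon-rss (\d+)kB, file-rss (\d+)kB")


def parse_kb(line):
    """Return (total_vm, anon_rss, file_rss) in kB for one OOM line."""
    match = FIELDS.search(line)
    if match is None:
        raise ValueError(f"not an OOM line: {line!r}")
    return tuple(int(value) for value in match.groups())


RANKS_PER_NODE = 16        # tasks per node, from my job settings
NODE_RAM_GB = 64           # RAM per node
MEMORY_DIRECTIVE_GB = 3    # "memory 3 GB" in the nwchem input
KB_PER_GB = 1024 ** 2      # assumption: binary units

# What I expected the job to need on each node.
planned_gb = RANKS_PER_NODE * MEMORY_DIRECTIVE_GB

# Resident memory (anon-rss + file-rss) of each killed process, in kB.
resident_kb = [anon + file for _, anon, file in map(parse_kb, OOM_LINES)]
mean_resident_gb = sum(resident_kb) / len(resident_kb) / KB_PER_GB

# Assumption: every task on the node looks like the killed ones.
estimated_node_gb = mean_resident_gb * RANKS_PER_NODE

print(f"planned per node:   {planned_gb} GB")
print(f"resident per task:  {mean_resident_gb:.2f} GB")
print(f"estimated per node: {estimated_node_gb:.1f} GB of {NODE_RAM_GB} GB")
```

Under these assumptions, each killed process was holding roughly 4 GB of resident memory rather than 3 GB, and 16 of them come out above both the 48 GB I planned for and the 64 GB the node has. The 6.5 GB figure is total-vm (virtual address space), which is not the same as the memory actually in use.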

Now the really strange thing is that this only happens for the smaller structures. Bigger structures (about 50% more heavy atoms) with the same settings don't show this problem at all. And even for the smaller ones, the same input might work without a problem if tried again.

Maybe someone can help me. I wanted to ask here first before talking to IT.

Thanks a lot.