vrange nnbf - Out of memory events?

8:20:51 AM PDT - Thu, Sep 8th 2016

Hey there,

I recently ran into a problem with some MP2 calculations that use several nodes on our HPC cluster. The calculation runs for a while and then terminates without any real error code, like this:

   NWChem MP2 Semi-direct Energy/Gradient Module

                              BCN_sAz_TS12_lv3 IDA


 Basis functions       =    682

 Molecular orbitals    =    675

 Frozen core           =     11

 Frozen virtuals       =      0

 Active alpha occupied =     30

 Active beta occupied  =     30

 Active alpha virtual  =    634

 Active beta virtual   =    634

 Use MO symmetry       = F

 Use skeleton AO sym   = F

 

 AO/Fock/Back tols     =   1.0D-09  1.0D-09  1.0D-09

 

GA uses MA = F    GA memory limited = T

   

Available: 

 local mem=  2.01D+08

global mem=  2.01D+08

local disk=  1.67D+11

 mp2_memr nvloc                      8

 nvloc new                      8

  1 passes of  30:        2914906                     710274                  98697852.



Semi-direct pass number   1 of   1  for RHF alpha+beta  at     3341.8s

 vrange nnbf                 170380

 srun: error: n15-006: task 2: Killed

 srun: Terminating job step 4189881.0

 srun: error: n15-007: task 17: Killed

 srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

 srun: error: n15-028: task 65: Killed

 srun: error: n15-015: task 62: Killed

 srun: error: n15-013: tasks 32-47: Killed

 srun: error: n15-028: tasks 64,66-79: Killed

 srun: error: n15-007: tasks 16,18-31: Killed

 srun: error: n15-015: tasks 48-61,63: Killed

 slurmstepd: error: *** STEP 4189881.0 ON n15-006 CANCELLED AT 2016-09-08T16:41:28 ***

 srun: error: n15-006: tasks 0-1,3-15: Killed

What does "vrange nnbf 170380" mean?

I also get out-of-memory notifications for these jobs, like this:

 jobid: 4188259

 n14-049: kernel Killed process 28251, UID 71897, (nwchem) total-vm 6362392kB, anon-rss 4213752kB, file-rss     137892kB

 jobid: 4188258

 n07-080: kernel Killed process 31758, UID 71897, (nwchem) total-vm 6571668kB, anon-rss 4089672kB, file-rss   138132kB

 n07-080: kernel Killed process 31747, UID 71897, (nwchem) total-vm 6570100kB, anon-rss 4396084kB, file-rss 137876kB

 n07-082: kernel Killed process 2139, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055784kB, file-rss 138008kB

 n07-082: kernel Killed process 2177, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055788kB, file-rss 137676kB

 n07-082: kernel Killed process 2149, UID 71897, (nwchem) total-vm 6509952kB, anon-rss 4349100kB, file-rss 137732kB
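
To make those kernel lines easier to compare, here is a minimal Python sketch that pulls out the node, PID and memory figures and converts them to GiB. The regular expression and the names parse_oom_lines and KIB_PER_GIB are my own, and I am assuming the "kB" in the kernel message means 1024 bytes.

```python
import re
from collections import defaultdict

# OOM-killer lines copied from the notifications above.
LOG = """\
n14-049: kernel Killed process 28251, UID 71897, (nwchem) total-vm 6362392kB, anon-rss 4213752kB, file-rss     137892kB
n07-080: kernel Killed process 31758, UID 71897, (nwchem) total-vm 6571668kB, anon-rss 4089672kB, file-rss   138132kB
n07-080: kernel Killed process 31747, UID 71897, (nwchem) total-vm 6570100kB, anon-rss 4396084kB, file-rss 137876kB
n07-082: kernel Killed process 2139, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055784kB, file-rss 138008kB
n07-082: kernel Killed process 2177, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055788kB, file-rss 137676kB
n07-082: kernel Killed process 2149, UID 71897, (nwchem) total-vm 6509952kB, anon-rss 4349100kB, file-rss 137732kB
"""

# Assumption: the kernel's "kB" is 1024 bytes, so 1024**2 of them make one GiB.
KIB_PER_GIB = 1024 ** 2

OOM_LINE = re.compile(
    r"(?P<node>\S+): kernel Killed process (?P<pid>\d+), UID \d+, "
    r"\((?P<name>[^)]+)\) total-vm\s+(?P<vm>\d+)kB, "
    r"anon-rss\s+(?P<anon>\d+)kB, file-rss\s+(?P<file>\d+)kB"
)


def parse_oom_lines(text):
    """Return one dict per OOM-killer line found in text."""
    found = []
    for line in text.splitlines():
        match = OOM_LINE.search(line)
        if match is None:
            continue  # skip anything that is not an OOM-killer line
        found.append({
            "node": match["node"],
            "pid": int(match["pid"]),
            "total_vm_gib": int(match["vm"]) / KIB_PER_GIB,
            "anon_rss_gib": int(match["anon"]) / KIB_PER_GIB,
            "file_rss_gib": int(match["file"]) / KIB_PER_GIB,
        })
    return found


records = parse_oom_lines(LOG)

# Group the killed processes by node to see how many were hit on each one.
by_node = defaultdict(list)
for rec in records:
    by_node[rec["node"]].append(rec)

for node, recs in by_node.items():
    for rec in recs:
        print(f"{node} pid {rec['pid']}: "
              f"total-vm {rec['total_vm_gib']:.2f} GiB, "
              f"anon-rss {rec['anon_rss_gib']:.2f} GiB")
```

As far as I understand, total-vm is the reserved virtual address space and anon-rss is the memory actually resident in RAM, so the second number is probably the more relevant one for the node's 64 GB.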

This is where I'm stuck. I'm no IT guy, so I don't fully understand these messages. As far as I can tell, they mean that nwchem tried to use more memory than was available. Each message reports a total-vm of about 6.5 GB for the killed process. Is that the problem? I run the job on 5 nodes with 16 cores and 64 GB of RAM each, using 16 tasks per node (so no multithreading; nwchem also reports nproc = 80, which I guess is the number of processors). With the "memory 3 GB" setting in nwchem, it shouldn't use more than 48 GB per node in total, should it? Does nwchem run a separate process per core? The messages above list three processes on node n07-082, each with a total-vm of about 6.5 GB, which shouldn't happen with that setting.
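
As a rough cross-check of that 48 GB estimate, the arithmetic can be sketched in a few lines of Python. It rests on assumptions the logs do not confirm: that all 16 processes on a node have a resident size similar to the killed ones, that the kernel's "kB" means 1024 bytes, and that GB and GiB can be treated loosely. The variable names are mine.

```python
# Job layout taken from the post.
PROCS_PER_NODE = 16      # tasks per node
NODE_RAM_GB = 64         # RAM per node
MEMORY_SETTING_GB = 3    # the "memory 3 GB" setting in the nwchem input

# anon-rss values (in kB) of the six killed processes listed above.
ANON_RSS_KB = [4213752, 4089672, 4396084, 4055784, 4055788, 4349100]

# Assumption: the kernel's "kB" is 1024 bytes.
KIB_PER_GIB = 1024 ** 2

# What the memory setting alone would suggest per node.
expected_per_node_gb = PROCS_PER_NODE * MEMORY_SETTING_GB

# Mean resident size of the killed processes.
mean_rss_gib = sum(ANON_RSS_KB) / len(ANON_RSS_KB) / KIB_PER_GIB

# Assumption: every process on the node is about as large as the killed ones.
projected_per_node_gib = PROCS_PER_NODE * mean_rss_gib

print(f"expected from memory setting: {expected_per_node_gb} GB per node")
print(f"mean anon-rss per process:    {mean_rss_gib:.2f} GiB")
print(f"projected resident per node:  {projected_per_node_gib:.1f} GiB "
      f"(node has {NODE_RAM_GB} GB)")
```

If that assumption holds, the resident memory alone would come out well above 48 GB and close to the full 64 GB of a node, before counting the operating system or file cache.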

The really strange thing is that this only happens for the smaller structures. Bigger structures (about 50% more heavy atoms) with the same settings don't show this problem at all. And even for the smaller ones, the same input may run without a problem when it is submitted again.

Maybe someone can help me; I wanted to ask here first before talking to IT.

Thanks a lot.