vrange nnbf - Out of memory events?

Just Got Here

8:20:51 AM PDT - Thu, Sep 8th 2016

Hey there,

I recently ran into a problem with some MP2 calculations running on several nodes of our HPC cluster. The calculation runs for a while and then terminates without any real error code, just like this:

   NWChem MP2 Semi-direct Energy/Gradient Module

                              BCN_sAz_TS12_lv3 IDA

 Basis functions       =    682
 Molecular orbitals    =    675
 Frozen core           =     11
 Frozen virtuals       =      0
 Active alpha occupied =     30
 Active beta occupied  =     30
 Active alpha virtual  =    634
 Active beta virtual   =    634
 Use MO symmetry       = F
 Use skeleton AO sym   = F

 AO/Fock/Back tols     =   1.0D-09  1.0D-09  1.0D-09

 GA uses MA = F    GA memory limited = T

 Available:
  local mem=  2.01D+08
 global mem=  2.01D+08
 local disk=  1.67D+11
 mp2_memr nvloc                      8
 nvloc new                      8
  1 passes of  30:        2914906                     710274                  98697852.

 Semi-direct pass number   1 of   1  for RHF alpha+beta  at     3341.8s
 vrange nnbf                 170380
 srun: error: n15-006: task 2: Killed
 srun: Terminating job step 4189881.0
 srun: error: n15-007: task 17: Killed
 srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
 srun: error: n15-028: task 65: Killed
 srun: error: n15-015: task 62: Killed
 srun: error: n15-013: tasks 32-47: Killed
 srun: error: n15-028: tasks 64,66-79: Killed
 srun: error: n15-007: tasks 16,18-31: Killed
 srun: error: n15-015: tasks 48-61,63: Killed
 slurmstepd: error: *** STEP 4189881.0 ON n15-006 CANCELLED AT 2016-09-08T16:41:28 ***
 srun: error: n15-006: tasks 0-1,3-15: Killed

What does "vrange nnbf 170380" mean?

For those jobs I get out-of-memory notifications like this:

 jobid: 4188259
 n14-049: kernel Killed process 28251, UID 71897, (nwchem) total-vm 6362392kB, anon-rss 4213752kB, file-rss 137892kB

 jobid: 4188258
 n07-080: kernel Killed process 31758, UID 71897, (nwchem) total-vm 6571668kB, anon-rss 4089672kB, file-rss 138132kB
 n07-080: kernel Killed process 31747, UID 71897, (nwchem) total-vm 6570100kB, anon-rss 4396084kB, file-rss 137876kB
 n07-082: kernel Killed process 2139, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055784kB, file-rss 138008kB
 n07-082: kernel Killed process 2177, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055788kB, file-rss 137676kB
 n07-082: kernel Killed process 2149, UID 71897, (nwchem) total-vm 6509952kB, anon-rss 4349100kB, file-rss 137732kB

Now this is where my problem is. I'm no IT guy, so I don't really understand what that means. I know it means that NWChem tried to use more memory than was available. The message reports about 6.5 GB of total-vm per process. I guess that's the problem? I run the job on 5 nodes with 16 cores and 64 GB of RAM each, with 16 tasks per node (so no multithreading; NWChem also states nproc = 80, which is, I guess, the number of processors). With the "memory 3 GB" setting in NWChem, it shouldn't use more than 48 GB per node in total, should it? Does it run a separate process per core? In the messages above there are three processes listed on node n07-082, each taking up about 6.5 GB, which shouldn't happen.
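As a rough sanity check on the arithmetic in this question, here is a minimal Python sketch. It parses the three kernel messages for node n07-082 and compares the requested per-node memory with an estimate of what was actually resident. In a Linux OOM message, total-vm is the reserved virtual address space, while anon-rss is the memory actually held in RAM. Two assumptions are not confirmed by the thread: that all 16 processes on the node had roughly the same resident size as the three that were logged, and that the node had the same total RAM as the one shown in the free output posted further down.

```python
import re

# Kernel OOM messages for node n07-082, copied from the post above.
OOM_LINES = """\
n07-082: kernel Killed process 2139, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055784kB, file-rss 138008kB
n07-082: kernel Killed process 2177, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055788kB, file-rss 137676kB
n07-082: kernel Killed process 2149, UID 71897, (nwchem) total-vm 6509952kB, anon-rss 4349100kB, file-rss 137732kB
"""

# total-vm: reserved virtual address space; anon-rss: memory resident in RAM.
PATTERN = re.compile(
    r"total-vm\s+(\d+)kB,\s+anon-rss\s+(\d+)kB,\s+file-rss\s+(\d+)kB"
)


def parse_oom(text):
    """Return (total_vm_kb, anon_rss_kb, file_rss_kb) for each OOM line."""
    return [tuple(int(v) for v in m.groups()) for m in PATTERN.finditer(text)]


def gb(kb):
    """Convert kB to GB (1 GB = 1024**2 kB)."""
    return kb / 1024 ** 2


TASKS_PER_NODE = 16            # from --ntasks-per-node=16
NODE_RAM_KB = 66072472         # "Mem: total" in the free output (assumed same node type)
REQUESTED_KB = 3 * 1024 ** 2   # "memory 3 GB" per process

records = parse_oom(OOM_LINES)
mean_rss_kb = sum(anon for _, anon, _ in records) / len(records)

requested_per_node_kb = TASKS_PER_NODE * REQUESTED_KB
# Assumption: every process on the node is about as large as the logged ones.
estimated_per_node_kb = TASKS_PER_NODE * mean_rss_kb

print(f"requested per node : {gb(requested_per_node_kb):.1f} GB")
print(f"estimated resident : {gb(estimated_per_node_kb):.1f} GB")
print(f"physical RAM       : {gb(NODE_RAM_KB):.1f} GB")
print("estimate exceeds RAM:", estimated_per_node_kb > NODE_RAM_KB)
```

Under those assumptions, each process held roughly 4 GB in RAM rather than the 3 GB requested, and 16 such processes would need slightly more than the node physically has. This only illustrates the bookkeeping; it does not explain why the processes grew beyond the memory setting.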

Now the really strange thing is that this only happens for the smaller structures. Bigger structures (about 50% more heavy atoms) with the same settings don't show this problem at all. And even for the smaller ones, the same input may run without a problem when tried again.

Maybe someone can help me, I wanted to ask here first before talking to IT.

Thanks a lot.

Forum Vet

10:06:57 AM PDT - Thu, Sep 8th 2016
Details of computer used
What computer have you been using? Could you provide details about it and about the NWChem compilation, too?

Just Got Here

12:39:48 PM PDT - Thu, Sep 8th 2016

Thanks for your answer.

I'm using the VSC3 ( http://vsc.ac.at/systems/vsc-3/ ). Each node has two Intel Xeon E5-2650v2 CPUs (2.6 GHz, 8 cores each) and about 64 GB of RAM. The cluster is running Scientific Linux with SLURM as workload manager.

NWChem is version 6.6.

The modules I load and a typical job command:

 module load intel/16.0.2 intel-mkl/11.3 intel-mpi/5.1.1 python/2.7 nwchem/6.6

 

 srun -N 5 -K1 --ntasks-per-node=16 nwchem BCN_sAz_TS12_lv3.nw &> BCN_sAz_TS12_lv3.nwo

I've also checked available memory on the node before loading modules and running the job:

               total       used       free     shared    buffers     cached
 Mem:      66072472    1052644   65019828      28708          0     113668
 -/+ buffers/cache:     938976   65133496
 Swap:            0          0          0

Forum Vet

4:24:43 PM PDT - Thu, Sep 8th 2016
Memory line in input file
Could you send me the memory line you used in your input file and what appears in the memory information section of your output file? If you have 64 GB on your node, it is a good idea to use only up to 80-85% of the available memory, since the operating system uses some of it; your choice of 3 GB per process is therefore, in principle, safe. One way to cut down on memory usage is to add the following line just before the `task mp2` line:

 set mp2:npasses 2
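The 80-85% rule of thumb can be turned into a per-process budget with a few lines of Python. This is only a sketch: it takes the "Mem: total" figure from the free output posted earlier in the thread and the 16 tasks per node from the srun command, and the helper name `per_process_budget_mb` is made up for illustration.

```python
NODE_RAM_KB = 66072472   # "Mem: total" reported by free on one node
TASKS_PER_NODE = 16      # srun --ntasks-per-node=16
REQUESTED_MB = 3000      # roughly the 3 GB per process discussed here


def per_process_budget_mb(node_ram_kb, tasks, fraction):
    """MB available to each process if only `fraction` of node RAM is handed out."""
    return node_ram_kb * fraction / tasks / 1024


for fraction in (0.80, 0.85):
    budget = per_process_budget_mb(NODE_RAM_KB, TASKS_PER_NODE, fraction)
    verdict = "within" if REQUESTED_MB <= budget else "over"
    print(f"{fraction:.0%} of node RAM: {budget:.0f} MB per process "
          f"({REQUESTED_MB} MB is {verdict} budget)")
```

With these numbers a 3000 MB request sits below both limits, which matches the statement that the setting is safe in principle. It says nothing about memory a process may allocate beyond what the memory line covers.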

Just Got Here

12:13:11 PM PDT - Thu, Sep 22nd 2016

Sorry for the late answer.

My input section is:

memory 3000 mb

NWChem seems to recognize this and gives this output:

           Memory information
           ------------------

    heap     =   98303996 doubles =    750.0 Mbytes
    stack    =   98304001 doubles =    750.0 Mbytes
    global   =  196608000 doubles =   1500.0 Mbytes (distinct from heap & stack)
    total    =  393215997 doubles =   3000.0 Mbytes
    verify   = yes
    hardfail = no

And this failed with:

n16-025: kernel Killed process 15178, UID 71897, (nwchem) total-vm 6344264kB, anon-rss 3489356kB, file-rss 138232kB

I get the same problem if I don't specify a memory line and use the default, which is:

           Memory information
           ------------------

    heap     =  100663290 doubles =    768.0 Mbytes
    stack    =  100663295 doubles =    768.0 Mbytes
    global   =  201326592 doubles =   1536.0 Mbytes (distinct from heap & stack)
    total    =  402653177 doubles =   3072.0 Mbytes
    verify   = yes
    hardfail = no
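For reference, the two Memory information blocks above can be cross-checked with a short Python sketch. It assumes 8 bytes per double and 1 Mbyte = 1024**2 bytes, which reproduces the printed figures, and it compares the 3000 Mbytes total with the anon-rss value from the kernel message for n16-025. The dictionary and function names are just for illustration.

```python
BYTES_PER_DOUBLE = 8
BYTES_PER_MBYTE = 1024 ** 2


def doubles_to_mbytes(n_doubles):
    """Convert a count of 8-byte doubles to Mbytes (1 Mbyte = 1024**2 bytes)."""
    return n_doubles * BYTES_PER_DOUBLE / BYTES_PER_MBYTE


# Values copied from the two Memory information blocks above.
explicit = {"heap": 98303996, "stack": 98304001, "global": 196608000}
default = {"heap": 100663290, "stack": 100663295, "global": 201326592}

for label, pools in (("memory 3000 mb", explicit), ("default", default)):
    total = sum(pools.values())
    print(f"{label}: total = {total} doubles = {doubles_to_mbytes(total):.1f} Mbytes")

# Resident memory of the killed process on n16-025, from the kernel message.
ANON_RSS_KB = 3489356
anon_rss_mb = ANON_RSS_KB / 1024
limit_mb = doubles_to_mbytes(sum(explicit.values()))
print(f"anon-rss = {anon_rss_mb:.0f} MB, memory setting = {limit_mb:.0f} MB")
```

The totals agree with what NWChem prints, so the memory line is being read as intended. The killed process nevertheless had more resident memory than the 3000 Mbytes total, so some of its memory use is apparently not covered by these three pools.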

I will try the suggested setting.
