vrange nnbf - Out of memory events?


Just Got Here
Hey there,

I recently ran into a problem with some MP2 calculations running on several nodes of our HPC cluster. The calculation runs for a while and then terminates without any real error message, just like this:




   NWChem MP2 Semi-direct Energy/Gradient Module



BCN_sAz_TS12_lv3 IDA


Basis functions = 682
Molecular orbitals = 675
Frozen core = 11
Frozen virtuals = 0
Active alpha occupied = 30
Active beta occupied = 30
Active alpha virtual = 634
Active beta virtual = 634
Use MO symmetry = F
Use skeleton AO sym = F

AO/Fock/Back tols = 1.0D-09 1.0D-09 1.0D-09

GA uses MA = F GA memory limited = T

Available:
local mem= 2.01D+08
global mem= 2.01D+08
local disk= 1.67D+11
mp2_memr nvloc 8
nvloc new 8
1 passes of 30: 2914906 710274 98697852.

Semi-direct pass number 1 of 1 for RHF alpha+beta at 3341.8s
vrange nnbf 170380
srun: error: n15-006: task 2: Killed
srun: Terminating job step 4189881.0
srun: error: n15-007: task 17: Killed
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: n15-028: task 65: Killed
srun: error: n15-015: task 62: Killed
srun: error: n15-013: tasks 32-47: Killed
srun: error: n15-028: tasks 64,66-79: Killed
srun: error: n15-007: tasks 16,18-31: Killed
srun: error: n15-015: tasks 48-61,63: Killed
slurmstepd: error: *** STEP 4189881.0 ON n15-006 CANCELLED AT 2016-09-08T16:41:28 ***
srun: error: n15-006: tasks 0-1,3-15: Killed




What does " vrange nnbf 170380" mean?

I get out-of-memory notifications like this for those jobs:



 jobid: 4188259
n14-049: kernel Killed process 28251, UID 71897, (nwchem) total-vm 6362392kB, anon-rss 4213752kB, file-rss 137892kB
jobid: 4188258
n07-080: kernel Killed process 31758, UID 71897, (nwchem) total-vm 6571668kB, anon-rss 4089672kB, file-rss 138132kB
n07-080: kernel Killed process 31747, UID 71897, (nwchem) total-vm 6570100kB, anon-rss 4396084kB, file-rss 137876kB
n07-082: kernel Killed process 2139, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055784kB, file-rss 138008kB
n07-082: kernel Killed process 2177, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055788kB, file-rss 137676kB
n07-082: kernel Killed process 2149, UID 71897, (nwchem) total-vm 6509952kB, anon-rss 4349100kB, file-rss 137732kB



This is where my problem is. I'm no IT guy, so I don't fully understand what this means. I gather that NWChem tried to use more memory than was available. The message says about 6.5 GB in total, and I guess that is the problem. I run the job on 5 nodes with 16 cores and 64 GB of RAM each, with 16 tasks per node (so no multithreading; NWChem also states nproc = 80, which is, I guess, the number of processors) and with the "memory 3 GB" setting in NWChem. So it shouldn't use more than 48 GB per node in total, should it? Does NWChem run a separate process per core? The messages above list 3 processes on node n07-082, each taking up about 6.5 GB, which shouldn't happen.
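A rough way to check the per-node budget against the kernel messages is sketched below. It parses the three n07-082 lines quoted above and extrapolates their mean anon-rss to 16 processes per node. The extrapolation is an assumption (only the killed processes are reported, and the other ranks may have used less), and the helper names are made up for this illustration.

```python
import re
from collections import defaultdict

# Kernel OOM lines copied from the notifications quoted in the post.
OOM_LINES = """\
n07-082: kernel Killed process 2139, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055784kB, file-rss 138008kB
n07-082: kernel Killed process 2177, UID 71897, (nwchem) total-vm 6461920kB, anon-rss 4055788kB, file-rss 137676kB
n07-082: kernel Killed process 2149, UID 71897, (nwchem) total-vm 6509952kB, anon-rss 4349100kB, file-rss 137732kB
"""

# Captures the node name, the process id and the anon-rss value in kB.
PATTERN = re.compile(
    r"^(?P<node>\S+): kernel Killed process (?P<pid>\d+),.*?anon-rss (?P<rss>\d+)kB"
)

KB_PER_GIB = 1024 ** 2
TASKS_PER_NODE = 16  # "tasks per node 16" in the post


def anon_rss_by_node(text):
    """Return {node: [anon-rss in kB, ...]} for every OOM line in text."""
    result = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            result[match.group("node")].append(int(match.group("rss")))
    return dict(result)


def projected_node_usage_gib(rss_values_kb, tasks_per_node=TASKS_PER_NODE):
    """Mean anon-rss scaled to all ranks on a node (assumes similar ranks)."""
    mean_kb = sum(rss_values_kb) / len(rss_values_kb)
    return mean_kb * tasks_per_node / KB_PER_GIB


rss = anon_rss_by_node(OOM_LINES)
for node, values in rss.items():
    print(f"{node}: {len(values)} killed processes, "
          f"projected usage {projected_node_usage_gib(values):.1f} GiB "
          f"for {TASKS_PER_NODE} ranks")
```

If the other ranks on the node were comparable, the projected figure lands close to the node's 64 GB rather than the expected 48 GB, which would be consistent with the kernel killing processes.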

Now the really strange thing is that this only happens for the smaller structures. Bigger structures (about 50% more heavy atoms) with the same settings don't show this problem at all. And even for the smaller ones, the same input may run without a problem when tried again.

Maybe someone can help me. I wanted to ask here first before talking to IT.

Thanks a lot.

Forum Vet
Details of computer used
What computer have you been using?
Could you provide details about it and about the NWChem compilation, too?

Just Got Here
Thanks for your answer.

I'm using the VSC3 ( http://vsc.ac.at/systems/vsc-3/ ). Each node has two Intel Xeon E5-2650v2 processors (2.6 GHz, 8 cores each) and about 64 GB of RAM. The cluster runs Scientific Linux with SLURM as the workload manager.

NWChem is version 6.6.


Modules used and a typical run command:

 module load intel/16.0.2 intel-mkl/11.3 intel-mpi/5.1.1 python/2.7 nwchem/6.6

srun -N 5 -K1 --ntasks-per-node=16 nwchem BCN_sAz_TS12_lv3.nw &> BCN_sAz_TS12_lv3.nwo

I've also checked available memory on the node before loading modules and running the job:

             total       used       free     shared    buffers     cached
Mem:      66072472    1052644   65019828      28708          0     113668
-/+ buffers/cache:     938976   65133496
Swap:            0          0          0

Forum Vet
Memory line in input file
Could you send me the memory line you used in your input file and what appears in the memory information section of your output file?
If you have 64 GB on your node, it is a good idea to use no more than 80-85% of the available memory, since the operating system needs some of it. With 16 processes per node, your choice of 3 GB per process adds up to 48 GB (75%), so it is, in principle, safe.
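The rule of thumb above can be turned into a per-process figure with a few lines of arithmetic. This is only a sketch of that reasoning; the function name is made up for illustration, and the 64 GB and 16 processes per node are the values given in this thread.

```python
def per_process_budget_gb(node_ram_gb, processes_per_node, usable_fraction):
    """Memory each process may claim if only a fraction of node RAM is used."""
    return node_ram_gb * usable_fraction / processes_per_node


NODE_RAM_GB = 64          # RAM per node, as stated in the thread
PROCESSES_PER_NODE = 16   # one NWChem process per core
REQUESTED_GB = 3          # value passed to the NWChem memory directive

for fraction in (0.80, 0.85):
    budget = per_process_budget_gb(NODE_RAM_GB, PROCESSES_PER_NODE, fraction)
    verdict = "within" if REQUESTED_GB <= budget else "over"
    print(f"{fraction:.0%} of node RAM -> {budget:.2f} GB per process; "
          f"{REQUESTED_GB} GB is {verdict} budget")
```

Under both fractions the 3 GB request stays below the per-process budget, which is why the setting looks safe on paper even though the jobs are still being killed.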

One way to cut down on memory usage is to add the following line just before the task mp2 line:

set mp2:npasses 2

Just Got Here
Sorry for the late answer.

My input section is:

memory 3000 mb


NWChem seems to recognize this and gives the following output:

         Memory information
           ------------------

    heap     =   98303996 doubles =    750.0 Mbytes
    stack    =   98304001 doubles =    750.0 Mbytes
    global   =  196608000 doubles =   1500.0 Mbytes (distinct from heap & stack)
    total    =  393215997 doubles =   3000.0 Mbytes
    verify   = yes
    hardfail = no 


And this failed with:
n16-025: kernel Killed process 15178, UID 71897, (nwchem) total-vm 6344264kB, anon-rss 3489356kB, file-rss 138232kB
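To put the numbers above side by side, the sketch below converts the reported doubles to Mbytes and compares the configured total with the anon-rss from the kernel message. It assumes 8 bytes per double and that "Mbytes" means 2**20 bytes, which reproduces the values NWChem printed. The comparison only shows the size of the gap; it does not say where the extra memory comes from.

```python
BYTES_PER_DOUBLE = 8
BYTES_PER_MBYTE = 1024 ** 2  # assumption: matches the Mbytes values printed by NWChem


def doubles_to_mbytes(n_doubles):
    """Convert a count of double-precision words to Mbytes."""
    return n_doubles * BYTES_PER_DOUBLE / BYTES_PER_MBYTE


# Values copied from the "Memory information" block for "memory 3000 mb".
reported_doubles = {
    "heap": 98303996,
    "stack": 98304001,
    "global": 196608000,
    "total": 393215997,
}

# anon-rss of the killed process on n16-025, in kB, from the kernel message.
anon_rss_kb = 3489356
anon_rss_mbytes = anon_rss_kb / 1024

for name, count in reported_doubles.items():
    print(f"{name:>6} = {doubles_to_mbytes(count):7.1f} Mbytes")

configured_total = doubles_to_mbytes(reported_doubles["total"])
print(f"anon-rss = {anon_rss_mbytes:.1f} Mbytes, "
      f"{anon_rss_mbytes - configured_total:.1f} Mbytes above the configured total")
```

The resident size of the killed process is larger than the 3000 Mbytes configured in the input, so the memory directive alone does not appear to bound what each process actually uses.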



I have the same problem if I don't specify the memory and use the default, which is:
           Memory information
           ------------------

    heap     =  100663290 doubles =    768.0 Mbytes
    stack    =  100663295 doubles =    768.0 Mbytes
    global   =  201326592 doubles =   1536.0 Mbytes (distinct from heap & stack)
    total    =  402653177 doubles =   3072.0 Mbytes
    verify   = yes
    hardfail = no 


I will try the suggested setting.

