nwchem runtime difference - single node vs multi-node


Hi all,

We need to install NWChem for the users of our shared cluster here at the university.

We did not see the issue described below on our old CentOS 6 system, but we do see it on our new Red Hat 7 system:

Linux ... 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux


I am testing with the following h2o input file (the issue also shows up with some other "basic" input files):
start h2o_freq
charge 1
geometry units angstroms
 O       0.0  0.0  0.0
 H       0.0  0.0  1.0
 H       0.0  1.0  0.0
end
basis
 H library sto-3g
 O library sto-3g
end
scf
 uhf; doublet
 print low
end
title "H2O+ : STO-3G UHF geometry optimization"
task scf optimize
basis
 H library 6-31g**
 O library 6-31g**
end
title "H2O+ : 6-31g** UMP2 geometry optimization"
task mp2 optimize
mp2; print none; end
scf; print none; end
title "H2O+ : 6-31g** UMP2 frequencies"
task mp2 freq


When using 2 MPI processes on a single node, the walltime is around 15 seconds.
Here are the first few and last few lines of the output:
           Job information
           ---------------

    hostname        = node404.oscar.ccv.brown.edu
    program         = /gpfs/runtime/opt/nwchem/6.8-openmpi/bin/nwchem
    date            = Mon Jun 18 11:44:13 2018

    compiled        = Fri_Jun_15_16:31:11_2018
    source          = /gpfs/runtime/opt/nwchem/6.8-openmpi/src/nwchem-6.8
    nwchem branch   = 6.8
    nwchem revision = v6.8-47-gdf6c956
    ga revision     = ga-5.6.3
    use scalapack   = F
    input           = h2o.nw
    prefix          = h2o_freq.
    data base       = ./h2o_freq.db
    status          = startup
    nproc           =        2
    time left       =   3598s
.
.
.
.
.
 ----------------------------------------------------------------------------
 Normal Eigenvalue ||    Projected Derivative Dipole Moments (debye/angs)
  Mode   [cm**-1]  ||      [d/dqX]             [d/dqY]           [d/dqZ]
 ------ ---------- || ------------------ ------------------ -----------------
    1       -0.000 ||      -1.131               0.000             0.000
    2       -0.000 ||       1.701               0.000             0.404
    3       -0.000 ||      -0.651               0.000             1.057
    4        0.000 ||       0.000              -0.044             0.000
    5        0.000 ||       0.000               2.480             0.000
    6        0.000 ||       0.000               2.480             0.000
    7     1484.716 ||       0.000               0.000             2.112
    8     3460.149 ||      -0.000               0.000             1.877
    9     3551.507 ||       3.435               0.000            -0.000
 ----------------------------------------------------------------------------



  
  
 ----------------------------------------------------------------------------
 Normal Eigenvalue ||           Projected Infra Red Intensities
  Mode   [cm**-1]  || [atomic units] [(debye/angs)**2] [(KM/mol)] [arbitrary]
 ------ ---------- || -------------- ----------------- ---------- -----------
    1       -0.000 ||    0.055473           1.280        54.077       3.034
    2       -0.000 ||    0.132537           3.058       129.203       7.249
    3       -0.000 ||    0.066795           1.541        65.115       3.653
    4        0.000 ||    0.000084           0.002         0.082       0.005
    5        0.000 ||    0.266538           6.149       259.834      14.578
    6        0.000 ||    0.266538           6.149       259.834      14.578
    7     1484.716 ||    0.193397           4.462       188.533      10.577
    8     3460.149 ||    0.152660           3.522       148.821       8.349
    9     3551.507 ||    0.511546          11.802       498.680      27.978
 ----------------------------------------------------------------------------



 vib:animation  F

 Task  times  cpu:        8.2s     wall:        9.3s
 
 
                                NWChem Input Module
                                -------------------
 
 
 Summary of allocated global arrays
-----------------------------------
  No active global arrays



                         GA Statistics for process    0
                         ------------------------------

       create   destroy   get      put      acc     scatter   gather  read&inc
calls: 1.78e+04 1.78e+04 2.40e+05 5.73e+04 7.71e+04 2485        0     1.39e+04 
number of processes/call 1.03e+00 1.04e+00 1.06e+00 0.00e+00 0.00e+00
bytes total:             6.87e+07 4.90e+07 2.00e+07 4.00e+02 0.00e+00 1.11e+05
bytes remote:            5.69e+06 7.90e+06 3.69e+06 0.00e+00 0.00e+00 0.00e+00
Max memory consumed for GA by this process: 514056 bytes
 
...

 Total times  cpu:       12.3s     wall:       14.7s
MA_summarize_allocated_blocks: starting scan ...
MA_summarize_allocated_blocks: scan completed: 0 heap blocks, 0 stack blocks
MA usage statistics:

	allocation statistics:
					      heap	     stack
					      ----	     -----
	current number of blocks	         0	         0
	maximum number of blocks	        25	        51
	current total bytes		         0	         0
	maximum total bytes		  31471376	  22510232
	maximum total K-bytes		     31472	     22511
	maximum total M-bytes		        32	        23



When running with 1 process on each of 2 nodes, the walltime is about 227 seconds:
           Job information
           ---------------

    hostname        = node404.oscar.ccv.brown.edu
    program         = /gpfs/runtime/opt/nwchem/6.8-openmpi/bin/nwchem
    date            = Mon Jun 18 11:25:00 2018

    compiled        = Fri_Jun_15_16:31:11_2018
    source          = /gpfs/runtime/opt/nwchem/6.8-openmpi/src/nwchem-6.8
    nwchem branch   = 6.8
    nwchem revision = v6.8-47-gdf6c956
    ga revision     = ga-5.6.3
    use scalapack   = F
    input           = h2o.nw
    prefix          = h2o_freq.
    data base       = ./h2o_freq.db
    status          = startup
    nproc           =        2
    time left       =   3599s
.
.
.
.
.
 ----------------------------------------------------------------------------
 Normal Eigenvalue ||    Projected Derivative Dipole Moments (debye/angs)
  Mode   [cm**-1]  ||      [d/dqX]             [d/dqY]           [d/dqZ]
 ------ ---------- || ------------------ ------------------ -----------------
    1       -0.000 ||      -0.651               0.000             1.057
    2        0.000 ||       0.000              -0.044             0.000
    3        0.000 ||       0.000               2.480             0.000
    4        0.000 ||       0.000               2.480             0.000
    5        0.000 ||      -1.131               0.000             0.000
    6        0.000 ||       1.701               0.000             0.404
    7     1484.768 ||       0.000               0.000             2.112
    8     3460.171 ||       0.000               0.000             1.877
    9     3551.514 ||      -3.435               0.000             0.000
 ----------------------------------------------------------------------------



  
  
 ----------------------------------------------------------------------------
 Normal Eigenvalue ||           Projected Infra Red Intensities
  Mode   [cm**-1]  || [atomic units] [(debye/angs)**2] [(KM/mol)] [arbitrary]
 ------ ---------- || -------------- ----------------- ---------- -----------
    1       -0.000 ||    0.066797           1.541        65.117       3.653
    2        0.000 ||    0.000084           0.002         0.082       0.005
    3        0.000 ||    0.266531           6.149       259.828      14.578
    4        0.000 ||    0.266531           6.149       259.828      14.578
    5        0.000 ||    0.055472           1.280        54.077       3.034
    6        0.000 ||    0.132548           3.058       129.215       7.250
    7     1484.768 ||    0.193382           4.461       188.519      10.577
    8     3460.171 ||    0.152668           3.522       148.828       8.350
    9     3551.514 ||    0.511486          11.800       498.622      27.976
 ----------------------------------------------------------------------------



 vib:animation  F

 Task  times  cpu:      134.5s     wall:      135.4s
 
 
                                NWChem Input Module
                                -------------------
 
 
 Summary of allocated global arrays
-----------------------------------
  No active global arrays



                         GA Statistics for process    0
                         ------------------------------

       create   destroy   get      put      acc     scatter   gather  read&inc
calls: 1.77e+04 1.77e+04 3.30e+05 6.34e+04 8.87e+04 2475        0     2.57e+04 
number of processes/call 1.02e+00 1.03e+00 1.05e+00 0.00e+00 0.00e+00
bytes total:             7.45e+07 5.29e+07 2.42e+07 4.00e+02 0.00e+00 2.05e+05
bytes remote:            5.85e+06 8.06e+06 3.69e+06 0.00e+00 0.00e+00 0.00e+00
Max memory consumed for GA by this process: 514056 bytes

...

 Total times  cpu:      225.2s     wall:      227.3s
MA_summarize_allocated_blocks: starting scan ...
MA_summarize_allocated_blocks: scan completed: 0 heap blocks, 0 stack blocks
MA usage statistics:

	allocation statistics:
					      heap	     stack
					      ----	     -----
	current number of blocks	         0	         0
	maximum number of blocks	        25	        51
	current total bytes		         0	         0
	maximum total bytes		  31471376	  22510232
	maximum total K-bytes		     31472	     22511
	maximum total M-bytes		        32	        23
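
For anyone who wants to compare the GA statistics of the two runs side by side, here is a minimal Python sketch. It assumes the "calls:" line always carries the eight columns in the order of the header shown above; the function name `parse_ga_calls` and the hard-coded lines are only for illustration and are not part of NWChem.

```python
# Column order taken from the header line of the "GA Statistics" block.
OPS = ["create", "destroy", "get", "put", "acc", "scatter", "gather", "read&inc"]


def parse_ga_calls(text):
    """Return {operation: call count} from the 'calls:' line of an NWChem output."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "calls:":
            counts = [float(v) for v in fields[1:]]
            if len(counts) != len(OPS):
                raise ValueError("unexpected number of columns in 'calls:' line")
            return dict(zip(OPS, counts))
    raise ValueError("no 'calls:' line found")


# The two 'calls:' lines copied from the outputs in this post.
single_node = "calls: 1.78e+04 1.78e+04 2.40e+05 5.73e+04 7.71e+04 2485        0     1.39e+04"
two_nodes = "calls: 1.77e+04 1.77e+04 3.30e+05 6.34e+04 8.87e+04 2475        0     2.57e+04"

a = parse_ga_calls(single_node)
b = parse_ga_calls(two_nodes)

for op in OPS:
    # Avoid dividing by zero for operations that were never called (gather).
    ratio = b[op] / a[op] if a[op] else float("nan")
    print(f"{op:>9}: {a[op]:10.0f} {b[op]:10.0f}  x{ratio:.2f}")
```

In the two runs above, none of the call counts differs by more than a factor of two, which is much less than the difference in walltime.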



1) I am currently using Open MPI 2.0.3. MVAPICH2 2.3 is even worse: the same job takes about 10 minutes on 2 nodes with 1 process per node.

2) I compiled with the Intel 2017 compilers; using an older version made no difference.

3) We now have NWChem 6.8 installed, as the output above shows, but the issue was first observed with version 6.6.

4) Again, this was not an issue on our CentOS 6 system, but it is on our Red Hat 7 system. (I am sure there are other differences, such as the MPI/SLURM configuration, that could be the cause, but I mention it in case it is related to the OS.)

In short, even a simple NWChem job run with 2 processes spread over two nodes takes about an order of magnitude longer (roughly 227 s vs. 15 s walltime) than the same job run with 2 processes on a single node.
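
To put a number on the slowdown, the following Python sketch pulls the cpu and wall values from the "Total times" line of each output and divides them. The regular expression is based only on the line format seen in the outputs above, and the helper name `total_times` is made up for this example.

```python
import re

# Matches lines such as " Total times  cpu:       12.3s     wall:       14.7s"
TOTAL_RE = re.compile(r"Total times\s+cpu:\s*([\d.]+)s\s+wall:\s*([\d.]+)s")


def total_times(text):
    """Return (cpu_seconds, wall_seconds) from the 'Total times' line."""
    match = TOTAL_RE.search(text)
    if match is None:
        raise ValueError("no 'Total times' line found")
    return float(match.group(1)), float(match.group(2))


# The two 'Total times' lines copied from the outputs in this post.
single_node = " Total times  cpu:       12.3s     wall:       14.7s"
two_nodes = " Total times  cpu:      225.2s     wall:      227.3s"

_, wall_single = total_times(single_node)
_, wall_multi = total_times(two_nodes)

slowdown = wall_multi / wall_single
print(f"two-node run is {slowdown:.1f}x slower in walltime")
```

To use it on real output files, read each file into a string and pass that string to `total_times` instead of the hard-coded lines.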

Has anyone else encountered a similar problem?