Erratic behaviour with multi-node runs


Hi all,

I've tried some calculations with NWChem (version 6.5) on a local multi-node cluster, and I get inconsistent results and a lot of abnormal terminations. The following simple calculation should converge in 5 iterations.

echo
title "rhf_benchmark"
start  rhf_benchmark

geometry noautoz noautosym nocenter
C       9.760800        10.312800       24.029200
C       9.415700        10.321500       21.238400
C       7.491800        8.241800        17.193500
C       6.891000        7.281600        17.961300
C       7.051900        7.269700        19.373900
C       7.425500        7.327000        22.254400
C       7.619300        7.343300        23.659900
C       8.399500        8.315500        24.236000
C       8.986300        9.289800        23.447000
C       8.810100        9.306000        22.037500
C       8.019000        8.278600        21.441200
C       10.330700       11.293700       23.276500
C       7.843600        8.264600        19.973200
C       10.175200       11.309900       21.877100
C       9.766800        11.289400       18.983400
C       9.602000        11.243900       17.573900
C       8.868700        10.252200       16.975600
C       8.269400        9.249700        17.759600
C       8.440100        9.254100        19.174100
C       9.222000        10.292300       19.789600
H       9.862800        10.328000       25.153100
H       7.394600        8.271100        16.123000
H       6.293300        6.495600        17.495200
H       6.579100        6.495600        20.017600
H       6.799000        6.571400        21.803400
H       7.116800        6.506400        24.335900
H       8.575300        8.314400        25.344800
H       10.866400       12.060200       23.881900
H       10.665200       12.125100       21.268700
H       10.390700       12.060200       19.472800
H       10.080400       12.027700       16.899900
H       8.742200        10.144000       15.911200
end

basis
 * library "6-31+g*"
end

scf
 direct
 maxiter 500
 vectors input atomic output rhf.movecs
end
task scf energy


When I run it on a single node, I get the expected convergence (also checked on my workstation and on another HPC cluster I have access to):

              iter       energy          gnorm     gmax       time
             ----- ------------------- --------- --------- --------
                 1     -764.1592324706  1.52D+00  2.03D-01     21.7
                 2     -764.3730890559  4.21D-01  4.49D-02     32.8
                 3     -764.3914294187  3.79D-02  7.22D-03     64.8
                 4     -764.3916250719  7.35D-04  1.09D-04    117.7
                 5     -764.3916251806  5.31D-06  9.45D-07    191.6


When running on multiple nodes, I have convergence issues. Of course, for this simple case it does not matter much, but larger, more demanding jobs that combine SCF and DFT calculations just crash. This is an example of the erratic behaviour:

              iter       energy          gnorm     gmax       time
             ----- ------------------- --------- --------- --------
                 1     -764.1592324706  7.03D+01  2.95D+01      8.0
  Setting level-shift to 334.61 to force positive preconditioner
                 2     -768.1858396399  5.41D+01  3.37D+01     77.6
  Setting level-shift to   9.48 to force positive preconditioner
                 3     -778.0312416259  3.86D+01  1.39D+01    113.7
  Setting level-shift to   8.03 to force positive preconditioner
                 4     -779.2666815470  5.25D+01  5.02D+01    123.3
                 5     -751.0850942711  2.43D+01  1.27D+01    209.3
  Setting level-shift to  26.03 to force positive preconditioner
                 6     -763.7538899332  4.15D+00  3.25D+00    218.6
                 7     -764.3283623955  1.10D+00  3.23D-01    227.5
                 8     -764.3753745707  2.56D+00  5.05D-01    231.9
  Setting level-shift to 114.24 to force positive preconditioner
                 9     -764.3854876192  4.25D-01  7.98D-02    261.4
  ga_iter_lsolve: convergence stagnant ... aborting solve

 Disabled NR: increased maxiter to 510

                10     -764.3906628061  8.58D-02  1.62D-02    273.5
                11     -764.3913533709  1.06D-01  2.77D-02    277.7
                12     -764.3915627027  3.27D-02  6.39D-03    281.8
                13     -764.3916095335  2.10D-02  5.43D-03    285.9
                14     -764.3916196286  9.96D-03  2.44D-03    290.0
                15     -764.3916228030  2.79D-03  5.27D-04    294.2
                16     -764.3916233437  2.11D-03  5.17D-04    298.2
                17     -764.3916235550  5.07D-04  1.08D-04    302.3
                18     -764.3916235980  4.34D-04  6.34D-05    310.1
                19     -764.3916236203  2.29D-04  2.04D-05    319.1
                20     -764.3916236235  8.51D-05  8.99D-06    323.1
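
To compare the two runs more systematically, here is a minimal Python sketch that parses an SCF iteration table in the layout shown above. It counts the "Setting level-shift" messages and reports the iterations at which the energy went up. The function names and the `SAMPLE` string (the first five multi-node iterations copied from above) are only illustrative, and the regular expression assumes the exact column layout of these two tables.

```python
import re

# One row of the SCF table: iter, energy, gnorm, gmax, time.
# Assumes the column layout shown in the tables above.
ROW = re.compile(
    r"^\s*(\d+)\s+(-?\d+\.\d+)\s+(\d\.\d+D[+-]\d+)\s+(\d\.\d+D[+-]\d+)\s+(\d+\.\d+)\s*$"
)

# First five multi-node iterations, copied from the output above.
SAMPLE = """\
     1     -764.1592324706  7.03D+01  2.95D+01      8.0
  Setting level-shift to 334.61 to force positive preconditioner
     2     -768.1858396399  5.41D+01  3.37D+01     77.6
  Setting level-shift to   9.48 to force positive preconditioner
     3     -778.0312416259  3.86D+01  1.39D+01    113.7
  Setting level-shift to   8.03 to force positive preconditioner
     4     -779.2666815470  5.25D+01  5.02D+01    123.3
     5     -751.0850942711  2.43D+01  1.27D+01    209.3
"""


def fortran_float(text):
    """Convert a Fortran-style number such as '1.52D+00' to a float."""
    return float(text.replace("D", "E"))


def parse_scf_table(output):
    """Return ([(iter, energy, gnorm), ...], number of level-shift messages)."""
    rows, shifts = [], 0
    for line in output.splitlines():
        if "Setting level-shift" in line:
            shifts += 1
            continue
        match = ROW.match(line)
        if match:
            rows.append(
                (int(match.group(1)), float(match.group(2)), fortran_float(match.group(3)))
            )
    return rows, shifts


def energy_increases(rows):
    """Iterations whose energy is higher than that of the previous iteration."""
    return [cur[0] for prev, cur in zip(rows, rows[1:]) if cur[1] > prev[1]]


if __name__ == "__main__":
    rows, shifts = parse_scf_table(SAMPLE)
    print("iterations parsed:", len(rows))
    print("level-shift messages:", shifts)
    print("energy went up at iterations:", energy_increases(rows))
```

Running the same functions on the full output of both jobs makes the difference easy to see: both runs start from the same first-iteration energy, but the gnorm at iteration 1 is 1.52D+00 on one node and 7.03D+01 on multiple nodes.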


I'm fairly convinced that it has to do with the way my cluster admins built NWChem. I contacted them, but they couldn't really help me. The build script they used is here:

https://github.com/UCL-RITS/rcps-buildscripts/blob/master/nwchem-6.5_install

This is the submission script I am using:

#!/bin/bash -l
#$ -S /bin/bash
#$ -l h_rt=01:00:00
#$ -l mem=1G
#$ -N nw64
#$ -pe mpi 64
#$ -wd <... directory ...>
module load python/2.7.9
module load nwchem/6.5-r26243/intel-2015-update2
module list
mpirun -np $NSLOTS -machinefile $TMPDIR/machines nwchem rhf.nw 


We managed to fix the issue with the machinefile, but the calculations still do not behave as they should.
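
As a sanity check on the machinefile itself, the following small Python sketch counts how many slots each host receives. It assumes the file passed to `-machinefile` lists one hostname per line, repeated once per slot; that format is an assumption about this particular setup, and the host names in the example are made up.

```python
from collections import Counter


def slots_per_host(machinefile_text):
    """Count how often each host appears (assumes one hostname per slot per line)."""
    hosts = [line.split()[0] for line in machinefile_text.splitlines() if line.strip()]
    return Counter(hosts)


if __name__ == "__main__":
    # Hypothetical machinefile content, for illustration only.
    example = "node-a\nnode-a\nnode-b\nnode-b\n"
    counts = slots_per_host(example)
    for host, slots in sorted(counts.items()):
        print(host, slots)
    print("total slots:", sum(counts.values()))
```

The total should match the `$NSLOTS` value passed to `mpirun -np`, which is 64 for the `-pe mpi 64` request above.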

Any suggestions will be greatly appreciated!!!

Best regards,
Orestis