Issue with pspw using nwchem 6.3 on AMD Bulldozer -- bug in nwchem 6.3?

While playing around with the pspw module in nwchem 6.3, I noticed that jobs finish without issue on my AMD Athlon II X3 445, AMD Phenom II X6 1055T and Intel i5-2400 nodes, but fail to converge on my FX8150 and FX8350 nodes. Instead, the structures blow up. I've tried building with no external blas, with openblas and with acml (int; with and without fma4), all with the same result. However, the same job does converge on the bulldozer and vishera (FX 8x50) cpus when using nwchem 6.1.1, which would indicate an issue specific to nwchem 6.3.

Edit: I've never noticed any issues using the DFT module -- it's only pspw that's causing the issue, and only in nwchem 6.3

(There's also a very minor typo in the nwchem code (src/band/minimizer/band_minimizer.F) -- it says Grassman/Stiefel with one n in Grassmann, rather than two.)

I'm not trying to do any production work in pspw, but rather I just want to report a possible bug.

The following job was used (it's just a random test job that triggers the issue):
scratch_dir /home/me/scratch
Title "pspw test job"

Start  biphenyl_cation_twisted-1


charge 1

geometry autosym units angstrom
 C     0.00000     -3.54034     0.00000
 C     -1.20296     -2.84049     -0.216000
 C     -1.20944     -1.46171     -0.206253
 C     0.00000     -0.721866     0.00000
 C     1.20944     -1.46171     0.206253
 C     1.20296     -2.84049     0.216000
 C     0.00000     0.721866     0.00000
 C     1.20944     1.46171     -0.206253
 C     1.20296     2.84049     -0.216000
 C     -1.20944     1.46171     0.206253
 C     0.00000     3.54034     0.00000
 C     -1.20296     2.84049     0.216000
 H     0.00000     -4.62590     0.00000
 H     -2.12200     -3.38761     -0.395378
 H     -2.13673     -0.938003     -0.401924
 H     2.12200     -3.38761     0.395378
 H     2.12200     3.38761     -0.395378
 H     -2.13673     0.938003     0.401924
 H     0.00000     4.62590     0.00000
 H     -2.12200     3.38761     0.395378
 H     2.13673     0.938003     -0.401924
 H     2.13673     -0.938003     0.401924

ecce_print /home/me/neon/job1056/ecce.out

      2.000000e+01 0.000000e+00 0.000000e+00
      0.000000e+00 2.000000e+01 0.000000e+00
      0.000000e+00 0.000000e+00 2.000000e+01
  mult 2
  np_dimensions -1  -1  
  tolerances 1e-7  1e-7


task pspw optimize

The following nodes worked fine:
AMD Athlon II X3, 8 GB RAM
AMD Phenom II X6, 8 GB RAM
Intel i5-2400, 16 GB RAM

The final structure was this:
Structure 1
C               0.00000            -3.43835             0.00000
C              -1.17952            -2.76904            -0.27325
C              -1.18743            -1.41309            -0.25617
C               0.00000            -0.70724             0.00000
C               1.18743            -1.41309             0.25617
C               1.17952            -2.76904             0.27325
C               0.00000             0.70724             0.00000
C               1.18743             1.41309            -0.25617
C               1.17952             2.76904            -0.27325
C              -1.18743             1.41309             0.25617
C               0.00000             3.43835             0.00000
C              -1.17952             2.76904             0.27325
H               0.00000            -4.51201             0.00000
H              -2.07930            -3.32101            -0.49999
H              -2.09224            -0.87241            -0.48912
H               2.07930            -3.32101             0.49999
H               2.07930             3.32101            -0.49999
H              -2.09224             0.87241             0.48912
H               0.00000             4.51201             0.00000
H              -2.07930             3.32101             0.49999
H               2.09224             0.87241            -0.48912
H               2.09224            -0.87241             0.48912

cat nwch.nwout|grep "Total PSPW energy"
 Total PSPW energy   :  -0.7403126784E+02
 Total PSPW energy   :  -0.7403944621E+02
 Total PSPW energy   :  -0.7404121161E+02
 Total PSPW energy   :  -0.7404171961E+02
 Total PSPW energy   :  -0.7404173291E+02
 Total PSPW energy   :  -0.7404176133E+02
 Total PSPW energy   :  -0.7404176138E+02
 Total PSPW energy   :  -0.7404178719E+02
 Total PSPW energy   :  -0.7404179836E+02
 Total PSPW energy   :  -0.7404181219E+02
 Total PSPW energy   :  -0.7404181529E+02
 Total PSPW energy   :  -0.7404183416E+02
 Total PSPW energy   :  -0.7404183384E+02
 Total PSPW energy   :  -0.7404182554E+02
 Total PSPW energy   :  -0.7404183894E+02
 Total PSPW energy   :  -0.7404184777E+02
 Total PSPW energy   :  -0.7404184781E+02
 Total PSPW energy   :  -0.7404185248E+02
 Total PSPW energy   :  -0.7404185252E+02
 Total PSPW energy   :  -0.7404183390E+02
 Total PSPW energy   :  -0.7404185160E+02
 Total PSPW energy   :  -0.7404185179E+02
 Total PSPW energy   :  -0.7404185172E+02
 Total PSPW energy   :  -0.7404185163E+02
 Total PSPW energy   :  -0.7404185167E+02

The step/energy data for the first geometry cycle behaves as expected:
== Energy Calculation ==

          ====== Grassmann conjugate gradient iteration ======
     >>>  ITERATION STARTED AT Wed Nov 13 14:25:02 2013  <<<
    iter.           Energy         DeltaE       DeltaRho 
     -  15 steepest descent iterations performed
      10   -0.3692699170E+02   -0.26760E+01    0.53590E-02
     -  10 steepest descent iterations performed
      20   -0.6158204365E+02   -0.41876E+00    0.31092E-03
     -  10 steepest descent iterations performed
      30   -0.7003979846E+02   -0.12758E+00    0.42769E-04
     -  10 steepest descent iterations performed
      40   -0.7263392229E+02   -0.34324E-01    0.21020E-04
     -  10 steepest descent iterations performed
      50   -0.7336419645E+02   -0.13767E-01    0.10561E-04
     -  10 steepest descent iterations performed
      60   -0.7380342913E+02   -0.14352E-01    0.99511E-06
     -  10 steepest descent iterations performed
      70   -0.7397649640E+02   -0.61870E-02    0.24790E-04
      80   -0.7401568945E+02   -0.25650E-02    0.10674E-04
     270   -0.7403126784E+02   -0.99193E-07    0.18719E-09
  *** tolerance ok. iteration terminated
     >>>  ITERATION ENDED   AT Wed Nov 13 14:49:47 2013  <<<

The following nodes lead to exploding structures:
AMD FX8150, 32 GB RAM
AMD FX8350, 32 GB RAM

Structure 23
C               0.00000            -3.28702             0.00000
C              -3.07661            -4.04679            -3.00814
C              -2.98013            -0.93045            -3.30301
C               0.00000            -1.11917             0.00000
C               2.98013            -0.93045             3.30301
C               3.07661            -4.04679             3.00814
C               0.00000             1.11917             0.00000
C               2.98013             0.93045            -3.30301
C               3.07661             4.04679            -3.00814
C              -2.98013             0.93045             3.30301
C               0.00000             3.28702             0.00000
C              -3.07661             4.04679             3.00814
H               0.00000            -4.78561             0.00000
H              -4.23747            -6.33817            -3.55692
H              -4.05310             0.84076            -3.86806
H               4.23747            -6.33817             3.55692
H               4.23747             6.33817            -3.55692
H              -4.05310            -0.84076             3.86806
H               0.00000             4.78561             0.00000
H              -4.23747             6.33817             3.55692
H               4.05310            -0.84076            -3.86806
H               4.05310             0.84076             3.86806
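
To put a number on "blew up", here is a minimal sketch that computes the distance between the first two carbon atoms in the three listings in this post (input geometry, final structure on the working nodes, structure 23 on the FX nodes). The coordinates are copied from above; the helper name `distance` is just for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) points, in angstrom."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# First two carbon atoms of each listing, copied from this post.
start = [(0.00000, -3.54034, 0.00000), (-1.20296, -2.84049, -0.216000)]
good_final = [(0.00000, -3.43835, 0.00000), (-1.17952, -2.76904, -0.27325)]
bad_step23 = [(0.00000, -3.28702, 0.00000), (-3.07661, -4.04679, -3.00814)]

for label, (c1, c2) in [
    ("input geometry", start),
    ("final structure, working nodes", good_final),
    ("structure 23, FX nodes", bad_step23),
]:
    print(f"{label}: C1-C2 = {distance(c1, c2):.2f} angstrom")
```

The first two distances come out close to an aromatic C-C bond length, while the FX value is several angstrom, i.e. the ring is no longer bonded.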

cat nwch.nwout|grep "Total PSPW energy"
 Total PSPW energy   :   0.7974031861E+02
 Total PSPW energy   :   0.7246518114E+02
 Total PSPW energy   :   0.6606772951E+02
 Total PSPW energy   :   0.5673312109E+02
 Total PSPW energy   :   0.4840063736E+02
 Total PSPW energy   :   0.4059627772E+02
 Total PSPW energy   :   0.3478817836E+02
 Total PSPW energy   :   0.2592070851E+02
 Total PSPW energy   :   0.1961993049E+02
 Total PSPW energy   :   0.1337599142E+02
 Total PSPW energy   :   0.8368934197E+01
 Total PSPW energy   :   0.4070828454E+01
 Total PSPW energy   :   0.4890631969E+00
 Total PSPW energy   :  -0.5836265579E+01
 Total PSPW energy   :  -0.7890745466E+01
 Total PSPW energy   :  -0.1408910732E+02
 Total PSPW energy   :  -0.1376509590E+02
 Total PSPW energy   :  -0.1592566062E+02
 Total PSPW energy   :  -0.1795967556E+02
 Total PSPW energy   :  -0.2040058313E+02
 Total PSPW energy   :  -0.2225738586E+02
 Total PSPW energy   :  -0.2338027226E+02
 Total PSPW energy   :  -0.2338044505E+02
 Total PSPW energy   :  -0.2461935725E+02
 Total PSPW energy   :  -0.2515495572E+02
 Total PSPW energy   :  -0.2562143558E+02
 Total PSPW energy   :  -0.2561106019E+02
 Total PSPW energy   :  -0.2618964488E+02
 Total PSPW energy   :  -0.2615732606E+02
 Total PSPW energy   :  -0.2636682412E+02
 Total PSPW energy   :  -0.2633912461E+02
 Total PSPW energy   :  -0.2639189830E+02
 Total PSPW energy   :  -0.2638634234E+02
 Total PSPW energy   :  -0.2646209976E+02
 Total PSPW energy   :  -0.2645190214E+02
 Total PSPW energy   :  -0.2649565926E+02
 Total PSPW energy   :  -0.2655957858E+02
 Total PSPW energy   :  -0.2634583373E+02
 Total PSPW energy   :  -0.2656000596E+02
 Total PSPW energy   :  -0.2615164061E+02
 Total PSPW energy   :  -0.2648948900E+02
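
The same comparison can be done without eyeballing the grep output. The following is a rough sketch that pulls the "Total PSPW energy" values out of an output file's text and flags positive energies; the function names are hypothetical and the sample strings are just the first two lines of each listing above.

```python
import re

# Matches lines such as " Total PSPW energy   :  -0.7403126784E+02"
ENERGY_RE = re.compile(r"Total PSPW energy\s*:\s*([-+]?\d*\.\d+E[-+]\d+)")


def pspw_energies(text):
    """Return all total PSPW energies found in the text, in order."""
    return [float(m.group(1)) for m in ENERGY_RE.finditer(text)]


def summarize(energies):
    """Small summary that makes a diverging run easy to spot."""
    return {
        "steps": len(energies),
        "first": energies[0],
        "last": energies[-1],
        "any_positive": any(e > 0 for e in energies),
    }


# First two lines of each listing in this post.
sample_good = (
    " Total PSPW energy   :  -0.7403126784E+02\n"
    " Total PSPW energy   :  -0.7403944621E+02\n"
)
sample_bad = (
    " Total PSPW energy   :   0.7974031861E+02\n"
    " Total PSPW energy   :   0.7246518114E+02\n"
)

print(summarize(pspw_energies(sample_good)))
print(summarize(pspw_energies(sample_bad)))
```

For a real run, read the text with `open("nwch.nwout").read()` and pass it to `pspw_energies`.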

The step/energy data for the first cycle has positive, rather than negative, energies that descend towards zero:
== Energy Calculation ==

          ====== Grassmann conjugate gradient iteration ======
     >>>  ITERATION STARTED AT Wed Nov 13 23:30:15 2013  <<<
    iter.           Energy         DeltaE       DeltaRho 
     -  15 steepest descent iterations performed
      10    0.9582453153E+02   -0.17155E+00    0.52948E-04
     -  10 steepest descent iterations performed
      20    0.9035909073E+02   -0.19625E+00    0.47200E-05
     -  10 steepest descent iterations performed
      30    0.8605965298E+02   -0.65226E-01    0.72813E-05
     -  10 steepest descent iterations performed
      40    0.8398586767E+02   -0.73599E-01    0.65304E-06
     -  10 steepest descent iterations performed
      50    0.8228907713E+02   -0.30066E-01    0.22806E-05
     -  10 steepest descent iterations performed
      60    0.8161797538E+02   -0.21635E-01    0.33675E-06
     -  10 steepest descent iterations performed
      70    0.8081694861E+02   -0.16072E-01    0.31301E-06
     -  10 steepest descent iterations performed
      80    0.8048826777E+02   -0.10237E-01    0.27982E-06
     -  10 steepest descent iterations performed
      90    0.8013146841E+02   -0.20216E-02    0.15217E-06
     100    0.8000423850E+02   -0.22420E-01    0.14079E-04
     -  10 steepest descent iterations performed
     110    0.7984322686E+02   -0.14763E-02    0.19813E-06
     120    0.7979788211E+02   -0.13952E-01    0.70838E-05
     -  10 steepest descent iterations performed
     130    0.7974031861E+02   -0.64662E-03    0.10059E-06
     140    0.7974031861E+02    0.17764E-14    0.31320E-30
  *** energy going up. iteration not terminated
  *** tolerance ok. iteration terminated
     >>>  ITERATION ENDED   AT Wed Nov 13 23:49:46 2013  <<<
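
For anyone trying to reproduce this, a small sketch along these lines can scan the iteration table and pick out rows with a positive total energy or a positive DeltaE. The row layout (iter, Energy, DeltaE, DeltaRho) is taken from the output above; the function names are made up, and the sample contains two rows from the failing run plus the last row of the working run.

```python
import re

NUM = r"[-+]?\d*\.\d+E[-+]\d+"
# A table row looks like: "     140    0.7974031861E+02    0.17764E-14    0.31320E-30"
ROW_RE = re.compile(rf"^\s*(\d+)\s+({NUM})\s+({NUM})\s+({NUM})\s*$")


def parse_rows(text):
    """Return (iter, energy, delta_e, delta_rho) tuples for each table row."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append((int(m.group(1)),) + tuple(float(g) for g in m.groups()[1:]))
    return rows


def suspicious(rows):
    """Rows where the energy is positive or the energy went up (DeltaE > 0)."""
    return [r for r in rows if r[1] > 0 or r[2] > 0]


sample = """\
     -  10 steepest descent iterations performed
     130    0.7974031861E+02   -0.64662E-03    0.10059E-06
     140    0.7974031861E+02    0.17764E-14    0.31320E-30
     270   -0.7403126784E+02   -0.99193E-07    0.18719E-09
"""

for row in suspicious(parse_rows(sample)):
    print("suspicious row:", row)
```

The "steepest descent iterations performed" lines are skipped because they do not start with an iteration number.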

NWChem was compiled using the following script:
export NWCHEM_TOP=`pwd`
export TCGRSH=/usr/bin/ssh
export NWCHEM_MODULES="all python"
export PYTHONHOME=/usr
export BLASOPT="-L/opt/openblas/lib -lopenblas"

export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/lib/openmpi/lib
export MPI_INCLUDE=/usr/lib/openmpi/include
export LIBRARY_PATH="$LIBRARY_PATH:/usr/lib/openmpi/lib:/opt/openblas/lib"

export LIBMPI="-lmpi -lopen-rte -lopen-pal -ldl -lmpi_f77 -lpthread"

cd $NWCHEM_TOP/src

make clean
make nwchem_config
make FC=gfortran 1> make.log 2>make.err

cd $NWCHEM_TOP/contrib
export FC=gfortran

The main thing that shows up in make.err for nwchem 6.3 on the bulldozer cores, but not on e.g. the Phenom II cpu, is this linker warning:
/usr/bin/ld: Warning: alignment 16 of symbol `cface_' in /opt/nwchem/nwchem-6.3-src.2013-05-28/lib/LINUX64/libstepper.a(stpr_face.o) is smaller than 32 in /opt/nwchem/nwchem-6.3-src.2013-05-28/lib/LINUX64/libstepper.a(stpr_partit.o)

This doesn't show up for nwchem 6.1.1.
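
To compare builds across hosts or versions, something like the sketch below could collect these alignment warnings from make.err. The regular expression is written against the single warning line quoted above, so it is an assumption that other ld alignment warnings have the same shape; `alignment_warnings` is a hypothetical helper name.

```python
import re

# Written against the warning quoted in this post; other ld messages may differ.
ALIGN_RE = re.compile(
    r"Warning: alignment (\d+) of symbol `([^']+)' in (\S+) "
    r"is smaller than (\d+) in (\S+)"
)


def alignment_warnings(text):
    """Return one dict per ld alignment warning found in the text."""
    found = []
    for m in ALIGN_RE.finditer(text):
        found.append(
            {
                "symbol": m.group(2),
                "alignment": int(m.group(1)),
                "object": m.group(3),
                "larger_alignment": int(m.group(4)),
                "larger_object": m.group(5),
            }
        )
    return found


sample = (
    "/usr/bin/ld: Warning: alignment 16 of symbol `cface_' in "
    "/opt/nwchem/nwchem-6.3-src.2013-05-28/lib/LINUX64/libstepper.a(stpr_face.o) "
    "is smaller than 32 in "
    "/opt/nwchem/nwchem-6.3-src.2013-05-28/lib/LINUX64/libstepper.a(stpr_partit.o)"
)

for w in alignment_warnings(sample):
    print(w["symbol"], w["alignment"], "vs", w["larger_alignment"])
```

Running it over make.err from each node and diffing the symbol lists would show whether `cface_` is the only common block affected.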

Using nwchem 6.1.1, both the FX8150 and the FX8350 work:
          ====== Grassmann conjugate gradient iteration ======
     >>>  ITERATION STARTED AT Thu Nov 14 16:38:02 2013  <<<
    iter.           Energy         DeltaE       DeltaRho 
      10   -0.5440813275E+02   -0.23233E+01    0.72286E-02
     -  10 steepest descent iterations performed
      20   -0.7080853812E+02   -0.14463E+00    0.19753E-03
     -  10 steepest descent iterations performed
      30   -0.7308197239E+02   -0.20838E-01    0.16698E-04
     -  10 steepest descent iterations performed
      40   -0.7349928511E+02   -0.14939E-01    0.93367E-03
     -  10 steepest descent iterations performed
      50   -0.7361462983E+02   -0.92625E-02    0.88102E-05
      60   -0.7369416281E+02   -0.57087E-02    0.34645E-04
     310   -0.7403126685E+02   -0.12879E-06    0.44149E-09
     320   -0.7403126810E+02   -0.86892E-07    0.25214E-09
  *** tolerance ok. iteration terminated
     >>>  ITERATION ENDED   AT Thu Nov 14 17:03:46 2013  <<<