TCE restart on BlueGene/Q


Clicked A Few Times
Dear users and developers,

I am trying to run NWChem version 6.8.1 on an IBM BG/Q cluster. I compiled the program with the following script:
#!/bin/bash
export NWCHEM_TOP=/home/i/ihamilto/ehlert/NWCHEM/omp/nwchem-6.8.1
export NWCHEM_TARGET=BGQ
export NWCHEM_MODULES="qm"
export ARMCI_NETWORK=MPI-TS
export USE_MPI=y
export USE_MPIF=y
export USE_OPENMP=y
export LARGE_FILES=TRUE
export MPI_INCLUDE=/bgsys/drivers/ppcfloor/comm/xl/include
ESSL="/opt/ibmmath/essl/5.1/lib64/libesslsmpbg.a"
export BLAS_SIZE=4
export USE_64TO32=y
LAPACK="/home/i/ihamilto/ehlert/NWCHEM/lapack-3.8.0/liblapack.a "
export BLASOPT="$LAPACK $ESSL -Wl,-zmuldefs -lxlsmp"
export DISABLE_GAMIRROR=y

The compilation runs without any issues (except for ccsd_trpdrv_omp.F, where I adapted the XLF compiler options).

Hartree-Fock and CCSD(T) calculations also run; however, when I try to save integrals and amplitudes, I run into problems. Here is my sample input file:
memory stack 1200 mb heap 500 mb global 2000 mb

geometry
H  1.0 1.0 0.0
H -1.0 1.0 0.0
symmetry c1
end
basis
  * library def2-qzvpp
end

SCF
 direct
end

TCE
  ccsd(t)
  2eorb
  tilesize 16
END
set tce:save_integrals T T T T T
set tce:save_t T T T T
set tce:read_t T T T T

task tce energy

The last lines of the output look as follows:
Parallel file system coherency ......... OK
 Saving 1-electron integrals now...
      f1_restart_save filename: ./nwchem.f1_copy                                                                
                                                         f1_restart_save finished
 
 Fock matrix recomputed
 1-e file size   =             4900
 1-e file name   = ./nwchem.f1
 Cpu & wall time / sec            2.6            2.6
 4-electron integrals stored in orbital form
 
 v2    file size   =          4556750
 4-index algorithm nr.   1 is used
 imaxsize =       30
 imaxsize ichop =        0
 v2int file size   =          6592705
 Cpu & wall time / sec            5.0            5.0
 Saving 2-electron integrals now...
      v2_restart_save filename: ./nwchem.v2_copy                                                                
 hashn: addr            4 key            1
  length  3
 hashn: addr            4 key            2
 hashn: addr            4 key            3
  length  3
...
 tce_hash_n: key not found                   1


I already tried some 2emet/2eorb variations, but without success. It is not at all clear to me where the problem is, especially because the one-electron integrals and amplitudes are written properly. When I change the tce:save_integrals line to:
set tce:save_integrals T F F F F 

the program runs without an error (however it's not useful ^^).

I am very thankful for any hint, advice or solution.
Thanks in advance!
Christopher

Forum Vet
Please try the following input (and swap the t/f values of tce:readint/tce:writeint and of tce:readt/tce:writet for the restart run):
start h2_tce
memory stack 1200 mb heap 500 mb global 2000 mb

geometry
H  1.0 1.0 0.0
H -1.0 1.0 0.0
symmetry c1
end
basis
  * library def2-qzvpp
end

SCF
 direct
end

TCE
  ccsd(t)
  2eorb
  2emet 15
  tilesize 16
END

set tce:writeint t
set tce:readint f

set tce:writet t
set tce:readt f


set tce:tceiop 2048


set tce:nts t

task tce energy
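
A sketch of the restart run described above, assuming only the four read/write flags change and the rest of the input (including tce:tceiop and tce:nts) stays as in the first run:

```
set tce:writeint f
set tce:readint t

set tce:writet f
set tce:readt t
```

With these settings the run would read the integrals and amplitudes written by the first run instead of writing them again.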

Clicked A Few Times
Hi Edoapra,

thank you for your answer. I tried the input and the integrals are written (I can see the files). However, the CCSD iterations look strange:
 t2 file size   =             7618
 t2 file name   = ./h2_tce.t2
 t2 file handle =       -996
CCSD iterations
 ---------------------------------------------------------
 Iter          Residuum       Correlation     Cpu    Wall 
 ---------------------------------------------------------
NEW TASK SCHEDULING
CCSD_T1_NTS --- OK
CCSD_T2_NTS --- OK
    1   0.0000000000484  -0.0000000000404     3.4     3.4
 -----------------------------------------------------------------
 Iterations converged
 CCSD correlation energy / hartree =        -0.000000000040354
 CCSD total energy / hartree       =        -0.926129209689597


However, on my workstation it runs correctly. Do you have any idea what the problem might be?

Thanks again,
Christopher

Forum Vet
Christopher
Since we have not spent much time testing or optimizing NWChem on BlueGene/Q, my suggestion is to move to a different platform if you have the opportunity.
If you have to stick to BG/Q, my suggestions are:
1) Instead of TCE, try the CCSD module if you intend to study closed-shell molecules
https://github.com/nwchemgit/nwchem/wiki/CCSD
2) If you want to fix TCE problems on BG/Q, try to find a baseline that works by
i) using a single process
ii) compiling without OpenMP
iii) recompiling TCE with no optimization, e.g.
make FOPTIMIZE="-O0 -g" FDEBUG="-O0 -g"

Clicked A Few Times
Hi Edoapra,

thanks for your help. I finally managed to solve the problem by replacing all integer types with long types in the tce/sort/tce_sort_4kg.c file. The whole TCE part was compiled with "-qintsize=8", but the C code used 32-bit integers, so the two collided. The code now runs with "2emet 15".

Maybe this info is useful for someone,
best,
Christopher

Forum Vet
Thank you very much for the bug report.
My guess is that the code is likely to work on other 64-bit CPUs since they are little-endian, while the PowerPC A2 on BG/Q is big-endian, and that causes the long/int breakage.
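
To illustrate the guess above, here is a minimal C sketch. It assumes the mismatch shows up as a callee that expects a 32-bit int reading only the first four bytes of an 8-byte integer passed by the Fortran caller. The function names are hypothetical and are not taken from tce_sort_4kg.c; the byte order is simulated explicitly so the sketch behaves the same on any host.

```c
#include <stdint.h>

/* Hypothetical helper: store a 64-bit value into 8 bytes in the
   requested byte order (big_endian != 0 means most significant first). */
static void store_int64(unsigned char out[8], int64_t value, int big_endian)
{
    uint64_t u = (uint64_t)value;
    for (int i = 0; i < 8; i++) {
        int shift = big_endian ? 8 * (7 - i) : 8 * i;
        out[i] = (unsigned char)((u >> shift) & 0xffu);
    }
}

/* Hypothetical helper: interpret the FIRST 4 bytes of the buffer as a
   32-bit integer in the same byte order. */
static int32_t load_int32(const unsigned char in[8], int big_endian)
{
    uint32_t u = 0;
    for (int i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        u |= (uint32_t)in[i] << shift;
    }
    return (int32_t)u;
}

/* What a callee declared with a 32-bit int would see if the caller
   actually passed the address of a 64-bit integer. */
int32_t seen_as_int32(int64_t value, int big_endian)
{
    unsigned char bytes[8];
    store_int64(bytes, value, big_endian);
    return load_int32(bytes, big_endian);
}
```

For a small value such as 3, seen_as_int32(3, 0) (little-endian) returns 3, so the mismatch goes unnoticed, while seen_as_int32(3, 1) (big-endian) returns 0 because the first four bytes hold the high-order half of the value.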

Forum Vet
Christopher
I have opened a GitHub issue on this topic
https://github.com/nwchemgit/nwchem/issues/16
and pushed a fix (thanks to your suggestion) to the hotfix/release-6-8 and master branches.
If you have time to test this fix, your help would be greatly appreciated.
To check out hotfix/release-6-8, please type

git clone -b hotfix/release-6-8 https://github.com/nwchemgit/nwchem.git nwchem-6.8.1

Then change NWCHEM_TOP to .../nwchem-6.8.1

Clicked A Few Times
Ok,

the patch is approved!

However, regarding the initial question, I also had to remove the "USE_EAF" macro in "src/tce/tensor_read_write.F", and I modified ccsd_energy_loc.F by adding these lines after line 180:
      if (write_ta .and. (mod(iter,save_interval).eq.0)) then
        if (nodezero) then
          write(LuOut,*) 'Saving Amplitudes now...'
        endif
        call util_file_name0('t1amp',.false.,.true.,filename,fldgts)
        unitn=79
        call write_tensor(filename,d_t1,size_t1,unitn)
        call util_file_name0('t2amp',.false.,.true.,filename,fldgts)
        unitn=80
        call write_tensor(filename,d_t2,size_t2,unitn)
        call ga_sync()
      endif


One can then use the following input:
set tce:writeint t
set tce:readint f
set tce:writet t
set tce:readt f
set tce:save_interval 10
set tce:tceiop 2048
set tce:nts t

to save the amplitudes every 10 iterations. I think that's useful.

best,
Christopher

Forum Vet
Thanks for the feedback.
Could you please continue this topic on the GitHub issue at
https://github.com/nwchemgit/nwchem/issues/16
Cheers, Edo

Forum Vet
Dear Dr. Edoapra,
Your input produces results with negligible differences using NWChem 6.8 on both Ubuntu 17.10 (repeated three times) and macOS High Sierra 10.13.2.
 
I have not employed your patch.
I have already put the log files on your GitHub topic.

Previously, I misread the correction energy as the correlation energy, out of haste and carelessness.
Sorry for that.

It is clearly stated in the NWChem 6.6 manual that "The only platform for which
restart may cause I/O problems is BlueGene, due to ratio of compute to I/O nodes
(64 on BlueGene/P)".

 Very Best Regards!
