Difficulties offloading the CCSD calculation to Xeon Phi


Just Got Here
Hello,

I'm having some difficulties getting the Xeon Phi offload working with NWChem 6.6.

I compiled NWChem without errors, and the vectorization output during compilation showed that the offload sections were properly compiled. I do not have any problems executing the application, and NWChem identifies the Xeon Phi devices (verified with OFFLOAD_REPORT=2).

NWC_RANKS_PER_DEVICE=1 mpirun -np 2 ./nwchem ./input.nw
 argument  1 = ./input.nw

...


                     0  ppn                      2
                     0  offload_master 
                     1  offload_master 
[Offload] [MIC 0] [File]            util_mic_support.c
[Offload] [MIC 0] [Line]            198
[Offload] [MIC 0] [Tag]             Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        3.734172(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time]        0.000408(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data]   4 (bytes)

00: micdev=0 nprocs=59 rank_on_dev=0 ranks_per_device=1 affinity='KMP_PLACE_THREADS=59c,4t,0o' pos=27
[Offload] [MIC 0] [File]            util_mic_support.c
[Offload] [MIC 0] [Line]            216
[Offload] [MIC 0] [Tag]             Tag 1
[Offload] [HOST]  [Tag 1] [CPU Time]        0.004035(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data]   512 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time]        0.003767(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data]   0 (bytes)

[Offload] [MIC 1] [File]            util_mic_support.c
[Offload] [MIC 1] [Line]            198
[Offload] [MIC 1] [Tag]             Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        3.839402(seconds)
[Offload] [MIC 1] [Tag 0] [CPU->MIC Data]   0 (bytes)
[Offload] [MIC 1] [Tag 0] [MIC Time]        0.000371(seconds)
[Offload] [MIC 1] [Tag 0] [MIC->CPU Data]   4 (bytes)

01: micdev=1 nprocs=59 rank_on_dev=0 ranks_per_device=1 affinity='KMP_PLACE_THREADS=59c,4t,0o' pos=27
[Offload] [MIC 1] [File]            util_mic_support.c
[Offload] [MIC 1] [Line]            216
[Offload] [MIC 1] [Tag]             Tag 1
[Offload] [HOST]  [Tag 1] [CPU Time]        0.004166(seconds)
[Offload] [MIC 1] [Tag 1] [CPU->MIC Data]   512 (bytes)
[Offload] [MIC 1] [Tag 1] [MIC Time]        0.003898(seconds)
[Offload] [MIC 1] [Tag 1] [MIC->CPU Data]   0 (bytes)
...


The problem is that this is the only OFFLOAD_REPORT output generated during execution. Nothing is reported during the CCSD calculation, which should also be offloaded. Since no reports are generated anywhere else, I can only conclude that the computation is being handled by the CPU. I suspect the problem is with my input, so I'm hoping someone can help me come up with a working example.
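One way to check this conclusion against a captured log is to tally the report entries by source file. The following is a minimal sketch, not part of NWChem: the function name `offload_sources` and the sample text are illustrative, and it assumes each report entry begins with a `[File]` line in the format shown above.

```python
import re
from collections import Counter

# Matches the "[File]" line that opens each OFFLOAD_REPORT entry.
# search() is used so that a leading rank prefix such as "0: " is tolerated.
FILE_LINE = re.compile(r"\[Offload\]\s+\[MIC (\d+)\]\s+\[File\]\s+(\S+)")


def offload_sources(log_text):
    """Count offload report entries per (device number, source file)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FILE_LINE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# Sample lines copied from the report above (shortened).
sample = """\
[Offload] [MIC 0] [File]            util_mic_support.c
[Offload] [MIC 0] [Line]            198
[Offload] [MIC 0] [Tag]             Tag 0
[Offload] [MIC 0] [File]            util_mic_support.c
[Offload] [MIC 0] [Line]            216
[Offload] [MIC 1] [File]            util_mic_support.c
[Offload] [MIC 1] [Line]            198
"""

counts = offload_sources(sample)
for (device, source), n in sorted(counts.items()):
    print(f"MIC {device}: {source}: {n} report(s)")
```

If the only source file that ever appears is util_mic_support.c, the offloads seen are limited to device setup, and no compute kernel was offloaded.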

My input file is based on the ccsd sample, and it runs very well on the CPU.

start n2 

geometry
  symmetry d2h
  n 0 0 0.542
end

basis spherical
  n library cc-pvtz
end

mp2
  freeze core
end

task mp2 optimize

ccsd
  freeze core
end

set ccsd:use_trpdrv_omp T
set ccsd:use_trpdrv_offload T

task ccsd(t)
task ccsd
task ccsd+t(ccsd)


According to my sources (Intel and the CCSD documentation), this input file should work, and I do not understand why the calculation is not being offloaded. I've also tried all three versions of the CCSD calculation, and verified that NWC_RANKS_PER_DEVICE=0 runs exclusively on the CPU. I'm currently running the input file provided by the Intel source, but I may not have an answer until tomorrow because of the time required for each iteration.

Also, to illustrate what my goals are: I'm not using NWChem to solve any particular problem, I'm 100% interested in utilizing the offload to Xeon Phi for performance testing and execution modeling. I just need an input file that will use NWChem in offload mode, reported correctly by OFFLOAD_REPORT.

All help is greatly appreciated, thank you.
Gary

Forum Vet
Gary
The problem with your input file is that it uses a different (older) NWChem implementation of the CCSD(T) method that has not been ported to the Intel MIC hardware yet.
As described in the documentation at
http://nwchemgit.github.io/index.php/Release66:TCE#CCSD.28T.29_and_MRCCSD.28T.29_implementat...
only the CCSD(T) and MRCCSD(T) parts of the TCE NWChem module have been ported to the Intel MIC hardware.

Here is the input file you will need to use:

start n2

geometry
  symmetry d2h
  n 0 0 0.542
end

basis spherical
  n library cc-pvtz
end

tce
  ccsd(t)
  tilesize 24
  freeze core
  2eorb
  2emet 15
  thresh 1.d+5
end

set tce:nts t

task tce


Here is a snippet of the output file. When the (T) part of CCSD(T) starts, you can see the Offload Report monitoring data movement.
0:  ---------------------------------------------------------
0:  Iter          Residuum       Correlation     Cpu    Wall
0:  ---------------------------------------------------------
0: NEW TASK SCHEDULING
0: CCSD_T1_NTS --- OK
0: CCSD_T2_NTS --- OK
0:     1   0.1917830793683  -0.4023820724643    14.6     0.9
0:  -----------------------------------------------------------------
0:  Iterations converged
0:  CCSD correlation energy / hartree =        -0.402382072464322
0:  CCSD total energy / hartree       =      -109.382986731335805
0:
0:  Singles contributions
0:
0:  Doubles contributions
0:  CCSD(T)
0:  Using plain CCSD(T) code
0: [Offload] [MIC 0] [File]                    ccsd_t.f
0: [Offload] [MIC 0] [Line]                    1557
0: [Offload] [MIC 0] [Tag]                     Tag 2
0: [Offload] [HOST]  [Tag 2] [CPU Time]        0.007342(seconds)
0: [Offload] [MIC 0] [Tag 2] [MIC Time]        0.000077(seconds)
0:
0: [Offload] [MIC 0] [File]                    ccsd_t.f
0: [Offload] [MIC 0] [Line]                    1557
0: [Offload] [MIC 0] [Tag]                     Tag 3
0: [Offload] [HOST]  [Tag 3] [CPU Time]        0.004298(seconds)
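For performance testing, the timing lines in a report like the one above can be collected per tag. This is a rough sketch under stated assumptions: `offload_times` is an illustrative name, and the log is assumed to come from a single rank, because the `[HOST]` lines do not carry a device number and tag numbers repeat across devices.

```python
import re

# Matches "[Tag N] [CPU Time] x(seconds)" and "[Tag N] [MIC Time] x(seconds)".
# The bare "[Tag]  Tag N" header line and the data-transfer lines do not match.
TIME_LINE = re.compile(
    r"\[Offload\]\s+\[(?:HOST|MIC \d+)\]\s+\[Tag (\d+)\]\s+"
    r"\[(CPU|MIC) Time\]\s+([0-9.]+)\(seconds\)"
)


def offload_times(log_text):
    """Return {tag: {"CPU": seconds, "MIC": seconds}} for each report found."""
    times = {}
    for line in log_text.splitlines():
        match = TIME_LINE.search(line)
        if match:
            tag = int(match.group(1))
            side = match.group(2)          # "CPU" (host side) or "MIC" (device side)
            times.setdefault(tag, {})[side] = float(match.group(3))
    return times


# Sample lines copied from the snippet above.
sample = """\
0: [Offload] [MIC 0] [File]                    ccsd_t.f
0: [Offload] [MIC 0] [Line]                    1557
0: [Offload] [MIC 0] [Tag]                     Tag 2
0: [Offload] [HOST]  [Tag 2] [CPU Time]        0.007342(seconds)
0: [Offload] [MIC 0] [Tag 2] [MIC Time]        0.000077(seconds)
0: [Offload] [HOST]  [Tag 3] [CPU Time]        0.004298(seconds)
"""

times = offload_times(sample)
for tag, sides in sorted(times.items()):
    print(tag, sides)
```

A tag with a CPU time but no MIC time, such as Tag 3 here, only means the log was cut off before the device-side line was printed.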


Just Got Here
Edoapra,

This is exactly what I was looking for, thank you!! It worked beautifully!
Gary

