Hello,
I'm having some difficulties getting the Xeon Phi offload working with NWChem 6.6.
I compiled NWChem without errors, and the vectorization output during compilation showed that the offload sections were properly compiled. I do not have any problems executing the application, and NWChem identifies the Xeon Phi devices (verified with OFFLOAD_REPORT=2).
NWC_RANKS_PER_DEVICE=1 mpirun -np 2 ./nwchem ./input.nw
argument 1 = ./input.nw
...
0 ppn 2
0 offload_master
1 offload_master
[Offload] [MIC 0] [File] util_mic_support.c
[Offload] [MIC 0] [Line] 198
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [CPU Time] 3.734172(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time] 0.000408(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 4 (bytes)
00: micdev=0 nprocs=59 rank_on_dev=0 ranks_per_device=1 affinity='KMP_PLACE_THREADS=59c,4t,0o' pos=27
[Offload] [MIC 0] [File] util_mic_support.c
[Offload] [MIC 0] [Line] 216
[Offload] [MIC 0] [Tag] Tag 1
[Offload] [HOST] [Tag 1] [CPU Time] 0.004035(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data] 512 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time] 0.003767(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data] 0 (bytes)
[Offload] [MIC 1] [File] util_mic_support.c
[Offload] [MIC 1] [Line] 198
[Offload] [MIC 1] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [CPU Time] 3.839402(seconds)
[Offload] [MIC 1] [Tag 0] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 1] [Tag 0] [MIC Time] 0.000371(seconds)
[Offload] [MIC 1] [Tag 0] [MIC->CPU Data] 4 (bytes)
01: micdev=1 nprocs=59 rank_on_dev=0 ranks_per_device=1 affinity='KMP_PLACE_THREADS=59c,4t,0o' pos=27
[Offload] [MIC 1] [File] util_mic_support.c
[Offload] [MIC 1] [Line] 216
[Offload] [MIC 1] [Tag] Tag 1
[Offload] [HOST] [Tag 1] [CPU Time] 0.004166(seconds)
[Offload] [MIC 1] [Tag 1] [CPU->MIC Data] 512 (bytes)
[Offload] [MIC 1] [Tag 1] [MIC Time] 0.003898(seconds)
[Offload] [MIC 1] [Tag 1] [MIC->CPU Data] 0 (bytes)
...
The problem I'm having is that this is the only time OFFLOAD_REPORT output is generated during execution, even during the ccsd calculation which should also be offloaded. Since these reports are not being generated elsewhere, I can only conclude that the computations are handled by the CPU. I suspect the problem is with my input, and thus I'm hoping someone can help me come up with a working example.
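To check this more systematically than by eye, the report entries can be tallied by device, source file, and line. The following is a minimal sketch in Python, assuming the report format shown in the output above; the function name summarize_offload_report is made up for illustration and is not part of NWChem or the Intel tools.

```python
import re
from collections import Counter

# Match report lines such as "[Offload] [MIC 0] [File] util_mic_support.c"
# and "[Offload] [MIC 0] [Line] 198" (format taken from the output above).
FILE_RE = re.compile(r"\[Offload\]\s+\[MIC (\d+)\]\s+\[File\]\s+(\S+)")
LINE_RE = re.compile(r"\[Offload\]\s+\[MIC (\d+)\]\s+\[Line\]\s+(\d+)")


def summarize_offload_report(text):
    """Count offload events per (device, source file, source line)."""
    counts = Counter()
    pending = {}  # device number -> file name waiting for its [Line] entry
    for raw in text.splitlines():
        m = FILE_RE.search(raw)
        if m:
            pending[int(m.group(1))] = m.group(2)
            continue
        m = LINE_RE.search(raw)
        if m:
            dev = int(m.group(1))
            fname = pending.pop(dev, None)
            if fname is not None:
                counts[(dev, fname, int(m.group(2)))] += 1
    return counts
```

Running it over the captured output of a job, for example summarize_offload_report(open("nwchem.out").read()) with a hypothetical log file name, returns a count for every offload site. If the only keys that appear are the util_mic_support.c entries, no offload region was entered during the ccsd step.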
My input file is based on the ccsd sample, and it runs very well on the CPU.
start n2
geometry
symmetry d2h
n 0 0 0.542
end
basis spherical
n library cc-pvtz
end
mp2
freeze core
end
task mp2 optimize
ccsd
freeze core
end
set ccsd:use_trpdrv_omp T
set ccsd:use_trpdrv_offload T
task ccsd(t)
task ccsd
task ccsd+t(ccsd)
According to my sources (Intel and the ccsd documentation), this input file should work, and I do not understand why the calculation is not being offloaded. I've also tried all three versions of the ccsd calculation, and verified that NWC_RANKS_PER_DEVICE=0 runs exclusively on the CPU. I'm currently running the input file provided by the Intel source, but I may not have an answer until tomorrow because of the time required to solve each iteration.
Also, to clarify my goals: I'm not using NWChem to solve any particular problem. My only interest is in using the Xeon Phi offload for performance testing and execution modeling. I just need an input file that runs NWChem in offload mode, with the offload reported correctly by OFFLOAD_REPORT.
All help is greatly appreciated, thank you.
Gary