NWChem 6.5 and Intel Xeon Phis with intel 14.0.3 and intelmpi 4.1.3


Clicked A Few Times
Hi All,
I'm after some help with compiling Nwchem-6.5.revision26243 for use with Intel Xeon Phis.
I'm using intel 14.0.3 and intelmpi 4.1.3.

If I set the following environment variables:
export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX_UBOUND=65536
export USE_MPI=y
export NWCHEM_MODULES=all\ python
export USE_MPIF=y
export USE_MPIF4=y
export LIBMPI="-lmpi -lmpigf -lmpigi -lrt -lpthread"
export SCALAPACK_LIB=" -mkl -openmp -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export SCALAPACK="$SCALAPACK_LIB"
export LAPACK_LIB="-mkl -openmp  -lpthread -lm"
export BLAS_LIB="$LAPACK_LIB"
export BLASOPT="$LAPACK_LIB"
export USE_SCALAPACK=y
export SCALAPACK_SIZE=8
export BLAS_SIZE=8
export LAPACK_SIZE=8
export PYTHONHOME=/usr
export PYTHONVERSION=2.6
export PYTHONLIBTYPE=so
export USE_PYTHON64=y
export USE_CPPRESERVE=y
export USE_NOFSCHECK=y
export LIB_DEFINES="${LIB_DEFINES}  -DDFLT_TOT_MEM=417262336"
export USE_CCSDTQ=yes


and then use the contrib/distro-tools/build_nwchem script, I get an executable that runs perfectly. However, if I set the two environment variables required for OpenMP and Xeon Phi off-loading:
export USE_OPENMP=1
export USE_OFFLOAD=1

I continuously get the errors:
           Directory information
           ---------------------

  0 permanent = /scratch/cjn/svn/NWChem-6.5-2/Nwchem-6.5.revision26243-src.2014-09-10/QA/scratchdir
  0 scratch   = /scratch/cjn/svn/NWChem-6.5-2/Nwchem-6.5.revision26243-src.2014-09-10/QA/scratchdir


                     0 ppn                          4
0:Floating Point Exception error, status=: 8
1:Floating Point Exception error, status=: 8
(rank:0 hostname:pillowb2 pid:17058):ARMCI DASSERT fail ../../ga-5-3/armci/src/common/signaltrap.c:SigFpeHandler():249 cond:0
2:Floating Point Exception error, status=: 8
3:Floating Point Exception error, status=: 8
(rank:1 hostname:pillowb2 pid:17059):ARMCI DASSERT fail ../../ga-5-3/armci/src/common/signaltrap.c:SigFpeHandler():249 cond:0
(rank:2 hostname:pillowb2 pid:17060):ARMCI DASSERT fail ../../ga-5-3/armci/src/common/signaltrap.c:SigFpeHandler():249 cond:0
(rank:3 hostname:pillowb2 pid:17061):ARMCI DASSERT fail ../../ga-5-3/armci/src/common/signaltrap.c:SigFpeHandler():249 cond:0

Does anyone have any ideas on what is going on and how to fix this?

kind regards,

Chris.

Forum Vet
mpirun options
Chris
What command are you using to start NWChem?

Clicked A Few Times
Hi Edoardo,

I'm typing
mpirun $NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem auh2o.nw > auh2o.out


The actual command (as taken from the NWChem output file) is:
mpirun --rsh=ssh --parallel-startup -perhost 16 -n 4 -envuser $NWCHEM_TOP/bin/LINUX64/nwchem auh2o.nw

Forum Vet
Chris
I am afraid that something is going wrong in the code that detects the Xeon Phi cards.
Could you please try the following:
1) edit the file
$NWCHEM_TOP/src/util/util_mic_support.c
change line 11 from
#define DEBUG_ 1
to
#define DEBUG 1
(that is, delete the "underscore" sign)
2) recompile
cd $NWCHEM_TOP/src/util
make FC=ifort
3) relink
cd ..
make FC=ifort link

Could you please post the new error/debug messages you get when trying to run the code, plus the output of the commands described at 2) and 3)?
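The three steps above can be sketched as a small shell script. This is only a sketch under assumptions: `enable_mic_debug` is a hypothetical helper name, the macro is matched by its text rather than by line number 11, the exact spacing `#define DEBUG_ 1` is assumed to match the file, and the log file names are made up for illustration.

```shell
#!/bin/sh
# Sketch only. enable_mic_debug is a hypothetical helper, not part of NWChem.
enable_mic_debug() {
    src="$1"
    # Keep a backup, then turn "#define DEBUG_ 1" into "#define DEBUG 1".
    # Assumes the line is spelled exactly like this in the file.
    cp "$src" "$src.orig" &&
    sed 's/^#define DEBUG_ 1$/#define DEBUG 1/' "$src.orig" > "$src"
}

# Only edit and rebuild when NWCHEM_TOP points at a source tree.
if [ -n "$NWCHEM_TOP" ] && [ -f "$NWCHEM_TOP/src/util/util_mic_support.c" ]; then
    enable_mic_debug "$NWCHEM_TOP/src/util/util_mic_support.c"
    # Step 2: recompile the util directory, keeping the output for posting.
    (cd "$NWCHEM_TOP/src/util" && make FC=ifort 2>&1 | tee make_util.log)
    # Step 3: relink, again keeping the output.
    (cd "$NWCHEM_TOP/src" && make FC=ifort link 2>&1 | tee make_link.log)
fi
```

The `tee` calls save the output of steps 2) and 3) so it can be pasted into a reply, and the `.orig` copy makes it easy to restore the original file afterwards.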

Thanks

Clicked A Few Times
Hi again Edoardo - apologies for the delay in replying.

I've got the QA tests running on a single node with 2 Phis attached.

The next issue I have is when I try to run any calculation on more than one node.

e.g. the C2H4.nw QA test gives the following when run on nodes=2:ppn=2

2: error ival=-5
(rank:2 hostname:pillowb5 pid:78877):ARMCI DASSERT fail. ../../ga-5-3/armci/src/devices/openib/openib.c:armci_send_complete():459 cond:(pdscr->status==IBV_WC_SUCCESS)
Last System Error Message from Task 2:: Bad address
3: error ival=5
(rank:3 hostname:pillowb5 pid:78877):ARMCI DASSERT fail. ../../ga-5-3/armci/src/devices/openib/openib.c:armci_send_complete():459 cond:(pdscr->status==IBV_WC_SUCCESS)
Last System Error Message from Task 3:: Bad address
application called MPI_Abort(comm=0x84000001, 1) - process 3
application called MPI_Abort(comm=0x84000001, 1) - process 2
rank = 3, revents = 8, state = 8
Assertion failed in file ../../socksm.c at line 2963: (it:plfd->revents & POLLERR) == 0
internal ABORT - proces 1
rank = 2, revents = 8, state = 8
Assertion failed in file ../../socksm.c at line 2963: (it:plfd->revents & POLLERR) == 0
internal ABORT - proces 0


I do not get this problem when I don't use the USE_OPENMP=1 and USE_OFFLOAD=1 variables when compiling.
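For comparing the two builds side by side, the toggle can be sketched as a shell function. This is a sketch under assumptions: `set_offload` is a hypothetical helper, and "off" is taken to mean leaving both variables unset, since that is the only non-offload configuration described in this thread.

```shell
#!/bin/sh
# Sketch only. set_offload is a hypothetical helper, not an NWChem command.
set_offload() {
    case "$1" in
        on)
            # The two variables named earlier in this thread.
            export USE_OPENMP=1
            export USE_OFFLOAD=1
            ;;
        off)
            # Assumption: the plain build is the one with both unset.
            unset USE_OPENMP USE_OFFLOAD
            ;;
        *)
            echo "usage: set_offload on|off" >&2
            return 1
            ;;
    esac
}
```

Calling `set_offload on` or `set_offload off` before running the build script then selects which variant gets compiled, with the rest of the environment left unchanged.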

Forum Vet
ARMCI_OPENIB_DEVICE mlx4_0
Did you set ARMCI_OPENIB_DEVICE=mlx4_0?

http://nwchemgit.github.io/index.php/Compiling_NWChem#How-to:_Intel_Xeon_Phi
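A minimal sketch of setting the variable in the shell is shown here. `choose_armci_device` is a hypothetical helper, and `mlx4_0` is simply the device name given in this thread; other systems may use a different name. The linked page is the authority on whether the variable is needed when compiling, when running, or both, and whether it reaches ranks on remote nodes depends on the MPI launcher options.

```shell
#!/bin/sh
# Sketch only. choose_armci_device is a hypothetical helper, not part of NWChem or GA.
choose_armci_device() {
    # Keep a value the user has already set; otherwise fall back to mlx4_0,
    # the device name mentioned in this thread.
    ARMCI_OPENIB_DEVICE="${ARMCI_OPENIB_DEVICE:-mlx4_0}"
    export ARMCI_OPENIB_DEVICE
}

choose_armci_device
echo "ARMCI_OPENIB_DEVICE=$ARMCI_OPENIB_DEVICE"

# Then build or launch as before, for example:
# mpirun $NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem auh2o.nw > auh2o.out
```

Putting the export in the same shell session (or job script) as both the build and the `mpirun` call keeps the two consistent.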

Clicked A Few Times
Ahh, no, I missed that - I had an old printed copy of the compile page where that instruction was missing.

I'll give it a go and let you know how I get on.

many thanks.

Chris.

Clicked A Few Times
Thanks for all of your help Edoardo,
I seem to have it working now.
Looking forward to hearing the NWChem talk at SC14 next week.
regards,
Chris.

