How to use ARMCI NETWORK for NWChem 6.3 on SGI ICE X


Clicked A Few Times
I compiled NWChem 6.3 with the following settings:

setenv NWCHEM_TARGET LINUX64
setenv USE_MPI y
# setenv ARMCI_NETWORK VAPI
# setenv ARMCI_NETWORK MPI
# setenv ARMCI_NETWORK MPI2
setenv ARMCI_NETWORK OPENIB
setenv IB_HOME /usr
setenv IB_INCLUDE $IB_HOME/include
setenv IB_LIB $IB_HOME/lib64
setenv IB_LIB_NAME "-libverbs -libumad -lpthread"
setenv MA_USE_ARMCI_MEM 1
setenv MPI_LOC /opt/sgi/mpt/mpt-2.08
# setenv MPI_LOC /app/intel/impi/4.0.3.008/intel64/
setenv MPI_LIB $MPI_LOC/lib
setenv MPI_INCLUDE $MPI_LOC/include
setenv MPI_LIB $MPI_LOC/lib
setenv MPI_INCLUDE $MPI_LOC/include
setenv LIBMPI -lmpi
setenv NWCHEM_MODULES all
setenv DISABLE_F77 1
setenv MKL_LIB /app/intel/mkl/lib/intel64
setenv MKL_INC /app/intel/mkl/include
setenv INTEL_LIB /app/intel/lib/intel64/
setenv BLASOPT "-L$MKL_LIB -I$MKL_INC -L$INTEL_LIB -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm"

But the binary doesn't work when it runs across nodes (it works within a node), no matter which ARMCI_NETWORK is used. It gives the following errors:

from .out file:
argument  1 = kiet_scf.nw
-10016:Segmentation Violation error, status=: 11
(rank:-10016 hostname:r28i1n16 pid:2263722):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
16:Child process terminated prematurely, status=: 256
(rank:16 hostname:r28i1n16 pid:2263705):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/signaltrap.c:SigChldHandler():178 cond:0
-10000:Segmentation Violation error, status=: 11
(rank:-10000 hostname:r27i0n17 pid:4037891):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
0:Child process terminated prematurely, status=: 256
(rank:0 hostname:r27i0n17 pid:4037874):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/signaltrap.c:SigChldHandler():178 cond:0

from error file:

ARMCI master: wait for child process (server) failed:: No child processes
MPT: Global rank 16 is aborting with error code 256.
    Process ID: 2263705, Host: r28i1n16, Program: /work1/app/nwchem/nwchem-6.3.revision2b/bin/LINUX64/nwchem

Please advise!
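
One quick way to see whether the failures are confined to one node or spread across the job is to list the hosts named in the ARMCI DASSERT lines. The following is a minimal POSIX sh sketch, not part of NWChem; the function name failing_hosts and the output file name in the usage comment are hypothetical. Since the settings above use csh, save it as a script and run it with sh.

```shell
# List the unique hostnames that appear in "ARMCI DASSERT fail" lines.
# $1 is the path to the NWChem output file.
failing_hosts() {
  grep 'ARMCI DASSERT fail' "$1" |
    sed -n 's/.*hostname:\([^ ]*\) .*/\1/p' |
    sort -u
}

# Usage (file name is an assumption):
#   failing_hosts kiet_scf.out
```

For the excerpt above this would list both r27i0n17 and r28i1n16, which matches the report that the job only fails when it spans more than one node.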

Forum Vet
Frank
Could you please give more details about the hardware you are using?
Does it have any Intel Xeon Phi?
Thanks, Edo

Clicked A Few Times
Edo,

No, we don't have Intel Xeon Phi.

Total Nodes: 4590
Operating System: RHEL
Cores/Node: 16
Core Type: Intel E5 Sandy Bridge
Core Speed: 2.6 GHz
Memory/Node: 32 GBytes
Accessible Memory/Node: 30 GBytes
Memory Model: Shared on node; distributed across cluster
Interconnect Type: FDR 14x Infiniband; Enhanced LX Hypercube

Frank

Forum Vet
Xiaofeng
I don't see anything obviously wrong in your settings.
However, I strongly recommend unsetting MA_USE_ARMCI_MEM, since it is often a cause of crashes.
The only part that might cause you trouble is the definition of the BLAS library.
My suggestion is to switch from threaded to sequential MKL and to explicitly pass GA the size of integers (8).
To do this, please follow these steps:

1)
setenv BLASOPT "-L$MKL_LIB -L$INTEL_LIB -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lpthread -lm"
setenv BLAS_LIB "-L$MKL_LIB -L$INTEL_LIB -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lpthread -lm"
setenv BLAS_SIZE 8

2) cd $NWCHEM_TOP/src/tools

3) make FC=ifort clean

4) make FC=ifort

5) cd .. ; make FC=ifort link

Cheers, Edo

Clicked A Few Times
Hi Edo,

I recompiled it with the settings you suggested, but it still failed to run. The error message is a little different; it looks to me like it failed on rank 0 this time:

from error file:
Last System Error Message from Task 0:: Address already in use
MPT: Global rank 0 is aborting with error code -1.
    Process ID: 1931615, Host: r29i2n3, Program: /work1/app/nwchem/nwchem-6.3.revision2b/bin/LINUX64/nwchem

from .out file:
argument  1 = kiet_scf.nw
0:armci_ListenSockAll: listen failed: 0
(rank:0 hostname:r29i2n3 pid:1931615):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/sockets/sockets.c:armci_ListenSockAll():614 cond:0

Thanks,
Xiaofeng

Forum Vet
Could you please post the first 10 lines of
$NWCHEM_TOP/src/tools/build/config.log
and of
$NWCHEM_TOP/src/tools/build/armci/config.log
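
The top of an autoconf config.log records how configure was invoked, so its first lines show which options the tools build actually received. A minimal POSIX sh sketch for printing both headers in one go; show_config_heads is a hypothetical helper name, and the two paths are the ones named above.

```shell
# Print the first 10 lines of the GA and ARMCI config.log files.
# $1 is the top of the NWChem source tree (the value of NWCHEM_TOP).
show_config_heads() {
  top=$1
  for log in "$top/src/tools/build/config.log" \
             "$top/src/tools/build/armci/config.log"; do
    echo "== $log"
    if [ -f "$log" ]; then
      head -n 10 "$log"
    else
      echo "(not found)"
    fi
  done
}

# Usage:
#   show_config_heads "$NWCHEM_TOP"
```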

Clicked A Few Times
Edo,

I tried to recompile and found a configure error:

configure: searching for OPENIB...
checking infiniband/verbs.h usability... no
checking infiniband/verbs.h presence... no
checking for infiniband/verbs.h... no
configure: error: test for ARMCI_NETWORK=OPENIB failed

But if I set ARMCI_NETWORK to another value, such as
setenv ARMCI_NETWORK VAPI

then I got:

configure: WARNING: No ARMCI_NETWORK specified, defaulting to SOCKETS

I don't understand it.

Thanks,
Xiaofeng

Forum Vet
ib-verbs
Xiaofeng,
A prerequisite for ARMCI_NETWORK=OPENIB to work is to have the IB-verbs (or ibverbs) library installed. Your compilation is failing because configure cannot find it.

My suggestion is to recompile the tools using ARMCI_NETWORK=MPI-TS.
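
Since the configure check that failed is the one for infiniband/verbs.h, it may be worth confirming whether that header exists under the directory IB_INCLUDE points to. The following is a minimal POSIX sh sketch, not an NWChem or GA tool; find_verbs_header is a hypothetical helper name, and the directories in the usage comment are only examples.

```shell
# Look for infiniband/verbs.h under each directory given as an argument.
# Prints the first match and returns 0, or returns 1 if none is found.
find_verbs_header() {
  for dir in "$@"; do
    if [ -f "$dir/infiniband/verbs.h" ]; then
      echo "$dir/infiniband/verbs.h"
      return 0
    fi
  done
  echo "infiniband/verbs.h not found" >&2
  return 1
}

# Usage (directories are examples):
#   find_verbs_header /usr/include "$IB_INCLUDE"
```

If the header turns up in a different directory, pointing IB_INCLUDE there and rerunning the tools build is one thing to try before giving up on OPENIB.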

Just Got Here
Compiling NWChem
Hi, I am compiling NWChem. In the "MPI variables needed to ..." step, I am confused about which MPI should be selected. Also, I cannot find the MPI path (for example, when I want to follow: setenv MPI_LIB <Your path to MPICH2 libraries>/lib). My system is not a single-processor machine, so these environment variables have to be defined.
Thanks,
Nazanin

Forum Vet
Quote:Nm448 Dec 5th 11:15 pm
Nazanin
What is the output you get on your computer after typing the following command?

mpif90 -show
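
The wrapper's output is useful here because the -I and -L flags it prints point at the MPI include and library directories, which are the values MPI_INCLUDE and MPI_LIB expect. Below is a minimal POSIX sh sketch for pulling those directories out of the printed compile line. extract_flags is a hypothetical helper, and the sketch assumes an MPICH-style wrapper that accepts -show (Open MPI wrappers use -showme instead).

```shell
# Print every directory that follows the given flag prefix in a compile line.
# $1 is the prefix (-I or -L); $2 is the text printed by the MPI wrapper.
extract_flags() {
  prefix=$1
  shift
  # Unquoted on purpose: split the compile line into words.
  for word in $*; do
    case $word in
      "$prefix"*) printf '%s\n' "${word#"$prefix"}" ;;
    esac
  done
}

# Usage:
#   line=$(mpif90 -show)
#   extract_flags -I "$line"   # candidate value(s) for MPI_INCLUDE
#   extract_flags -L "$line"   # candidate value(s) for MPI_LIB
```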

Just Got Here
Since I cannot find the path to MPI (MPICH or MPICH2), I get a "Your: No such file or directory" message after using the "setenv MPI_LIB <Your path to MPICH2 libraries>/lib" command.

Just Got Here
I am using all of these commands on my 64-bit Linux system.

Just Got Here
I have made an nwchem directory in workspace/mnazanin (workspace/mnazanin/nwchem) and untarred Nwchem-6.3.revision2-src.2013-10-17.tar.gz in the nwchem directory. However, I could find MPI in "/export/apps/pgi/linux86/9.0-3/EXAMPLES/MPI".

I used "setenv MPI_LIB export/apps/pgi/linux86/9.0-3/EXAMPLES/MPI/lib" and "setenv MPI_INCLUDE export/apps/pgi/linux86/9.0-3/EXAMPLES/MPI/include" commands.

In the "Building the NWChem binary" step, I used the "cd $NWCHEM_TOP/src" command and got a "workspace/mnazanin/nwchem/src: No such file or directory." message. However, I had used "setenv NWCHEM_TOP workspace/mnazanin/nwchem" to define the top directory of the NWChem source tree.
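
The paths in the commands above have no leading slash, so the shell resolves them relative to the current directory, which may explain the "No such file or directory" message. The following minimal POSIX sh sketch checks that a value is an absolute path and that the directory exists before it is used in setenv; check_dir is a hypothetical helper, not something shipped with NWChem.

```shell
# Report whether $1 is an absolute path to an existing directory.
check_dir() {
  case $1 in
    /*) ;;                                     # absolute, keep checking
    *) echo "not absolute: $1"; return 1 ;;
  esac
  if [ ! -d "$1" ]; then
    echo "missing: $1"
    return 1
  fi
  echo "ok: $1"
}

# Usage:
#   check_dir "$NWCHEM_TOP/src"
#   check_dir "$MPI_LIB"
#   check_dir "$MPI_INCLUDE"
```

Any variable reported as "not absolute" is worth setting again with the full path, starting from /.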

Gets Around
It's been a while, but I had no issues running NWChem on SGI ICE with ARMCI-MPI (http://wiki.mpich.org/armci-mpi/index.php/NWChem). I don't know if SGI's implementation of MPI-3 is generally available but I was using it without issues on their in-house systems almost a year ago.

Because SGI ICE is a shared-memory system, you can just build MPICH from source and the RMA performance will be just fine. Collectives not so much, but those are not even close to being a bottleneck.

If you want to follow up on the topic of ARMCI-MPI, please use http://wiki.mpich.org/armci-mpi/index.php/Main_Page#Mailing_Lists. I do not monitor this forum.

