Compiling for MPI.


Clicked A Few Times
I have been working for a few months with NWChem on a workstation, but now I need to ramp up the size of my simulations. I have been trying, unsuccessfully, to compile NWChem 6 with MPI for a while now. Any advice on resolving this would be appreciated.
The environment variables I have set for compiling (the combination with the greatest success so far) are:
export NWCHEM_TARGET=LINUX64
export NWCHEM_TOP=~/nwchem/nwchem-6.0/
export NWCHEM_MODULES=all
export LARGE_FILES=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"

export CC=gcc
export FC=gfortran

export USE_MPI=y
export USE_MPIF=y
export MPI_LOC=/usr/local/mvapich2-1.6-gcc
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpich"

The compilation succeeds with gcc/gfortran, although switching everything over to the corresponding Intel compilers and modules consistently errors out. The cluster runs Scientific Linux over InfiniBand with either MVAPICH2 or OpenMPI, gcc/gfortran 4.4.5, and GNU Make 3.81.
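As a sanity check on the toolchain, MVAPICH2's compiler wrappers are MPICH-derived and accept -show, which prints the underlying compile/link line they would use:

$MPI_LOC/bin/mpicc -show
$MPI_LOC/bin/mpif90 -show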
The output when I run is
[davis68@taub302 uo2-work]$ mpiexec ~/bin/nwchem lda-147.nw 
ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP Sockets'.
-10012:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10012 hostname:taub448 pid:24214):ARMCI DASSERT fail. sockets.c:armci_AcceptSockAll():635 cond:0
12:Child process terminated prematurely, status=: 256
(rank:12 hostname:taub448 pid:24188):ARMCI DASSERT fail. signaltrap.c:SigChldHandler():167 cond:0
ARMCI master: wait for child process (server) failed:: No child processes
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 12
-10000:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10000 hostname:taub302 pid:21006):ARMCI DASSERT fail. sockets.c:armci_AcceptSockAll():635 cond:0
0:Child process terminated prematurely, status=: 256
(rank:0 hostname:taub302 pid:20981):ARMCI DASSERT fail. signaltrap.c:SigChldHandler():167 cond:0
ARMCI master: wait for child process (server) failed:: No child processes
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0

There is a long wait after the first line, "ARMCI configured for 2 cluster nodes...", before the other messages appear.

Forum Vet
Please carefully read the INSTALL file, specifically the section about OpenIB. You need to specify ARMCI_NETWORK and the location of the IB libraries.

Bert


Clicked A Few Times
OK, so I have switched to specifying ARMCI_NETWORK, as per Bert's advice, with the environment variables
export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX=256
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-libverbs -libumad -lpthread"


This is mostly successful. Execution on two nodes yields the following output.
ARMCI configured for 2 cluster nodes. Network protocol is 'OpenIB Verbs API'.
 argument  1 = lda-147.nw



============================== echo of input deck ==============================
...
normal output for initial processing
...


NWChem correctly detects that there are 24 processors (2 nodes x 12), so the program is getting the MPI configuration from the system (great!). Then it crashes with an ARMCI DASSERT failure. The errors that appear follow (in order, I think, though stderr from the nodes is interleaved). This happens immediately as a PSPW geometry optimization starts.
          *               NWPW PSPW Calculation              *
...
     >>>  JOB STARTED       AT Fri Dec 23 14:18:03 2011  <<<
          ================ input data ========================
 Pack_init:error pushing stack        0
 ------------------------------------------------------------------------
 ------------------------------------------------------------------------
  current input line : 
     0: 
...
Last System Error Message from Task X:: No such file or directory
(rank:X hostname:taub510 pid:18040):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
application called MPI_Abort(MPI_COMM_WORLD, 0) - process X
0:Terminate signal was sent, status=: 15

To make sure this wasn't the fault of my .nw file, I tried the PSPW example for C2H6, with the same results. What would you suggest to get past this impasse? Thanks.

Forum Vet
Could you send me the complete input and output files at bert.dejong@pnnl.gov? Also, can you tell me how much memory you have per node (each of which has 12 processors, I see)?
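Per-node memory can be checked on a compute node with standard Linux tools, for example:

grep MemTotal /proc/meminfo
free -g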

Bert


Forum Vet
The user specified 16 GB in the input deck; note that the memory given in the input is per processor, not per node.
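For illustration only (the actual node memory was not stated in the thread): on a 12-core node with, say, 48 GB of RAM, each process can claim at most about 4 GB, so the memory line in the .nw input deck should look more like

memory 3500 mb

than a 16 GB request per process.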

Bert



Forum Vet
When compiling, it is recommended to set the following three environment variables:

export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y

By adding the third environment variable and recompiling, the user (Neal) was able to run successfully.
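For reference, a rebuild along the standard NWChem 6.x lines should pick up the new setting (a make clean under $NWCHEM_TOP/src may be needed after changing build variables):

export USE_MPIF4=y
cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES=all
make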

Bert



Clicked A Few Times
The solution in this case turned out to be adding another environment variable, USE_MPIF4=y. Thanks, Bert.
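For anyone hitting the same wall, the complete set of build variables that ended up working in this thread (compiler, MPI, and IB paths are site-specific) was:

export NWCHEM_TARGET=LINUX64
export NWCHEM_TOP=~/nwchem/nwchem-6.0/
export NWCHEM_MODULES=all
export LARGE_FILES=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export CC=gcc
export FC=gfortran
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/local/mvapich2-1.6-gcc
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpich"
export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX=256
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-libverbs -libumad -lpthread"

And remember that the memory directive in the input deck is per process, not per node.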

