error running nwchem openmpi in parallel


Clicked A Few Times
Hi All

I am trying to run the nwchem_openmpi executable in parallel on Linux (CentOS 7.1 in my case).
I am using the nwchem-openmpi-6.6.27746-22.el7.x86_64 package.
I have installed mpirun (Open MPI) version 1.10.0 and am running the command:

mpirun -np 2  nwchem_openmpi geo.in > geo.out
I get the following error:
[Cuba:04638] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 367
[Cuba:04582] tcp_peer_recv_connect_ack: invalid header type: -882573312

geo.out:
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[52069,1],1]
Exit code: 1
--------------------------------------------------------------------------

Am I using the correct versions and syntax for parallel execution?
Which version of Open MPI was the nwchem_openmpi executable compiled with?

Forum Vet
Rasmus
nwchem_openmpi might be broken on CentOS 7/EPEL for the same reason that ELPA is broken.
Anyhow, are you sure that you are using the Open MPI mpirun and not the MPICH one?
What is the output of the command
which mpirun
?
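
A slightly fuller version of that check, as a minimal sketch: it prints the resolved launcher path and the first line of its version banner. The helper name describe_launcher is made up for illustration, and the exact banner text differs between Open MPI and MPICH, so treat the second line of output as a hint rather than a guarantee.

```shell
# Report which launcher the shell resolves and the first line of its banner.
# describe_launcher is a hypothetical helper name, not part of any MPI package.
describe_launcher() {
    launcher=$(command -v "${1:-mpirun}") || { echo "not found"; return 1; }
    echo "path: $launcher"
    # Open MPI typically prints "mpirun (Open MPI) x.y.z" here;
    # other MPI implementations print their own build details instead.
    "$launcher" --version 2>&1 | head -n 1
}

# Do not abort the shell session if no mpirun is installed.
describe_launcher mpirun || true
```

If the reported path is not under /usr/lib64/openmpi/bin, the launcher probably belongs to a different MPI installation than the nwchem_openmpi binary.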

This is what I get (and it does not look promising):
/usr/lib64/openmpi/bin/mpirun  -np 2 /usr/lib64/openmpi/bin/nwchem_openmpi 
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
/usr/lib64/openmpi/bin/nwchem_openmpi: error while loading shared libraries: libmpi.so.1: cannot open shared object file: No such file or directory
/usr/lib64/openmpi/bin/nwchem_openmpi: error while loading shared libraries: libmpi.so.1: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[55611,1],0]
  Exit code:    127
--------------------------------------------------------------------------
[edo@localhost ~]$
$ ls -l /usr/lib64/openmpi/lib/libmpi.so.*
lrwxrwxrwx. 1 root root     16 Mar 30 10:57 /usr/lib64/openmpi/lib/libmpi.so.12 -> libmpi.so.12.0.0
-rwxr-xr-x. 1 root root 863512 Nov 20 10:46 /usr/lib64/openmpi/lib/libmpi.so.12.0.0
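
The mismatch above (the binary wants libmpi.so.1, but only libmpi.so.12 is installed) can also be spotted without launching a job. A rough sketch, assuming ldd is available; missing_libs is a hypothetical helper name that filters ldd output for unresolved libraries:

```shell
# Print the name of every shared library that ldd reports as "not found".
# Reads ldd output from stdin so it can also be tried on saved output.
missing_libs() {
    awk '/not found/ { print $1 }' | sort -u
}

# Usage with the binary path quoted above; skipped if the file is absent.
binary=/usr/lib64/openmpi/bin/nwchem_openmpi
if [ -x "$binary" ]; then
    ldd "$binary" | missing_libs
fi
```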

Clicked A Few Times
Hi Edo

I am using mpirun with nwchem_openmpi, which gives the error in the top post. I also tried mpirun with nwchem_mpich, which gives the following error:
Fatal error in PMPI_Errhandler_set: Invalid communicator, error stack:
PMPI_Errhandler_set(117): MPI_Errhandler_set(comm=0x5651f5a0, errh=0x565202c0) failed
PMPI_Errhandler_set(70).: Invalid communicator
Fatal error in PMPI_Errhandler_set: Invalid communicator, error stack:
PMPI_Errhandler_set(117): MPI_Errhandler_set(comm=0xc11955a0, errh=0xc11962c0) failed
PMPI_Errhandler_set(70).: Invalid communicator
Fatal error in PMPI_Errhandler_set: Invalid communicator, error stack:
PMPI_Errhandler_set(117): MPI_Errhandler_set(comm=0xbbb8f5a0, errh=0xbbb902c0) failed
PMPI_Errhandler_set(70).: Invalid communicator


and the output file contained the following:
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[52842,1],3]
  Exit code:    1
--------------------------------------------------------------------------


When I run ldd on the nwchem_mpich and nwchem_openmpi binaries, I see that they use both libmpi.so.1 and libmpi.so.12, which is what caused the ELPA error for the binary I compiled myself. Removing ELPA has solved the problem so far for my own build. However, the problem should be fixed for the repository packages as well.
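
For anyone who wants to repeat the ldd check, here is a minimal sketch that lists the distinct libmpi versions a binary pulls in. The helper name libmpi_sonames is made up for illustration, and the binary path is the one quoted earlier in the thread; adjust it for your install. It relies on grep -o, which GNU and BSD grep both provide.

```shell
# Print the distinct libmpi sonames found in ldd output read from stdin.
libmpi_sonames() {
    grep -o 'libmpi\.so\.[0-9][0-9]*' | sort -u
}

# Usage with the binary path quoted earlier; skipped if the file is absent.
binary=/usr/lib64/openmpi/bin/nwchem_openmpi
if [ -x "$binary" ]; then
    ldd "$binary" | libmpi_sonames
fi
```

More than one line of output means the binary, directly or through one of its dependencies, references more than one libmpi version.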

