Compiling NWChem 6.3 with Myrinet MX


Just Got Here
At our site, we have not been able to build a working binary with Open MPI or MPICH using any ARMCI_NETWORK setting other than MPI-TS.

The MPI-SPAWN and MPI-MT ARMCI_NETWORK options fail immediately at launch. MPI-TS would be acceptable, but its performance is about 5-6x slower than with previous versions of NWChem. NWChem compiled fine on our InfiniBand cluster.
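
For context, a build environment along the lines described above can be sketched as follows. NWCHEM_TOP, NWCHEM_TARGET, USE_MPI, ARMCI_NETWORK and NWCHEM_MODULES are standard NWChem build variables; the source path and the module selection are placeholders assumed for illustration, not the settings from the failing build.

```shell
# Hypothetical build environment for NWChem 6.3.
# The path and module list below are assumptions, not taken from the actual build.
export NWCHEM_TOP="$HOME/nwchem-6.3"   # placeholder source location
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=all
export USE_MPI=y
# MPI-TS is the only setting reported to give a working binary;
# MPI-SPAWN and MPI-MT are the values that fail at launch.
export ARMCI_NETWORK=MPI-TS

# Build steps, left commented out so this fragment only sets variables:
#   cd "$NWCHEM_TOP/src" && make nwchem_config && make
```

Switching ARMCI_NETWORK means rebuilding the Global Arrays/ARMCI layer, so a clean rebuild is the safer choice when comparing settings.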

Any advice on getting NWChem 6.3 compiled with Myrinet MX?

Forum Vet
Michael
What ARMCI_NETWORK value did you use in previous versions of NWChem for your Myrinet MX network?

Just Got Here
Quote: Edoapra, Feb 6th 9:24 am
Michael
What ARMCI_NETWORK value did you use in previous versions of NWChem for your Myrinet MX network?


MPI-SPAWN

Forum Vet
Michael
Could you please post the full output and error files for the MPI-SPAWN failure?
Thanks, Edo

Just Got Here
Thanks. Unfortunately, there isn't much output. Any help with debugging is appreciated. It just crashes on start. Output below:

argument  1 = input.in


There are no allocated resources for the application
 ./nwchem.spawn
that match the requested mapping:
 

Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.




A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.




mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.




MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 15.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.


0:Terminate signal was sent, status=: 15
(rank:0 hostname:cluster3-30.chpc.ndsu.nodak.edu pid:10842):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0
Last System Error Message from Task 4:: No such file or directory
Last System Error Message from Task 5:: No such file or directory
Last System Error Message from Task 6:: No such file or directory
Last System Error Message from Task 7:: No such file or directory
Last System Error Message from Task 0:: No such file or directory
Last System Error Message from Task 1:: No such file or directory
Last System Error Message from Task 2:: No such file or directory
Last System Error Message from Task 3:: No such file or directory
[cluster3-30.chpc.ndsu.nodak.edu:10841] 6 more processes have sent help message help-mpi-api.txt / mpi-abort
[cluster3-30.chpc.ndsu.nodak.edu:10841] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
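
Taking the Open MPI hints in the output above at face value, two things seem worth checking first: whether the binary's shared libraries resolve, and whether the launch is given an explicit host list. A minimal sketch of those checks follows. The hostfile name `hosts` is hypothetical, the process count of 8 is inferred from Tasks 0-7 in the log, and nothing here is confirmed to be the cause of the failure.

```shell
# List shared libraries that the dynamic loader cannot resolve for a binary.
# Prints nothing when every dependency is found (or when ldd is unavailable).
missing_libs() {
    ldd "$1" 2>/dev/null | grep "not found" || true
}

# Check the binary named in the log (ideally on a compute node as well):
#   missing_libs ./nwchem.spawn
#
# Relaunch with an explicit hostfile and forward LD_LIBRARY_PATH to the
# remote nodes. "hosts" is a hypothetical hostfile name.
#   mpirun -np 8 --hostfile hosts -x LD_LIBRARY_PATH ./nwchem.spawn input.in
```

If `missing_libs` prints nothing on both the head node and a compute node, the library hint is probably a red herring and the resource-mapping message is the more likely lead.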

Forum Vet
MPI-MT
Do you get similar behavior with MPI-MT?
Edo

Just Got Here
Yes, MPI-MT and MPI-SPAWN failed immediately on launch without any meaningful diagnostics. We've tried different MPI implementations with similar results.

I'm willing to do some debugging on my end given some suggestions on where to start, but I was wondering whether this is a known issue or whether there is something a more experienced NWChem user would recommend trying first.

Thanks.

Forum Vet
OpenMPI version
Michael
I will try to reproduce your problem (not on a Myrinet MX system, unfortunately).
Which version of OpenMPI have you been using?
Edo

