NWChem 6.1.1 compilation failed on InfiniBand network


Never mind. Successfully recompiled and linked!! The code (nwchem 6.1.1) now works with a small test case as expected. Thank you very much.

Anyway, I compiled Open MPI with the following configuration:

./configure --prefix=/home/tpirojsi/MPI-1.4.3-psm-2.9 --with-openib=/usr --with-openib-libdir=/usr/lib64 --with-psm=/home/tpirojsi/psm-install-2.9/usr --with-psm-libdir=/home/tpirojsi/psm-install-2.9/libraries --enable-mpi-threads

The NWChem source code then built and linked successfully, as suggested above. However, on a real problem I have run into two main issues. The first is network connectivity. The code built against Open MPI configured as described above can run jobs, but most of the time I see the following error:

ipath_userinit: assign_context command failed: Network is down

I searched for solutions to this problem. Some people suggested that the PSM_SHAREDCONTEXTS_MAX variable should be set properly. I tried that, but the problem persists (jobs still randomly run or die). Without infinipath-psm the error goes away, but the performance is unreliable. For example, I ran QM dynamics, equilibrating the system for 5 ps (5000 steps) followed by 1000 data-gathering steps, on 24 CPUs (2 nodes with 12 CPUs each). Over multiple runs, the wall times ranged from 20 minutes to an hour or more. This looks to me like a message-passing problem. With infinipath-psm the performance is quite consistent, but the above error occurs often.
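For reference, here is a rough sketch of the two launch variants being compared, written as a dry run that only prints the command lines. It assumes Open MPI's `-x` option for forwarding an environment variable to all ranks and its `--mca` options for choosing the PSM MTL or the openib BTL. The context limit of 6, the rank count variable, and the input file name `input.nw` are placeholders I made up for illustration, not values from my actual runs.

```shell
#!/usr/bin/env bash
# Dry run: builds and prints the two mpirun command lines; nothing is launched.
NP=24                 # total MPI ranks (2 nodes x 12 CPUs, as in the runs above)
INPUT=input.nw        # hypothetical NWChem input file name
PSM_CTX=6             # illustrative value only; depends on the HCA and ranks per node

# Variant 1: PSM path. -x forwards PSM_SHAREDCONTEXTS_MAX to every rank.
psm_cmd="mpirun -np $NP -x PSM_SHAREDCONTEXTS_MAX=$PSM_CTX --mca pml cm --mca mtl psm nwchem $INPUT"

# Variant 2: bypass PSM and use the verbs (openib) BTL plus shared memory.
verbs_cmd="mpirun -np $NP --mca pml ob1 --mca btl openib,sm,self nwchem $INPUT"

echo "$psm_cmd"
echo "$verbs_cmd"
```

Forwarding the variable with `-x` matters because setting it only in the login shell does not guarantee that the ranks on the second node see it.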

The second issue is scalability. For some reason, I cannot get any scaling beyond 24 CPUs. My work involves QM/MM calculations, and I'm not sure whether this is a common issue for this kind of calculation or whether other factors are involved.
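To put a number on "scaling", one option is to compare wall times at two CPU counts and compute the speedup and parallel efficiency relative to the smaller run. The helper below is a minimal sketch; the function name `efficiency` is my own, and the wall times in the example call (60 and 40 minutes) are made-up placeholders, not measurements from my jobs.

```shell
#!/usr/bin/env bash
# efficiency T_REF N_REF T_NEW N_NEW
# Prints the speedup of the new run over the reference run, and the parallel
# efficiency, i.e. speedup divided by the increase in CPU count.
efficiency() {
  awk -v t0="$1" -v n0="$2" -v t1="$3" -v n1="$4" 'BEGIN {
    speedup = t0 / t1
    printf "speedup=%.2f efficiency=%.2f\n", speedup, speedup / (n1 / n0)
  }'
}

# Hypothetical example: 60 min on 24 CPUs versus 40 min on 48 CPUs.
efficiency 60 24 40 48
```

An efficiency near 1 means the extra CPUs are being used well. A value that drops sharply when going past 24 CPUs would be consistent with what I am seeing.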

Quote:Tpirojsi Nov 14th 1:28 am
Do I need to recompile source code too? Or are these steps for linking the compiled source code to ga-5-2?