NWChem 6.1.1 compilation failed on InfiniBand network


Clicked A Few Times
Hi,

First of all, I realize that release 6.3 of NWChem is now available, and I recently compiled and installed it successfully on a cluster with an InfiniBand network, using the configuration script from this thread: http://nwchemgit.github.io/Special_AWCforum/st/id1001/Errors_Running_Nwchem_6.3.ht.... My work mostly involves the qmmm modules, and most of them work fine in this version. However, for some reason I cannot use the 'qmmm fep' module; I always get the following error:

qmmm_bq_data_load Failed bq_pset 0

This also seems to happen with NWChem 6.3 on chinook, and even with the latest release (Nwchem-src-2013-10-16). When I switched to NWChem 6.1.1, the job ran without any problem. For this reason, I decided to install version 6.1.1 on the cluster mentioned above as well. The code compiled successfully with the same configuration script I used for release 6.3, but it fails to run a job. This is what I always get when setting ARMCI_NETWORK=OPENIB:

argument  1 = neb-0.nw
0: client RTR->RTS i=0 rc=22
12: client RTR->RTS i=0 rc=22
(rank:12 hostname:compute-0-40.local pid:3018):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:vapi_connect_client():1460 cond:!rc

However, I managed to compile this version and run jobs (including the 'qmmm fep' module) by setting ARMCI_NETWORK=MPI-MT. Please note that I have read this thread http://nwchemgit.github.io/Special_AWCforum/st/id554/Infiniband_Install.html and compiled the code with an MPI implementation that supports MPI_THREAD_MULTIPLE, but the performance is still poorer than running the same small test with release 6.3. For example, the job finishes within 30 s with NWChem 6.3 but takes ~100 s with NWChem 6.1.1.


If anyone has any idea how to fix either of these two problems (getting the 'qmmm fep' module to work in NWChem 6.3, or improving the performance of NWChem 6.1.1), I would greatly appreciate it.

Forum Vet
Tee,
Could you please give more details about your network hardware?
Are you using QLogic HW?

Edo

Clicked A Few Times
Edo, yes I use QLogic HW.

Quote:Edoapra Nov 14th 12:10 am
Tee,
Could you please give more details about your network hardware?
Are you using QLogic HW?

Edo

Forum Vet
Tee
The tools shipped with NWChem 6.1.1 (ga-5-1) are incompatible with QLogic HW when ARMCI_NETWORK=OPENIB is used.

The tools shipped with NWChem 6.3 (ga-5-2), on the other hand, are compatible with QLogic HW.

You can try to link NWChem 6.1.1 with the ga-5-2 tools

Here are the instructions

  1. cd $NWCHEM_TOP/src/tools
  2. copy here the ga-5-2 directory from the nwchem-6.3/src/tools tree
  3. compile by typing make GA_DIR=ga-5-2
  4. relink by typing cd $NWCHEM_TOP/src; make link
  5. if the link step reports undefined symbols (mostly related to ARMCI), define (or redefine) the BLASOPT environment variable so that it includes -larmci, e.g.
BLASOPT=-larmci
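
The steps above can be sketched as a small script. This is only an illustration under stated assumptions: NWCHEM63_TOP is a hypothetical variable pointing at an unpacked nwchem-6.3 tree, the /path/to/... defaults are placeholders, and run and relink_with_ga52 are helper names made up for this sketch. By default the commands are only printed; nothing is built unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of relinking NWChem 6.1.1 against the ga-5-2 tools.
# NWCHEM63_TOP is a hypothetical variable (not part of the NWChem build system).
# Commands are only printed unless DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"
NWCHEM_TOP="${NWCHEM_TOP:-/path/to/nwchem-6.1.1}"
NWCHEM63_TOP="${NWCHEM63_TOP:-/path/to/nwchem-6.3}"

# Print a command, and execute it only when dry-run mode is switched off.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" || exit 1
    fi
}

relink_with_ga52() {
    run cd "$NWCHEM_TOP/src/tools"                       # step 1
    run cp -r "$NWCHEM63_TOP/src/tools/ga-5-2" .         # step 2
    run make GA_DIR=ga-5-2                               # step 3
    run cd "$NWCHEM_TOP/src"                             # step 4
    run make link
    # Step 5 applies only if "make link" reports undefined ARMCI symbols:
    # make sure BLASOPT includes -larmci (e.g. BLASOPT=-larmci) and relink.
}

relink_with_ga52
```

Reviewing the printed commands first and then rerunning with DRY_RUN=0 keeps the existing 6.1.1 build untouched until the paths have been checked.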

Clicked A Few Times
Do I need to recompile source code too? Or are these steps for linking the compiled source code to ga-5-2?

Clicked A Few Times
Never mind. Successfully recompiled and linked!! The code (nwchem 6.1.1) now works with a small test case as expected. Thank you very much.

Anyway, I compiled openmpi with the following configuration:

./configure --prefix=/home/tpirojsi/MPI-1.4.3-psm-2.9 --with-openib=/usr --with-openib-libdir=/usr/lib64 --with-psm=/home/tpirojsi/psm-install-2.9/usr --with-psm-libdir=/home/tpirojsi/psm-install-2.9/libraries --enable-mpi-threads
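
Because ARMCI_NETWORK=MPI-MT depends on MPI_THREAD_MULTIPLE, it can be worth confirming that an Open MPI build configured this way really reports thread support. A minimal sketch, assuming that ompi_info prints a "Thread support" line; the exact wording differs between Open MPI versions, so the two patterns below are an assumption about the common forms, and has_mpi_threads is a hypothetical helper name.

```shell
#!/bin/sh
# has_mpi_threads: hypothetical helper. Reads ompi_info output on stdin and
# succeeds if the "Thread support" line reports MPI thread support as "yes".
# The matched line formats are assumptions and may need adjusting per version.
has_mpi_threads() {
    grep -i 'thread support' | grep -Eiq '(mpi|MPI_THREAD_MULTIPLE): *yes'
}

# Query the real installation only when ompi_info is on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_mpi_threads; then
        echo "MPI thread support: yes"
    else
        echo "MPI thread support: not reported"
    fi
fi
```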

The nwchem source code was then built and linked successfully as suggested above. However, when applying it to my real problem, I have run into two main issues. The first is about network connectivity. The code compiled against the openmpi build configured as described above can run jobs, but most of the time I see the following error:

ipath_userinit: assign_context command failed: Network is down

I searched for solutions to this problem. Some said the PSM_SHAREDCONTEXTS_MAX variable should be set properly. I tried that, but the problem persists (jobs still randomly run or die). Without infinipath-psm this problem goes away, but the performance is unreliable. For example, I ran QM dynamics, equilibrating the system for 5 ps (5000 steps) followed by 1000 data-gathering steps, on 24 CPUs (2 nodes with 12 CPUs each). Over multiple runs, the wall times ranged from 20 minutes to an hour or more. This looks like a message-passing problem to me. With infinipath-psm, on the other hand, the performance is quite consistent, but the above error occurs often.
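
For comparing the two transports described above, the launch lines might look roughly like the following. This is a sketch, not a tested recipe for this cluster: it only prints the mpirun commands, launch_cmd is a hypothetical helper, and NPROCS, INPUT and PSM_CTX are illustration values (the thread does not say which PSM_SHAREDCONTEXTS_MAX value was tried, so the default here is a placeholder). The MCA selections used (pml cm with mtl psm, and pml ob1 with btl openib,sm,self) are standard Open MPI options, but whether they suit this installation is an assumption.

```shell
#!/bin/sh
# Sketch only: prints two mpirun command lines instead of running them.
# NPROCS, INPUT and PSM_CTX are hypothetical illustration values.
NPROCS="${NPROCS:-24}"
INPUT="${INPUT:-input.nw}"
PSM_CTX="${PSM_CTX:-12}"   # placeholder, not a recommended setting

# launch_cmd psm    -> use the PSM transport and export PSM_SHAREDCONTEXTS_MAX
# launch_cmd openib -> bypass PSM and use the openib BTL instead
launch_cmd() {
    case "$1" in
        psm)
            echo "mpirun -np $NPROCS --mca pml cm --mca mtl psm -x PSM_SHAREDCONTEXTS_MAX=$PSM_CTX nwchem $INPUT"
            ;;
        openib)
            echo "mpirun -np $NPROCS --mca pml ob1 --mca btl openib,sm,self nwchem $INPUT"
            ;;
        *)
            echo "usage: launch_cmd psm|openib" >&2
            return 1
            ;;
    esac
}

launch_cmd psm
launch_cmd openib
```

Pinning the transport explicitly on the command line makes it easier to tell which of the two behaviours (the assign_context error or the fluctuating wall times) belongs to which configuration.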

The second issue is about scalability. For some reason, I cannot get any scaling beyond 24 CPUs. My work involves QMMM-based calculations, and I'm not sure whether this is a common issue for this kind of calculation or whether other factors are involved.
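
One way to put a number on the scaling question is the parallel efficiency relative to the 24-CPU run, (T_ref * P_ref) / (T * P), where values near 1 mean good scaling. A minimal sketch follows; parallel_efficiency is a hypothetical helper, and the timings in the example call are invented for illustration, not measurements from this thread.

```shell
#!/bin/sh
# parallel_efficiency P_REF T_REF P T
# Efficiency of a run on P cores (wall time T) relative to a reference run
# on P_REF cores (wall time T_REF): (T_REF * P_REF) / (T * P).
parallel_efficiency() {
    awk -v p0="$1" -v t0="$2" -v p="$3" -v t="$4" \
        'BEGIN { printf "%.2f\n", (t0 * p0) / (t * p) }'
}

# Hypothetical timings in seconds (NOT from this thread):
# 1200 s on 24 cores versus 1100 s on 48 cores.
parallel_efficiency 24 1200 48 1100   # prints 0.55
```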

Quote:Tpirojsi Nov 14th 1:28 am
Do I need to recompile source code too? Or are these steps for linking the compiled source code to ga-5-2?

Forum Vet
Tee
The NWChem group doesn't have access to any QLogic based cluster, therefore we cannot be of any help other than the instructions I passed you a few days ago.
Edo

Clicked A Few Times
I have heard that the cluster I'm working on has some issues with network connectivity. That might explain why NWChem doesn't run as efficiently as expected.

I'm truly sorry for any trouble or inconvenience I might have caused. I truly appreciate your help and patience.

Best,
Tee

Quote:Edoapra Nov 18th 2:43 am
Tee
The NWChem group doesn't have access to any QLogic based cluster, therefore we cannot be of any help other than the instructions I passed you a few days ago.
Edo

