NWChem computation: multi-CPU run never converges

Dear all,

I'm trying to resolve a problem with a computation of one of our users. I've prepared an NWChem 6.0 installation (64-bit systems, Debian 6.0) for him, compiled with the GNU compilers (version 4.3.2) and MPI support (OpenMPI 1.4.3), using the following options:

export NWCHEM_TOP=/software/NWChem-6.0/source/nwchem-6.0-src/
export NWCHEM_MODULES="pnnl"

export PYTHONHOME=$NWCHEM_TOP/../python-2.6.2/
export USE_PYTHON64=y

export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
export MPI_LIB=/software/openmpi-1.4.3/lib/
export MPI_INCLUDE=/software/openmpi-1.4.3/include/


export FC=gfortran

NWChem compiled successfully, and a single-CPU run of the following computation finishes successfully. BUT a multi-CPU run never converges (within the CCSD iterations): it crashes with "maxiter exceeded" and subsequent "ARMCI DASSERT fail (260)" messages. The single-CPU run converges within 20 iterations, whereas the multi-CPU run does not converge even after 1000 iterations.

The computation:

start test

memory 7000 mb

geometry units angstroms
S 0.000000 0.000000 0.000000
S 0.000000 0.000000 1.912540
Cl 1.961192 0.000000 2.612737

 S library aug-cc-pVDZ
Cl library aug-cc-pVDZ


io replicated
tilesize 15
freeze core atomic
nroots 2
targetsym a'
thresh 1.0d-5

task tce energy

I've tried various compilers (GNU, Intel) and various MPI libraries (OpenMPI 1.4.3, MPICH2 1.4.1p1). Most other computations finish successfully (whether run on a single processor or on several), but the above computation behaves equally (badly) in every case.

Could somebody please give me a hint on how to resolve the issue? I'm out of ideas... ;-(

Thanks a lot in advance!
Tom Rebok, Czech Republic.

PS: The build log is available here: make.log

I do not believe this is a compile issue but rather a run issue. How much memory do you have per processor (not per node, but per core)? The memory keyword in NWChem is per core. Could you try running on multiple processors with the memory keyword set to maybe 1000 mb at most, to see if this works?

What is the hardware you are running on that forces you to use MPI-SPAWN?



Dear Bert,

first of all, sorry for my late response (I was on vacation).

I run the computation on various nodes in our computing infrastructure, ranging from smaller nodes (Dual Core AMD Opteron 885, 16 cores and 64GB of memory) to SMP nodes (Intel Xeon E7 4860, 80 cores, 512GB of memory). On each of these clusters I had 4 cores and 50GB of memory reserved on a single node. No matter which machine I use, all the computations fail in the same fashion (as described initially). I've also tried specifying just 1GB of memory, as you suggested; this resulted in the same error.
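
Since the memory keyword is per core, the total request is the per-core value times the number of processes. A minimal sketch of that arithmetic follows; the variable names are made up for illustration, one process per reserved core is assumed, and 50GB is assumed to mean 50 * 1024 MB.

```shell
# All variable names are illustrative; the numbers come from the posts above.
reserved_mb=$(( 50 * 1024 ))   # 50GB reserved on the node (assuming 1GB = 1024 MB)
nprocs=4                       # cores reserved; one process per core is assumed
per_proc_mb=7000               # value of the "memory" line in the input

total_mb=$(( per_proc_mb * nprocs ))   # what all processes may claim together
limit_mb=$(( reserved_mb / nprocs ))   # upper bound for a per-process setting

echo "total requested: ${total_mb} MB of ${reserved_mb} MB reserved"
echo "per-process upper bound: ${limit_mb} MB"
```

With these numbers the 7000 mb setting stays within the reservation, which is consistent with the 1GB attempt making no difference.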

To illustrate, here are run logs from the SMP node (80 cores, 512GB of memory), where the computation had 4 CPUs and 50GB of memory reserved by our scheduling system: the single-CPU (successful) run in run-single.out and the multi-CPU (failing) run in run-multi.out (the failing convergence is visible starting at line no. 1031).

Since our infrastructure is quite heterogeneous, some clusters have an InfiniBand interconnect and some do not. I've therefore decided to use MPI-SPAWN, so that a single build is runnable on all our machines. I hope this reasoning is correct...

Thanks very much for any advice.
Tom Rebok, MetaCentrum NGI, Czech Republic.

PS: Is there anybody who could try running the computation above in parallel and let me know whether it converges?

PS2: Maybe it is somehow related to the problem described here...

Just ran it with 16 processors. I would remove the "io replicated" from the input (that's what I did).
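
One way to apply that change without editing the input by hand is sketched below. The helper name and the file names are hypothetical, and the launch line in the comment assumes a plain `mpirun` start, which may differ on a given site.

```shell
# Hypothetical helper: copy an NWChem input from stdin to stdout,
# dropping any line that consists only of the "io replicated" directive.
strip_io_replicated() {
  grep -v -E '^[[:space:]]*io[[:space:]]+replicated[[:space:]]*$'
}

# Usage (file names are placeholders):
#   strip_io_replicated < test.nw > test-noio.nw
#   mpirun -np 16 nwchem test-noio.nw
```

All other lines of the input pass through unchanged.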

Also, I seriously doubt you can build one binary that will work on all platforms. Also, we have never tested running MPI-SPAWN between heterogeneous nodes on different clusters.


Dear Bert,

thanks a lot! That works!!! :-)

May I ask for a short explanation of what the "io replicated" option influences? Why does it break this type of computation? And when is it (or a different io scheme) suitable to use?

I see. But our clusters are not different platforms: they are all "x86_64" architecture (with OS/SW equipment as uniform as possible), so I think such a binary should work for us. If not, I'll solve it once such a problem appears...



