NWChem computation: multi-CPU run never converges


Forum Vet
I do not believe this is a compile issue but rather a run issue. How much memory do you have per processor (not per node, but per core)? The memory keyword in NWChem is per core. Could you try running on multiple processors with the memory keyword set to at most 1000 mb, to see if this works?
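Because the memory keyword applies to each core, the `memory 7000 mb` line in the quoted input asks for 7000 MB for every process, so a 4-process run would request about 28000 MB in total. As a rough sketch of how one might pick a per-core value instead, the helper below divides a node's memory evenly among the processes on it. The function name `per_core_mb`, the 1024 MB headroom, and the example node size are assumptions for illustration only and are not part of NWChem.

```shell
#!/bin/sh
# Hypothetical helper (not part of NWChem): split a node's memory, in MB,
# evenly across the MPI processes placed on that node.
per_core_mb() {
    node_mb=$1              # total memory on the node, in MB
    procs=$2                # number of MPI processes on the node
    reserve_mb=${3:-1024}   # headroom left for the OS (assumed default)
    echo $(( (node_mb - reserve_mb) / procs ))
}

# Example with assumed numbers: a 16384 MB node running 8 processes.
per_core_mb 16384 8
```

The printed number is an upper bound for the value given to the memory keyword in the input file; the 1000 mb suggested above is a safe starting point if it fits under that bound.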

What is the hardware you are running on that forces you to use MPI-SPAWN?

Thanks,

Bert


Quote:Jeronimo Jul 20th 2:14 pm
Dear all,

I'm trying to resolve a problem with one of our users' computations. I've prepared an NWChem 6.0 installation (64-bit system, Debian 6.0) for him, compiled with the GNU compilers (version 4.3.2) and MPI support (OpenMPI 1.4.3), using the following options:


export NWCHEM_TOP=/software/NWChem-6.0/source/nwchem-6.0-src/
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="pnnl"

export PYTHONHOME=$NWCHEM_TOP/../python-2.6.2/
export PYTHONVERSION=2.6
export USE_PYTHON64=y

export ARMCI_NETWORK=MPI-SPAWN
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
export MPI_LIB=/software/openmpi-1.4.3/lib/
export MPI_INCLUDE=/software/openmpi-1.4.3/include/

export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE

export FC=gfortran


NWChem compiled successfully, and the single-CPU version of the computation below finishes without problems. BUT the multi-CPU version never converges within the CCSD iterations: it crashes with "maxiter exceeded" followed by "ARMCI DASSERT fail (260)" messages. The single-CPU run converges within 20 iterations, while the multi-CPU run does not converge even after 1000 iterations.

The computation:

start test

memory 7000 mb

geometry units angstroms
S 0.000000 0.000000 0.000000
S 0.000000 0.000000 1.912540
Cl 1.961192 0.000000 2.612737
end

basis
 S library aug-cc-pVDZ
Cl library aug-cc-pVDZ
end

scf
doublet
rohf
end

tce
scf
ccsd
io replicated
tilesize 15
freeze core atomic
nroots 2
targetsym a'
symmetry
dipole
thresh 1.0d-5
end

task tce energy


I've tried various compilers (GNU, Intel) and various MPI libraries (OpenMPI 1.4.3, MPICH2 1.4.1p1). Although most other computations finish successfully (whether run on a single processor or on multiple processors), the above computation behaves equally badly in every case.

Could somebody please give me a hint on how to resolve this issue? I have run out of ideas... ;-(

Thanks a lot in advance!
Tom Rebok, Czech Republic.

PS: The build log is available here: make.log