Strange behavior running in parallel


Just Got Here
Hello, I'm experiencing a strange issue, and I'm not sure whether it's related to MPI, the SGE scheduler, or NWChem itself.
When running with 1, 2, 4, or 8 procs on a single node, it runs fine, but when I run with 6 or 12 procs, it fails with the error message below. For certain input files, I get the same errors with particular process counts as well. Can someone explain this and point me in a direction to troubleshoot it, please?
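For context, the job goes through SGE and is launched with mpirun. The script below is only a rough sketch of the submission (the parallel environment name, paths, and input file are placeholders, not my exact setup):

#!/bin/bash
#$ -N nwchem_job            # job name (placeholder)
#$ -pe orte 6               # SGE parallel environment and slot count; the PE name is site-specific
#$ -cwd -j y

# Illustrative paths, matching the build described further down
export PATH=/usr/local/openmpi/1.6.3/enet/intel13/bin:$PATH
mpirun -np $NSLOTS /usr/local/NWChem-6.1.1/bin/LINUX64/nwchem input.nw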
Here is a snippet from the output:

 symmetry adapt = T

 Forming initial guess at       1.1s

Error in pstein5. eval  is different on processors 0 and 1 
Error in pstein5. me = 0 exiting via pgexit.
Error in pstein5. eval is different on processors 1 and 0
Error in pstein5. me = 1 exiting via pgexit.
Last System Error Message from Task 1:: Inappropriate ioctl for device
Last System Error Message from Task 0:: Inappropriate ioctl for device
 ME =                      0  Exiting via 
0:0: peigs error: mxpend:: 0
(rank:0 hostname:node13 pid:13469):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
 ME =                      1  Exiting via 
1:1: peigs error: mxpend:: 0
(rank:1 hostname:node13 pid:13470):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0


MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.


forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libmkl_sequential 00002AD009ED6150 Unknown Unknown Unknown
Last System Error Message from Task 2:: Inappropriate ioctl for device
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
nwchem 0000000002FF271E Unknown Unknown Unknown
nwchem 0000000002FF11B6 Unknown Unknown Unknown
nwchem 0000000002F939B2 Unknown Unknown Unknown
nwchem 0000000002F4135B Unknown Unknown Unknown
nwchem 0000000002F46E53 Unknown Unknown Unknown
nwchem 0000000002EB5F3F Unknown Unknown Unknown
nwchem 0000000002E9108F Unknown Unknown Unknown
libc.so.6 000000309A432920 Unknown Unknown Unknown
libmpi.so.1 00002B3007E26A99 Unknown Unknown Unknown
libmpi.so.1 00002B3007D594C2 Unknown Unknown Unknown
mca_coll_tuned.so 00002B300DB4F8EE Unknown Unknown Unknown
mca_coll_tuned.so 00002B300DB58618 Unknown Unknown Unknown
libmpi.so.1 00002B3007D680FD Unknown Unknown Unknown
nwchem 0000000002E125A0 Unknown Unknown Unknown
nwchem 0000000002E72CB2 Unknown Unknown Unknown
nwchem 0000000002E4471B Unknown Unknown Unknown
nwchem 00000000009AA79E Unknown Unknown Unknown
nwchem 00000000009C7347 Unknown Unknown Unknown
nwchem 00000000009ACA49 Unknown Unknown Unknown
nwchem 00000000005B944A Unknown Unknown Unknown
nwchem 0000000000501C57 Unknown Unknown Unknown
nwchem 000000000050118B Unknown Unknown Unknown
nwchem 000000000064BE1F Unknown Unknown Unknown
nwchem 00000000005049C1 Unknown Unknown Unknown
nwchem 00000000004F17A2 Unknown Unknown Unknown
nwchem 00000000004E639B Unknown Unknown Unknown
nwchem 00000000004E5E7C Unknown Unknown Unknown
libc.so.6 000000309A41ECDD Unknown Unknown Unknown
nwchem 00000000004E5D79 Unknown Unknown Unknown
Last System Error Message from Task 3:: Inappropriate ioctl for device
forrtl: error (78): process killed (SIGTERM)

Forum Vet
Compilation settings
Asa,
Could you please describe your compilation settings?

Just Got Here
Quote:Edoapra Nov 20th 10:15 am
Asa,
Could you please describe your compilation settings?


I compiled with Intel 2013 and Open MPI 1.6.3 on RHEL6. Here are the settings:

export NWCHEM_TOP=/usr/local/NWChem-6.1.1
export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export NWCHEM_TARGET=LINUX64
export USE_MPI=y
export USE_MPIF=y
export NWCHEM_MODULES=all
export USE_MPIF4=y
export MPI_LOC=/usr/local/openmpi/1.6.3/enet/intel13
export MPI_LIB=/usr/local/openmpi/1.6.3/enet/intel13/lib
export MPI_INCLUDE=/usr/local/openmpi/1.6.3/enet/intel13/include
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
export USE_NOFSCHECK=TRUE
export HAS_BLAS=y
export MKL64=/usr/local/intel/mkl/lib/intel64
export BLASOPT="-L${MKL64} -Wl,--start-group -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -Wl,--end-group"
make nwchem_config
export NWCHEM_MODULES="all"
make FC=ifort CC=gcc

And this is from an auto-generated configuration file (nwchem_config.h):
# This configuration generated automatically on grawp at Thu Nov 15 18:07:54 EST 2012
# Request modules from user: all
NW_MODULE_SUBDIRS = NWints atomscf ddscf gradients moints nwdft rimp2 stepper driver optim cphf ccsd vib mcscf prepar esp hessian selci dplot mp2_grad qhop property nwpw fft analyz nwmd cafe space drdy vscf qmmm qmd etrans tce bq cons perfm dntmc dangchang ccca
NW_MODULE_LIBS = -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lcons -lperfm -ldntmc -lccca
EXCLUDED_SUBDIRS = develop scfaux plane_wave oimp2 rimp2_grad python argos diana uccsdt rism geninterface transport smd nbo leps
CONFIG_LIBS =
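In case it is useful, one way I can double-check which MPI and MKL libraries the binary actually picked up is something like the following (small_test.nw is just a placeholder input):

ldd /usr/local/NWChem-6.1.1/bin/LINUX64/nwchem | egrep 'mpi|mkl'
mpirun -np 6 /usr/local/NWChem-6.1.1/bin/LINUX64/nwchem small_test.nw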

Forum Vet
Intel 2013
Asa,
We have limited experience with ifort from the Intel 2013 bundle,
but we have already seen quite a few routines generating incorrect results.
If you have access to an older version of the Intel compilers, I would suggest switching to it;
as a second (and even safer) alternative, you might want to use gfortran.
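If you do switch to gfortran, the build environment would change roughly as below. This is only a sketch: the MKL interface layer moves from the Intel to the gfortran convention, your Open MPI would also need to have been built with gfortran, and the paths need to be adjusted to your installation.

export NWCHEM_TOP=/usr/local/NWChem-6.1.1
export NWCHEM_TARGET=LINUX64
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export HAS_BLAS=y
export MKL64=/usr/local/intel/mkl/lib/intel64
# gfortran calling convention: mkl_gf_ilp64 instead of mkl_intel_ilp64
export BLASOPT="-L${MKL64} -Wl,--start-group -lmkl_gf_ilp64 -lmkl_sequential -lmkl_core -Wl,--end-group"
make nwchem_config
make FC=gfortran CC=gcc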

Just Got Here
Quote:Edoapra Nov 20th 2:20 pm
Asa,
We have limited experience with ifort from the Intel 2013 bundle,
but we have already seen quite a few routines generating incorrect results.
If you have access to an older version of the Intel compilers, I would suggest switching to it;
as a second (and even safer) alternative, you might want to use gfortran.


Thanks, I will try gfortran. I have limited experience with NWChem; does the error below mean anything? Just curious.
Error in pstein5. eval is different on processors 0 and 1

Forum Vet
Quote:Asa Nov 20th 2:30 pm
I have limited experience with NWChem; does the error below mean anything? Just curious.
Error in pstein5. eval is different on processors 0 and 1


It means that the parallel eigensolver (PeIGS) is failing. Either the installation of the eigensolver has problems,
or the matrix that needs to be diagonalized is not what it is supposed to be.

Forum Vet
Patch Needed
Asa
I was able to reproduce your failure. It is caused by the optimized code generated by the Intel Fortran Compiler 13.0.
Please apply the patch below to the NWChem 6.1.1 source code and recompile:
http://nwchemgit.github.io/images/Dstebz3.patch.gz
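Applying it would look roughly like this (a sketch only; the -p level may need adjusting depending on where you unpack the patch):

cd $NWCHEM_TOP/src
wget http://nwchemgit.github.io/images/Dstebz3.patch.gz
gzip -d Dstebz3.patch.gz
patch -p0 < Dstebz3.patch
# rebuild so the patched routine gets recompiled
make FC=ifort CC=gcc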

Cheers, Edo

Just Got Here
I got it running OK with Intel 12. I will try again with Intel 13. Thank you very much for your effort. You ROCK!

Quote:Edoapra Nov 30th 3:55 pm
Asa
I was able to reproduce your failure. It is caused by the optimized code generated by the Intel Fortran Compiler 13.0.
Please apply the patch below to the NWChem 6.1.1 source code and recompile:
http://nwchemgit.github.io/images/Dstebz3.patch.gz

Cheers, Edo

