Error: Last System Error Message from Task 0:: No such file or directory


Clicked A Few Times
Hi all,

NWChem 6.8 was recently built on the computing cluster at my institution, and I am trying to verify that everything works by running a simple test input from the NWChem manual. I am attempting to run the following input on a single core:


scratch_dir /n/home03/gstec/tests
permanent_dir /n/home03/gstec/tests

start h2o_freq
charge 1
geometry units angstroms
 O       0.0  0.0  0.0
 H       0.0  0.0  1.0
 H       0.0  1.0  0.0
end
basis
 H library sto-3g
 O library sto-3g
end
scf
 uhf; doublet
 print low
end
title "H2O+ : STO-3G UHF geometry optimization"
task scf optimize
basis
 H library 6-31g**
 O library 6-31g**
end
title "H2O+ : 6-31g** UMP2 geometry optimization"
task mp2 optimize
mp2; print none; end
scf; print none; end
title "H2O+ : 6-31g** UMP2 frequencies"
task mp2 freq


I have run this exact input on my personal computer, and it runs fine (granted, SCF convergence is not reached, but it terminates normally). I then run it through a Slurm queue with the following command:


srun -n 1 nwchem water.nw >& test.log
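
For reference, the equivalent batch submission would look roughly like the sketch below; the partition name and time limit are placeholders for illustration, not values taken from the actual job:

#!/bin/bash
#SBATCH -n 1
#SBATCH -t 00:30:00
#SBATCH -p shared

srun -n 1 nwchem water.nw >& test.log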


When I do, I get the following error at the end of the output:

                                 NWChem SCF Module
                                 -----------------


                      H2O+ : STO-3G UHF geometry optimization



  ao basis        = "ao basis"
  functions       =     7
  atoms           =     3
  alpha electrons =     5
  beta  electrons =     4
  charge          =   1.00
  wavefunction    = UHF
  input vectors   = atomic
  output vectors  = /n/home03/gstec/tests/h2o_freq.movecs
  use symmetry    = T
  symmetry adapt  = T

Last System Error Message from Task 0:: No such file or directory
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 11.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------


Details of the cluster and the conditions under which NWChem was built:
CentOS7
OpenMPI 2.0.1
GCC Version 7.1.0
Intel Math Kernel Library 2017.2.174

Environment variables set for compiling:

export NWCHEM_TOP="$FASRCSW_DEV"/rpmbuild/BUILD/%{name}-%{version}-release/
export USE_MPI=y
export NWCHEM_TARGET=LINUX64  
export USE_PYTHONCONFIG=y  
export PYTHONVERSION=2.7
export ARMCI_NETWORK=OPENIB
export BLASOPT="-L${MKL_HOME}/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lpthread -lm"
export SCALAPACK="-L${MKL_HOME}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_core -lmkl_sequential -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export PYTHONHOME=/usr
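
As far as I know, the build itself then followed the standard NWChem sequence, roughly along these lines (the module selection is my guess, since the build was done through our cluster's RPM framework):

cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES="all python"
make >& make.log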


Any information on this error and how to resolve it would be greatly appreciated. Please let me know if you need any other information from me.

Forum Vet
Does it make any difference if you start NWChem with mpirun instead of srun?
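
For a single-task run that would be something like

mpirun -np 1 nwchem water.nw >& test.log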

Clicked A Few Times
Edoapra,

Starting with mpirun gives the following error:


                                 NWChem SCF Module
                                 -----------------


                      H2O+ : STO-3G UHF geometry optimization



  ao basis        = "ao basis"
  functions       =     7
  atoms           =     3
  alpha electrons =     5
  beta  electrons =     4
  charge          =   1.00
  wavefunction    = UHF
  input vectors   = atomic
  output vectors  = /n/home03/gstec/tests/h2o_freq.movecs
  use symmetry    = T
  symmetry adapt  = T


 Forming initial guess at       1.4s

0:Segmentation Violation error, status=: 11
(rank:0 hostname:holy7c13310.rc.fas.harvard.edu pid:178622):ARMCI DASSERT fail. ../../ga-5.6.5/armci/src/common/signaltrap.c:SigSegvHandler():315 cond:0
Last System Error Message from Task 0:: No such file or directory
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 11.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------


Forum Vet
Could you confirm that this is a single-processor run by searching for the following line in the output file?

   nproc           =       1
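
For example (using the log file name from your srun command):

grep nproc test.log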

Clicked A Few Times
Indeed, this is a single-processor run:


           Job information
           ---------------

    hostname        = redacted
    program         = nwchem
    date            = Fri Jan  4 07:54:41 2019

    compiled        = Thu_Sep_20_10:57:14_2018
    source          = redacted
    nwchem branch   = Development
    nwchem revision = 0.0.1-852-g39d600b
    ga revision     = 5.6.5
    use scalapack   = T
    input           = water.nw
    prefix          = h2o_freq.
    data base       = /n/home03/gstec/tests/h2o_freq.db
    status          = startup
    nproc           =        1
    time left       =     -1s



           Memory information
           ------------------

    heap     =   13107198 doubles =    100.0 Mbytes
    stack    =   13107195 doubles =    100.0 Mbytes
    global   =   26214400 doubles =    200.0 Mbytes (distinct from heap & stack)
    total    =   52428793 doubles =    400.0 Mbytes
    verify   = yes
    hardfail = no


           Directory information
           ---------------------

  0 permanent = /n/home03/gstec/tests
  0 scratch   = /n/home03/gstec/tests

Forum Vet
error file
Ryan,
Do you have any other error file? That might help identify the problem.
The only possible fix I have for a similar problem requires recompiling the tools/GA directory.

Forum Vet
tools fix
This is the way to implement the fix for the tools directory:

cd $NWCHEM_TOP/src/tools
rm -rf ga-5.6.5* build install
wget https://github.com/edoapra/ga/releases/download/v5.6.5/ga-5.6.5.tar.gz
make
cd ..
make link
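
After the final make link, the relinked binary should end up at

$NWCHEM_TOP/bin/LINUX64/nwchem

(assuming NWCHEM_TARGET=LINUX64, as in your settings), so make sure the job is pointing at that freshly linked executable.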

Clicked A Few Times
I tried that fix but am still getting the same error. The output is the only error file I have. Is there a way to produce more verbose output that could help identify the problem?

Forum Vet
Env. variables
I have just spotted an issue in your environment variables: your BLASOPT and SCALAPACK settings link the 8-byte-integer (ilp64) MKL libraries, so the matching integer-size variables need to be set as well.
Please add the following:

export BLAS_SIZE=8
export SCALAPACK_SIZE=8

Then recompile:

cd $NWCHEM_TOP/src/tools
rm -rf build install
make
cd ..
make link

Clicked A Few Times
Unfortunately, setting these two environment variables and recompiling gives the same error when I run the job.

Forum Vet
Could you please post the following files on a public website?

$NWCHEM_TOP/src/tools/build/config.log
$NWCHEM_TOP/src/tools/build/armci/config.log
$NWCHEM_TOP/src/tools/build/comex/config.log

Forum Vet
Development version?
By the way, your output shows that you are not using a release version (e.g. 6.8 or 6.8.1) but a development version instead:

 nwchem branch   = Development

I strongly recommend moving to the 6.8.1 release source.

