NWChem seg fault crashes on Centos 7.2


Clicked A Few Times
We are trying to get NWChem 6.6 working on fully patched 64-bit CentOS 7.2 systems, but are having some trouble.
It compiles and links cleanly, but seg faults and crashes immediately on startup.

If compiled against OpenMPI, we see:

/usr/lib64/openmpi/bin/mpiexec


mpiexec noticed that process rank 0 with PID 8294 on node moose16 exited on signal 11 (Segmentation fault).




Against MPICH:
/usr/lib64/mpich/bin/mpiexec

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:

Program received signal SIGILL: Illegal instruction.

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:

Backtrace for this error:
#0  0x2B2E96996467
#0  0x2AAE50407467
#0  0x2ADE4D748467
#0  0x2B32B1BBB467
#1  0x2ADE4D748AAE
#1  0x2AAE50407AAE
#1  0x2B2E96996AAE
#1  0x2B32B1BBBAAE
#2  0x2B2E9782F66F
#2  0x2B32B2A5466F
#2  0x2ADE4E5E166F
#2  0x2AAE512A066F
#3  0x2E6F972 in mxinit_ at mxsubs.F:118
#3  0x2E6F972 in mxinit_ at mxsubs.F:118
#3  0x2E6F972 in mxinit_ at mxsubs.F:118
#3  0x2E6F972 in mxinit_ at mxsubs.F:118
#4  0x4F49C2 in nwchem at nwchem.F:89
#4  0x4F49C2 in nwchem at nwchem.F:89
#4  0x4F49C2 in nwchem at nwchem.F:89
#4  0x4F49C2 in nwchem at nwchem.F:89

=======================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 132
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=======================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Illegal instruction (signal 4)
This typically refers to a problem with your application.


We have tried applying the NWChem patches available on the download page, including the latest ones for Fedora 24, but no joy.

Any hints would be appreciated, thanks!

-Dj
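
A note on the "EXIT CODE: 132" line in the log above: by the usual shell convention, a process killed by signal N is reported with exit status 128 + N, so 132 corresponds to signal 4 (SIGILL), which agrees with the exit string in the same log. A minimal sketch of that decoding, assuming a POSIX shell whose kill builtin accepts a number after -l; the function name signal_from_exit_code is made up for illustration:

```shell
# Translate a shell-style exit status into a signal name.
# signal_from_exit_code is a hypothetical helper, not part of NWChem or MPI.
signal_from_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        # Statuses above 128 conventionally mean "killed by signal (code - 128)".
        kill -l $((code - 128))
    else
        echo "not a signal exit"
    fi
}

signal_from_exit_code 132
```

An illegal-instruction signal at startup is often a hint that the binary contains instructions the CPU it runs on does not support.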

Clicked A Few Times
Followup: Apparently NWChem will only run on the same type of CPU as the one it was compiled on.
We have a mix of AMD and Intel systems on our HPC Grid, and due to current Grid usage,
the job was always getting assigned to an AMD node, but the machine that things are compiled on is an Intel box.

If anyone else runs into a similar issue, this might be something to check.

fyi
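
One way to check for this kind of mismatch is to compare the CPU feature flags of the build machine with those of the node the job lands on. The sketch below assumes Linux nodes that expose a "flags" line in /proc/cpuinfo; the helper name missing_cpu_flags and the host name amd-node are placeholders, not anything from this thread:

```shell
# Print the flags that appear in the first list but not in the second.
# missing_cpu_flags is a hypothetical helper written for this illustration.
missing_cpu_flags() {
    build_flags=$1
    run_flags=$2
    missing=""
    for flag in $build_flags; do
        case " $run_flags " in
            *" $flag "*) ;;                    # flag is available on the run node
            *) missing="$missing $flag" ;;     # flag exists only on the build node
        esac
    done
    echo "${missing# }"
}

# Example usage (amd-node is a placeholder host name):
#   build=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
#   run=$(ssh amd-node "grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2")
#   missing_cpu_flags "$build" "$run"
```

Any flag printed is a feature the compiler may have used on the build machine that the run node lacks. An empty result does not prove the binary is portable, it only rules out this one cause.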

Clicked A Few Times
I am having the same exact issue. I have tried

1) Using yum to install "yum install nwchem nwchem-openmpi-x86_64 nwchem-mpich-x86_64"

On Intel processors, nwchem runs fine. On AMD64 processors I get the segmentation fault. Otherwise the CentOS7 install is the same.

2) Compiling myself. The code compiles on AMD64 using the CentOS7.1 instructions on the manual page with the following patches

Config_libs66.patch
Ga_argv.patch
Tools_lib64.patch

Otherwise I followed the NWChem Installation instructions for CentOS7.1 and NWChem 6.6.

I have to include the compat-openmpi16 libraries for libmpi_f77.so.1 and libmpi_f90.so.1
because those are apparently missing from the lib64/openmpi libraries

using:

export LD_LIBRARY_PATH=/usr/lib64/compat-openmpi16/lib/:/usr/lib64/openmpi/lib/:$LD_LIBRARY_PATH


This is what I get when trying to run on AMD processors with CentOS7:


Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7F1AEE100467
#1  0x7F1AEE100AAE
#2  0x7F1AECE3966F
#3  0x7F1AEB6EDAFA
#4  0x7F1AEA633FDC
#5  0x7F1AEA633460
#6  0x7F1AEA63392C
#7  0x7F1AEA6342E4
#8  0x7F1AEE44F414
#9  0x7F1AEE44DF9C
#10  0x7F1AEE470512
#11  0x2BC5FFD in tcgi_alt_pbegin
#12  0x2BC60B3 in tcgi_pbegin
#13  0x2BC51E5 in pbeginf_
#14  0x4F4A79 in nwchem at nwchem.F:84


Any help? I've tried to reach out to Marcin Dulak who seems to compile the nwchem rpms but I haven't heard back yet.


If this is helpful here is my cpuinfo:

Processor name  : AMD Opteron(tm) Processor 6136
Packages(sockets) : 4
Cores  : 32
Processors(CPUs)  : 32
Cores per package : 8
Threads per core  : 1


Here is my centos-release

CentOS Linux release 7.2.1511 (Core)


and uname -a for the kernel

Linux 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Thanks in advance to anyone who could help.

Josh

Clicked A Few Times
Is the issue that CentOS7 only has ga 5.3b? Is there a workaround to get it to work with ga 5.3b?

Forum Vet
mpif90 used to compile?
1) What mpif90 have you used to compile? In other words, what is the output of the commands
mpif90 -show
which mpif90

2) Have you set, by any chance, any of the following environment variables: MPI_LIB, LIBMPI or MPI_INCLUDE?

3) Are you using any additional library (e.g. Scalapack)?
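
The answers to question 2 can be collected in one go. A minimal sketch, assuming a POSIX shell; report_mpi_env is a made-up helper name, and it only reports the three variables named above:

```shell
# Report whether each MPI-related build variable is set, and to what.
# report_mpi_env is a hypothetical helper for illustration only.
report_mpi_env() {
    for name in MPI_LIB LIBMPI MPI_INCLUDE; do
        # ${VAR+set} expands to "set" only when VAR is defined (even if empty).
        if eval "[ \"\${$name+set}\" = set ]"; then
            eval "echo \"$name=\$$name\""
        else
            echo "$name is unset"
        fi
    done
}

report_mpi_env
```

Posting this output next to the output of "mpif90 -show" and "which mpif90" shows whether the variables point at the same MPI installation as the compiler wrapper.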

Clicked A Few Times
mpif90 -show

gfortran -I/usr/include/openmpi-x86_64 -pthread -m64 -I/usr/lib64/openmpi/lib -Wl,-rpath -Wl,/usr/lib64/openmpi/lib -Wl,--enable-new-dtags -L/usr/lib64/openmpi/lib -lmpi_usempi -lmpi_mpifh -lmpi

which mpif90

/usr/lib64/openmpi/bin/mpif90


echo $MPI_LIB
/usr/lib64/openmpi/lib


echo $MPI_INCLUDE
/usr/include/openmpi-x86_64


I did not set LIBMPI, because the CentOS 7.1 install instructions do not include it:

http://nwchemgit.github.io/index.php/Compiling_NWChem#NWChem_6.6_on_Centos_7.1

I was able to get it to compile once, and it was running on Intel and AMD cores and in parallel. I applied every single one of the patches. However, when I tried again to make sure I knew what I had done, I wasn't able to repeat my success. I was getting this:

http://nwchemgit.github.io/Special_AWCforum/st/id2013/compiling_conflict.html


I am now trying Marcin Dulak's compilation instructions for the OpenMPI build of the EPEL nwchem, which ship with the EPEL nwchem_openmpi package, and not including EPEL in the compile.

Forum Vet
I think that your build has some components that require the older OpenMPI libraries (now shipped with compat-openmpi16),
so you are forced to add /usr/lib64/compat-openmpi16/lib to your LD_LIBRARY_PATH, and this eventually causes the SIGSEGV crash.

I was able to reproduce this behavior in my CentOS 7 installation
(in my case it was the ELPA library still using the old OpenMPI 1.6 libraries;
once I removed ELPA from my build, I no longer needed compat-openmpi16 and the SIGSEGV vanished).

If you upload the following files to a public website, I might be able to help you

$NWCHEM_TOP/src/tools/build/config.log
$NWCHEM_TOP/src/tools/build/armci/config.log
$NWCHEM_TOP/src/tools/build/comex/config.log
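
A quick way to see whether a binary pulls MPI libraries from more than one OpenMPI installation is to look at where its libmpi* dependencies resolve. This is a sketch, assuming ldd output in the usual "name => /path (address)" form; mpi_lib_dirs is a made-up helper name, and the binary path in the usage comment is an assumption about where the build puts the executable:

```shell
# Read `ldd` output on stdin and print the unique directories
# that libmpi* libraries resolve to.
# mpi_lib_dirs is a hypothetical helper for illustration only.
mpi_lib_dirs() {
    awk '$1 ~ /^libmpi/ && $2 == "=>" && $3 ~ /^\// {
             sub("/[^/]*$", "", $3)   # drop the file name, keep the directory
             print $3
         }' | sort -u
}

# Example usage (binary location is an assumption):
#   ldd "$NWCHEM_TOP/bin/LINUX64/nwchem" | mpi_lib_dirs
```

If more than one directory is printed, for example both the compat-openmpi16 and the openmpi library directories, the binary is mixing MPI versions at run time, which is the situation described above.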

Clicked A Few Times
Thank you very much!

Indeed, I was struggling with having to link against the compat-openmpi16 libraries for libmpi_f77 and libmpi_f90. I removed ELPA as you suggested and then used a modified compile script based on the OpenMPI compile script that Marcin Dulak ships with his epel7 nwchem package. I added ScaLAPACK.

This gave me a build that seems to work after "module load mpi/openmpi-x86_64". It runs NWChem 6.6 on CentOS 7.2 and seems to work on both my Intel and AMD nodes.


cd $NWCHEM_TOP
sed -i 's|-march=native||' src/config/makefile.h
sed -i 's|-mtune=native|-mtune=generic|' src/config/makefile.h
sed -i 's|-mfpmath=sse||' src/config/makefile.h
sed -i 's|-msse3||' src/config/makefile.h
patch -p0 < ../Tddft_mxvec20.patch
patch -p0 < ../Tools_lib64.patch
patch -p0 < ../Config_libs66.patch
patch -p0 < ../Cosmo_meminit.patch
patch -p0 < ../Sym_abelian.patch
patch -p0 < ../Xccvs98.patch
patch -p0 < ../Dplot_tolrho.patch
patch -p0 < ../Driver_smalleig.patch
patch -p0 < ../Ga_argv.patch
patch -p0 < ../Raman_displ.patch
patch -p0 < ../Ga_defs.patch
patch -p0 < ../Zgesvd.patch
patch -p0 < ../Cosmo_dftprint.patch
patch -p0 < ../Txs_gcc6.patch
patch -p0 < ../Gcc6_optfix.patch
patch -p0 < ../Util_gnumakefile.patch
patch -p0 < ../Util_getppn.patch
patch -p0 < ../Gcc6_macs_optfix.patch
patch -p0 < ../Notdir_fc.patch
cd $NWCHEM_TOP/src
export NWCHEM_TARGET=LINUX64
export CC=gcc
export FC=gfortran
export USE_ARUR=TRUE
export USE_NOFSCHECK=TRUE
export NWCHEM_FSCHECK=N
export LARGE_FILES=TRUE
export MRCC_THEORY=Y
export EACCSD=Y
export IPCCSD=Y
export CCSDTQ=Y
export CCSDTLR=Y
export NWCHEM_LONG_PATHS=Y
export PYTHONHOME=/usr
export PYTHONVERSION=2.7
export PYTHONLIBTYPE=so
export USE_PYTHON64=y
export HAS_BLAS=yes
export BLASOPT='-L/usr/lib64 -lopenblas'
export BLAS_SIZE='4'
export SCALAPACK_SIZE='4'
export SCALAPACK='-L/usr/lib64/openmpi/lib -lscalapack -lmpiblacs'
export MAKE=/usr/bin/make
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
$MAKE nwchem_config NWCHEM_MODULES="all python" 2>&1 | tee ../make_nwchem_config_openmpi.log
$MAKE -j 8 64_to_32 2>&1 | tee ../make_64_to_32_openmpi.log
export MAKEOPTS="USE_64TO32=y"
$MAKE -j 8 ${MAKEOPTS} 2>&1 | tee ../make_nwchem_openmpi.log
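
The sed lines at the top of the script remove the compiler flags that tie the binary to the build machine's CPU, which is presumably why the result runs on both Intel and AMD nodes. Before starting a long build it can be worth confirming that none of those flags are left. A minimal sketch; has_native_flags is a made-up helper name and it checks only the two "native" flags touched by the script:

```shell
# Succeed (exit status 0) if the text on stdin still contains
# a host-specific GCC flag.
# has_native_flags is a hypothetical helper for illustration only.
has_native_flags() {
    grep -q -e '-march=native' -e '-mtune=native'
}

# Example usage, run after the sed commands:
#   if has_native_flags < "$NWCHEM_TOP/src/config/makefile.h"; then
#       echo "host-specific flags are still present"
#   fi
```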

