CentOS 5.7 / RHEL 5.2: 64-bit NWChem 6.1 builds OK but crashes at run time with a segmentation violation


Clicked A Few Times
Hi, I am new to the forum, though I use NWChem 6.0 quite intensively. The new DFT features in 6.1 are of interest, but jobs always crash with a Segmentation Violation error:

"0:Segmentation Violation error, status=: 11 
(rank:0 hostname:lasso.bw02.fiu.edu pid:8132):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0 rank 0 in job 2 lasso.bw02.fiu.edu_38648 caused collective abort of all ranks exit status of rank 0: return code 11 "

I have tried MPICH2, Open MPI, and MVAPICH2, but runs always finish with this error (c2h4 test; the memory set in the .nw input was varied from 1 GB down to 256 MB on an 8-CPU node with 8 GB of RAM; a sketch of the directive follows the script). The compilation script is below (both MPICH2 and NWChem 6.1 were compiled using Intel v. 12):


export LARGE_FILES=TRUE
echo LARGE_FILES=$LARGE_FILES
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=all
export ENABLE_COMPONENT=yes
export TCGRSH=/usr/bin/ssh

export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y

export MPI_HOME=$HOME/mpich2
export MPI_LOC=$MPI_HOME
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-L${MPI_LIB} -lmpich -lopa -lmpl -lpthread -lrt"

make nwchem_config
make
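(For reference, the memory line in the c2h4 input was varied roughly as follows; a sketch, since the exact input file is not shown in the thread:)

memory total 1024 mb   # varied in steps down to: memory total 256 mb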



There is a difference in the job log files between 6.1 and 6.0:
the 6.0 job log starts with "ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP Sockets'." followed by "argument 1 = 1.nw";
the 6.1 job log starts with just "argument 1 = 1.nw".

Need help,
regards

Forum Vet
Looks like the 6.0 version was compiled over TCP/IP sockets and is run using "parallel.x". The 6.1 build somehow seems to be running as a serial version.

Some more info is needed:

1. What kind of platform are you compiling on?

2. We need more of the output to understand where it fails. Also, search for nproc in the output: how many procs is it using?
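(For instance, something like the following, assuming the job output went to a file such as 1.out:)

grep -i nproc 1.out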

Bert



Clicked A Few Times

Hi, Bert. Thank you for the prompt reply.



1. Platform information:
OS- CentOS 5.7
Intel Dual Socket Motherboard with Intel 5520 Chipset
Intel Xeon E5645 / 6 Core Processors / 2.4GHz / 5.86 QPI GT/sec
48GB DDR3 1066 Mhz ECC Registered System Memory
10/100/1000 BaseT Gigabit Network Adapter on Motherboard
Single Port DDR InfiniBand HCA Mellanox Technologies MT25204 [InfiniHost III Lx HCA]

Custom installation of CBeST v. 3.0 Beowulf cluster software including:
Setup of OS including Custom Kernel for Performance Optimization
Message Passing Libraries (MPICH, MPICH2, Open MPI)
TORQUE Batch Scheduling System
Ganglia Cluster Monitoring Utility and Custom Scripting
System Imager for Remote Node Installation, Updating and Addition
PowerScripts for Remote Cluster or Node Reboot, Remote Cluster or Node Shutdown
Security Patches (Port Mapper, IP Chains/Tables) and Latest Security Updates
Scientific Libraries (LAPACK & BLAS)
GNU Compilers (C & Fortran)



2. The job log files contain nproc= set to the number defined in PBS (nproc > 1).



3. I get this from the configure output (it may be useful):
Aggregate Remote Memory Copy Interface (ARMCI) configured as follows:
configure: **************************************************************
configure:
configure: TARGET=LINUX64
configure: MSG_COMMS=TCGMSGMPI
configure: GA_MP_LIBS= -lmpich -lopa -lmpl -lpthread -lrt
configure: GA_MP_LDFLAGS= -L/home/morozov/mpich2/lib -L//home/morozov/mpich2/lib -L/home/morozov/mpich2/lib
configure: GA_MP_CPPFLAGS= -I/home/morozov/mpich2/include -I/home/morozov/mpich2/include
configure: ARMCI_NETWORK=SOCKETS
configure: ARMCI_NETWORK_LDFLAGS=
configure: ARMCI_NETWORK_LIBS=
configure: ARMCI_NETWORK_CPPFLAGS=
configure: F77=/opt/intel/fce/11.1/bin/intel64/ifort
configure: FFLAGS=
configure: FFLAG_INT=-integer-size 64
configure: ARMCI_FOPT=-O3 -w -cm -xW -tpp7
configure: CC=/opt/intel/cce/11.1/bin/intel64/icc
configure: CFLAGS=
configure: ARMCI_COPT=
configure: CPP=/opt/intel/cce/11.1/bin/intel64/icc -E
configure: CPPFLAGS=
configure: LDFLAGS=
configure: LIBS=
configure: FLIBS= -L/opt/intel/fce/11.1/lib/intel64 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../.. -L/lib64 -L/lib -L/usr/lib64 -L/usr/lib -lifport -lifcore -limf -lsvml -lm -lipgo -lirc -lpthread -lirc_s -ldl
configure: BLAS_LDFLAGS=
configure: BLAS_LIBS=-lblas
configure: BLAS_CPPFLAGS=
configure: AR=ar
configure: AR_FLAGS=cru
configure: CCAS=/opt/intel/cce/11.1/bin/intel64/icc
configure: CCAS_FLAGS=
configure: DEFS=-DHAVE_CONFIG_H
configure: SHELL=/bin/sh
configure: MPIEXEC=/opt/mpich2/ch3_mrail_gen2-intel11/bin/mpirun -n %NP%
configure: NPROCS=4
configure:
configure:
configure: **************************************************************
configure: Global Arrays (GA) configured as follows:
configure: **************************************************************
configure:
configure: TARGET=LINUX64
configure: MSG_COMMS=TCGMSGMPI
configure: GA_MP_LIBS= -lmpich -lopa -lmpl -lpthread -lrt
configure: GA_MP_LDFLAGS= -L/home/morozov/mpich2/lib -L//home/morozov/mpich2/lib -L/home/morozov/mpich2/lib
configure: GA_MP_CPPFLAGS= -I/home/morozov/mpich2/include -I/home/morozov/mpich2/include
configure: ARMCI_NETWORK=SOCKETS
configure: ARMCI_NETWORK_LDFLAGS=
configure: ARMCI_NETWORK_LIBS=
configure: ARMCI_NETWORK_CPPFLAGS=
configure: F77=/opt/intel/fce/11.1/bin/intel64/ifort
configure: FFLAGS=
configure: FFLAG_INT=-integer-size 64
configure: GA_FOPT=-O3 -w -cm -xW -tpp7
configure: CC=/opt/intel/cce/11.1/bin/intel64/icc
configure: CFLAGS=
configure: GA_COPT=
configure: CPP=/opt/intel/cce/11.1/bin/intel64/icc -E
configure: CPPFLAGS=
configure: LDFLAGS=
configure: LIBS=
configure: FLIBS= -L/opt/intel/fce/11.1/lib/intel64 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../.. -L/lib64 -L/lib -L/usr/lib64 -L/usr/lib -lifport -lifcore -limf -lsvml -lm -lipgo -lirc -lpthread -lirc_s -ldl
configure: BLAS_LDFLAGS=
configure: BLAS_LIBS=-lblas
configure: BLAS_CPPFLAGS=
configure: AR=ar
configure: AR_FLAGS=cru
configure: CCAS=/opt/intel/cce/11.1/bin/intel64/icc
configure: CCAS_FLAGS=
configure: DEFS=-DHAVE_CONFIG_H
configure: SHELL=/bin/sh
configure: MPIEXEC=/opt/mpich2/ch3_mrail_gen2-intel11/bin/mpirun -n %NP%
configure: NPROCS=4

Regards,
Alex

Clicked A Few Times
Bert, I have sent config.log to bert.dejong@pnnl.gov.
Thank you for the help.
Alex

Clicked A Few Times
"Alex, It seems there is an incompatibility with the blas an lapack libraries that are linked in with the 64-bit version. What I have learned is that on some linux distributions the internal blas and lapack libraries are 32-bit. Hence, when you link against a 64-bit code this causes issues. To assess if this is the case in your 64-bit binaries you should do an ldd on your binary and send me the info.
Bert "
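(For example, a check along these lines, assuming the binary sits in the usual build location:)

ldd $NWCHEM_TOP/bin/LINUX64/nwchem | grep -i -E 'blas|lapack'

A system libblas.so showing up here in a LINUX64 build, which uses 8-byte integers (FFLAG_INT=-integer-size 64 in the configure output above), would be exactly the mismatch described.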

Bert,
You were right. The error was caused by the BLAS/LAPACK libraries.
The NWChem code compiled against Intel MKL works fine.


MKL_HOME="/opt/intel/composerxe/mkl"
MKL_LIB="$MKL_HOME/lib/intel64";
MKL_INCLUDE="$MKL_HOME/include:$MKL_HOME/include/intel64";

export HAS_BLAS=yes
export BLASOPT=" -L$MKL_LIB "
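(Note that this BLASOPT only supplies a library search path; the commonly documented MKL link line for a LINUX64 build also names the 8-byte-integer MKL libraries explicitly. A sketch, with the exact library names depending on the MKL version:)

export BLASOPT="-L$MKL_LIB -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm"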



1. Successful script


export NWCHEM_TOP=${HOME}/nwchem/6.1/tcp
export LARGE_FILES=TRUE
echo LARGE_FILES=$LARGE_FILES
export NWCHEM_TARGET=LINUX64
echo NWCHEM_TARGET=$NWCHEM_TARGET
export NWCHEM_MODULES=all
echo NWCHEM_MODULES=$NWCHEM_MODULES
export ENABLE_COMPONENT=yes
echo ENABLE_COMPONENT=$ENABLE_COMPONENT
export TCGRSH=/usr/bin/ssh
echo TCGRSH=$TCGRSH
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y

MKL_HOME="/opt/intel/composerxe/mkl"
MKL_LIB="$MKL_HOME/lib/intel64";
MKL_INCLUDE="$MKL_HOME/include:$MKL_HOME/include/intel64";

export HAS_BLAS=yes
export BLASOPT=" -L$MKL_LIB "
export MPI_HOME=$HOME/mpich2
export MPI_LOC=$MPI_HOME
echo MPI_HOME=$MPI_HOME
echo MPI_LOC=$MPI_LOC

export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
echo MPI_LIB=$MPI_LIB
echo MPI_INCLUDE=$MPI_INCLUDE
export LIBMPI="-L${MPI_LIB} -lmpich -lopa -lmpl -lpthread -lrt"
export FPATH=$FPATH:$MKL_INCLUDE

cd $NWCHEM_TOP/src
make realclean
make nwchem_config
make
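(A typical launch with this build, assuming the binary lands in the default location and eight processes, as in the original test:)

$MPI_HOME/bin/mpirun -np 8 $NWCHEM_TOP/bin/LINUX64/nwchem 1.nw > 1.out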

