6:52:51 AM PDT - Wed, Aug 29th 2012
Good day all,
We received a new cluster that is based on QLogic InfiniBand. I've spent a couple of days fiddling around with the build and I'm still having issues. The system is CentOS 6.2 based, gcc 4.4.6, QLogic fabric, QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02). I built Open MPI this way:
./configure --prefix=/shared/openmpi-1.6.1/gcc --enable-static --without-tm --with-openib=/usr --with-psm=/usr CC=gcc CXX=g++ F77=gfortran FC=gfortran --enable-mpi-thread-multiple
My NWChem environment:
export NWCHEM_TOP=/shared/build/nwchem-6.1.1-src
export NWCHEM_MODULES="all"
export INSTALL_PREFIX=/shared/nwchem-6.1.1
export CC=gcc
export FC=gfortran
export MPI_INCLUDE=/shared/openmpi-1.6.1/gcc/include
export MPI_LIB=/shared/openmpi-1.6.1/gcc/lib
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export ARMCI_NETWORK=MPI-MT
export TARGET=LINUX64
export LARGE_FILES=TRUE
export NWCHEM_TARGET=LINUX64
export IB_HOME=/usr
export IB_INCLUDE=/usr/include
export IB_LIB=/usr/lib64
export IB_LIB_NAME="-libverbs -libumad -lpthread"
NWChem builds and runs on two nodes, but fails after a couple of minutes with this error:
12:Segmentation Violation error, status=: 11
(rank:12 hostname:node033 pid:2797):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
I looked at this thread:
http://nwchemgit.github.io/Special_AWCforum/st/id435/#post_1562
and removed the RPM builds of BLAS and relinked; same error.
ldd on the nwchem binary gives:
ldd nwchem
linux-vdso.so.1 => (0x00007fff50351000)
libmpi_f90.so.1 => /shared/openmpi-1.6.1/gcc/lib/libmpi_f90.so.1 (0x00007f20e6c37000)
libmpi_f77.so.1 => /shared/openmpi-1.6.1/gcc/lib/libmpi_f77.so.1 (0x00007f20e6a02000)
libmpi.so.1 => /shared/openmpi-1.6.1/gcc/lib/libmpi.so.1 (0x00007f20e643f000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003e19200000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003e24e00000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003e23a00000)
libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00007f20e6139000)
libm.so.6 => /lib64/libm.so.6 (0x0000003e19e00000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003e25e00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003e19600000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003e19a00000)
libosmcomp.so.3 => /usr/lib64/libosmcomp.so.3 (0x00000033cd200000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00000033cda00000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f20e5f28000)
libpsm_infinipath.so.1 => /usr/lib64/libpsm_infinipath.so.1 (0x00007f20e5cd6000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003e1aa00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003e1a200000)
/lib64/ld-linux-x86-64.so.2 (0x0000003e18e00000)
libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00000033cd600000)
libinfinipath.so.4 => /usr/lib64/libinfinipath.so.4 (0x00007f20e5ac7000)
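A filter along these lines could be used to double-check that no system BLAS/LAPACK is still being picked up at link time. This is only a sketch: the function name check_blas and the pattern list (blas, lapack, atlas) are assumptions for illustration, not anything NWChem or Open MPI provides. None of the libraries in the listing above match that pattern.

```shell
# Hypothetical helper: reads ldd output on stdin and prints any line that
# looks like a BLAS/LAPACK/ATLAS shared library. If nothing matches,
# it prints a short note instead.
check_blas() {
    grep -Ei 'blas|lapack|atlas' || echo "no BLAS/LAPACK/ATLAS libraries linked"
}

# Usage (run from the directory that holds the binary):
#   ldd nwchem | check_blas
```

Note that this only covers shared libraries. A statically linked BLAS would not appear in ldd output at all.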
Where else should I be looking? Should I be building a local BLAS, etc.?
-- michael