having some problems with 6.1.1 and qlogic openmpi


Clicked A Few Times
good day all,

we received a new cluster that is based on qlogic infiniband. i've spent a couple of days fiddling around with the build and i'm still having issues. the system is CentOS 6.2 based, gcc 4.4.6, qlogic fabric, QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02). i built openmpi this way:

./configure --prefix=/shared/openmpi-1.6.1/gcc --enable-static --without-tm --with-openib=/usr --with-psm=/usr CC=gcc CXX=g++ F77=gfortran FC=gfortran --enable-mpi-thread-multiple

my nwchem environment:

export NWCHEM_TOP=/shared/build/nwchem-6.1.1-src
export NWCHEM_MODULES="all"
export INSTALL_PREFIX=/shared/nwchem-6.1.1
export CC=gcc
export FC=gfortran
export MPI_INCLUDE=/shared/openmpi-1.6.1/gcc/include
export MPI_LIB=/shared/openmpi-1.6.1/gcc/lib
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export ARMCI_NETWORK=MPI-MT
export TARGET=LINUX64
export LARGE_FILES=TRUE
export NWCHEM_TARGET=LINUX64
export IB_HOME=/usr
export IB_INCLUDE=/usr/include
export IB_LIB=/usr/lib64
export IB_LIB_NAME="-libverbs -libumad -lpthread"
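
A build environment like the one above depends on several directory-valued variables, and a mistyped path can surface much later as a confusing link or run-time failure. The following is a minimal sketch of a sanity check, assuming a POSIX shell; the helper name `check_dirs` is hypothetical and not part of NWChem or Open MPI.

```shell
# Hypothetical helper: report which of the named environment variables
# do not point at an existing directory. Returns non-zero if any fail.
check_dirs() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -d "$value" ]; then
            echo "ok      $name=$value"
        else
            echo "MISSING $name=$value"
            status=1
        fi
    done
    return "$status"
}

# Variable names taken from the environment listed above.
check_dirs NWCHEM_TOP MPI_INCLUDE MPI_LIB IB_INCLUDE IB_LIB || true
```

Running it before `make` only confirms that the directories exist; it does not verify that the libraries inside them are the intended ones.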

nwchem builds and runs on two nodes, but fails after a couple of minutes with this error:

12:Segmentation Violation error, status=: 11
(rank:12 hostname:node033 pid:2797):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
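
When a multi-node run fails like this, it can help to pull the rank, host and pid out of every ARMCI failure line to see whether the crashes cluster on one node. This is a rough sketch that assumes the line format shown above; the function name `armci_failures` is made up for illustration.

```shell
# Hypothetical helper: read NWChem output on stdin and print one
# "rank=... host=... pid=..." line per ARMCI DASSERT failure.
armci_failures() {
    sed -n 's/^(rank:\([0-9]*\) hostname:\([^ ]*\) pid:\([0-9]*\)):ARMCI DASSERT fail.*/rank=\1 host=\2 pid=\3/p'
}

# Typical use, with a saved output file (file name is an assumption):
#   armci_failures < nwchem.out | sort | uniq -c
```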

i looked at this thread,

http://nwchemgit.github.io/Special_AWCforum/st/id435/#post_1562

removed the rpm builds of blas and relinked, same error.

ldd on nwchem binary is:


ldd nwchem
linux-vdso.so.1 => (0x00007fff50351000)
libmpi_f90.so.1 => /shared/openmpi-1.6.1/gcc/lib/libmpi_f90.so.1 (0x00007f20e6c37000)
libmpi_f77.so.1 => /shared/openmpi-1.6.1/gcc/lib/libmpi_f77.so.1 (0x00007f20e6a02000)
libmpi.so.1 => /shared/openmpi-1.6.1/gcc/lib/libmpi.so.1 (0x00007f20e643f000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003e19200000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003e24e00000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003e23a00000)
libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00007f20e6139000)
libm.so.6 => /lib64/libm.so.6 (0x0000003e19e00000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003e25e00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003e19600000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003e19a00000)
libosmcomp.so.3 => /usr/lib64/libosmcomp.so.3 (0x00000033cd200000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00000033cda00000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f20e5f28000)
libpsm_infinipath.so.1 => /usr/lib64/libpsm_infinipath.so.1 (0x00007f20e5cd6000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003e1aa00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003e1a200000)
/lib64/ld-linux-x86-64.so.2 (0x0000003e18e00000)
libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00000033cd600000)
libinfinipath.so.4 => /usr/lib64/libinfinipath.so.4 (0x00007f20e5ac7000)
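
Only shared libraries show up in `ldd` output, so a BLAS or LAPACK that was linked statically will not appear in a listing like the one above. The sketch below summarises `ldd` output in two respects: libraries the loader cannot resolve, and any shared BLAS/LAPACK that would be picked up at run time. The function name `summarise_ldd` is hypothetical.

```shell
# Hypothetical helper: summarise `ldd` output read from stdin.
summarise_ldd() {
    awk '
        /not found/              { missing = missing "\n  " $1 }
        $1 ~ /^lib(blas|lapack)/ { shared = shared "\n  " $1 " -> " $3 }
        END {
            print "unresolved:" (missing == "" ? " none" : missing)
            print "shared blas/lapack:" (shared == "" ? " none" : shared)
        }'
}

# Typical use (binary path as given in this thread):
#   ldd /shared/nwchem-6.1.1/bin/nwchem | summarise_ldd
```

If the second section reports "none", the BLAS in use came from whatever static library was on the link line, which the link step output would show.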

where else should i be looking? should i be building a local blas/etc?

-- michael

Forum Vet
Michael
Could you please post (or put on a website) the following

1) Full output file

2) Full link step (that is, the result of "cd $NWCHEM_TOP/src; make FC=gfortran link")

Clicked A Few Times
output file is here

http://pastebin.com/1WGsJ1RT

and the output from the relink:


[root@cmbcluster src]# make FC=gfortran link
make nwchem.o stubs.o
make[1]: warning: -jN forced in submake: disabling jobserver mode.
gfortran -fdefault-integer-8 -Wextra -Wuninitialized -g -O -I. -I/shared/build/nwchem-6.1.1-src/src/include -I/shared/build/nwchem-6.1.1-src/src/tools/install/include -DEXT_INT -DLINUX -DLINUX64 -DGFORTRAN -DCHKUNDFLW -DGCC4 -DGCC46 -DPARALLEL_DIAG -DCOMPILATION_DATE="'`date +%a_%b_%d_%H:%M:%S_%Y`'" -DCOMPILATION_DIR="'/shared/build/nwchem-6.1.1-src'" -DNWCHEM_BRANCH="'6.1.1'" -c -o nwchem.o nwchem.F
gfortran -fdefault-integer-8 -Wextra -Wuninitialized -g -O -I. -I/shared/build/nwchem-6.1.1-src/src/include -I/shared/build/nwchem-6.1.1-src/src/tools/install/include -DEXT_INT -DLINUX -DLINUX64 -DGFORTRAN -DCHKUNDFLW -DGCC4 -DGCC46 -DPARALLEL_DIAG -DCOMPILATION_DATE="'`date +%a_%b_%d_%H:%M:%S_%Y`'" -DCOMPILATION_DIR="'/shared/build/nwchem-6.1.1-src'" -DNWCHEM_BRANCH="'6.1.1'" -c -o stubs.o stubs.F
gfortran -L/shared/build/nwchem-6.1.1-src/lib/LINUX64 -L/shared/build/nwchem-6.1.1-src/src/tools/install/lib -o /shared/build/nwchem-6.1.1-src/bin/LINUX64/nwchem nwchem.o stubs.o -lnwctask -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lcons -lperfm -ldntmc -lccca -lnwcutil -lga -lpeigs -lperfm -lcons -lbq -lnwcutil -llapack -lblas -L/shared/openmpi-1.6.1/gcc/lib -lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil
/usr/bin/ld: Warning: alignment 16 of symbol `cface_' in /shared/build/nwchem-6.1.1-src/lib/LINUX64/libstepper.a(stpr_face.o) is smaller than 32 in /shared/build/nwchem-6.1.1-src/lib/LINUX64/libstepper.a(stpr_partit.o)

thanks for having a look! :-)

Forum Vet
Michael,
Everything looks good so far.
Before moving to a more detailed analysis,
I would like to know if the simple $NWCHEM_TOP/src/nwchem.nw test works fine using more than one node.

Cheers, Edo

Clicked A Few Times
same:

http://pastebin.com/sMAKgWZH

process left 12 processes on 2 nodes running.

Forum Vet
Does NWChem run on a single core
Michael,
Does NWChem run on a single core?

Clicked A Few Times
you mean non-mpi? single thread on a node correct?

this runs to completion:

/shared/openmpi-1.6.1/gcc/bin/mpirun -n 1 /shared/nwchem-6.1.1/bin/nwchem /home/mgx/testing/nwchem.nw

http://pastebin.com/nst5ga8n

Forum Vet
What about multiple processes on the same node? E.g.

mpirun -np 2

Clicked A Few Times
yup, fine, up to the number of cores (12).


/shared/openmpi-1.6.1/gcc/bin/mpirun -n 12 /shared/nwchem-6.1.1/bin/nwchem /home/mgx/testing/nwchem.nw


completes fine
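
The narrowing-down so far follows a simple escalation: one process, then a full node, then two nodes. A sketch of that sequence as a dry run is below. It prints the commands instead of executing them. The binary and input paths come from this thread; the hostfile path and the function name `scaling_steps` are assumptions, and the two-node count assumes both nodes have the same number of cores.

```shell
MPIRUN=/shared/openmpi-1.6.1/gcc/bin/mpirun
NWCHEM=/shared/nwchem-6.1.1/bin/nwchem
INPUT=/home/mgx/testing/nwchem.nw
HOSTFILE=./hosts   # hypothetical hostfile listing the two nodes

# Print the three test commands for a node with the given core count.
scaling_steps() {
    cores=$1
    echo "$MPIRUN -n 1 $NWCHEM $INPUT"
    echo "$MPIRUN -n $cores $NWCHEM $INPUT"
    echo "$MPIRUN -n $((cores * 2)) --hostfile $HOSTFILE $NWCHEM $INPUT"
}

scaling_steps 12
```

The first step that fails marks the boundary: here the single-node runs pass, so the problem only appears once traffic crosses between nodes.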

Forum Vet
ompi_info
Michael,
1) Did you check from the ompi_info output that OpenMPI was correctly built for multi-threading?
"ompi_info | grep Thread" should show "Thread support: posix (mpi: yes, progress: no)"

2) Let's check if GA/ARMCI built correctly. Could you please post the following
a) $NWCHEM_TOP/src/tools/build/config.log
b) $NWCHEM_TOP/src/tools/build/armci/config.log

Clicked A Few Times
yup, here:

1:)
[mgx@cmbcluster testing]$ ompi_info | grep -i Thread
         Thread support: posix (MPI_THREAD_MULTIPLE: yes, progress: no)
FT Checkpoint support: no (checkpoint thread: no)

2:)
a) http://pastebin.com/qwbLzJ32
b) http://pastebin.com/4A6UCuX8
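
The thread shows two spellings of the thread-support line: "mpi: yes" in the expected output and "MPI_THREAD_MULTIPLE: yes" in the actual output. A scripted check therefore has to accept both. The sketch below does that; the function name `has_thread_multiple` is hypothetical, and it only inspects text, so it says nothing about whether threading actually works at run time.

```shell
# Hypothetical helper: succeed if `ompi_info` output on stdin reports
# MPI thread support, in either of the two spellings seen in this thread.
has_thread_multiple() {
    grep -i 'Thread support:' | grep -Eq '(mpi|MPI_THREAD_MULTIPLE): yes'
}

# Typical use:
#   ompi_info | has_thread_multiple && echo "thread multiple: yes"
```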

Forum Vet
mpirun
Michael,
Let's try what happens if we have openmpi using the slower ethernet device. Please add the following options to mpirun

--mca btl tcp,self,sm  --mca btl_tcp_if_include eth0
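
For reference, the options above go before the program name on the mpirun command line. The sketch below assembles and prints such a command without running it. The interface name is passed in as an argument; the process count, hostfile path and the function name `tcp_only_cmd` are assumptions for illustration.

```shell
# Hypothetical helper: print an mpirun command line that restricts
# Open MPI to the tcp, self and sm BTLs on one ethernet interface.
tcp_only_cmd() {
    iface=$1
    shift
    printf '%s\n' "mpirun --mca btl tcp,self,sm --mca btl_tcp_if_include $iface $*"
}

tcp_only_cmd eth0 -n 24 --hostfile ./hosts \
    /shared/nwchem-6.1.1/bin/nwchem /home/mgx/testing/nwchem.nw
```

If the run still fails over TCP, the InfiniBand transport itself is less likely to be the cause.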

Clicked A Few Times
nope :-\

full output is here:

http://pastebin.com/d48Dxqc4

12:Segmentation Violation error, status=: 11
(rank:12 hostname:node033 pid:10675):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
Last System Error Message from Task 12:: No such file or directory
^Cmpirun: killing job...


my namd2 build runs fine with this mpi, fwiw

Clicked A Few Times
i rechecked my openmpi build and the examples hello_xxx and ring_xxx work as expected. do you think this is an mpi issue or a GA issue?

--- michael

Forum Vet
Michael,
I think it is most likely a GA issue with the current source code. The next thing we could do is to start to debug the code where the segv occurs, but I am not sure we will get much out of it.
There is a new GA implementation on top of MPI in the works and it might be easier to install on the QLogic hardware. My suggestion for you would be to wait for this new GA to be released.
A major problem of this effort to port NWChem to QLogic is that we do not have access to it and,
as you can see, trying to help remotely is not always straightforward.
Let me know how you want to proceed.
Cheers, Edo

Clicked A Few Times
that's fine, we can bide our time. are you still at ORNL? i could arrange access to the qlogic cluster, if that would help.

we appreciate your effort in helping!


michael

