At least I can *build* in a similar setup: I built NWChem 6.1.1 on our IBM Power 755 Linux (SUSE) system.
I can run a serial job, but parallel jobs launched via POE crash. My build may well be flawed, but I figured it is still worth sharing (and I am asking for help from my side too).
The environment variables I have are:
export NWCHEM_TOP=/hpc/home/seb56/pkg/nwchem-6.1.1-p7linux/
export NWCHEM_BASIS_LIBRARY=/usr/local/pkg/nwchem/nwchem-6.1.1/data/libraries/
export MP_HOSTFILE=/hpc/home/seb56/host.list
export CC="mpcc"
export F77="mpfort"
export MPICC="mpcc"
export MPIF77="mpfort"
export MPIEXEC=poe
export ARMCI_NETWORK=OPENIB
export IB_INCLUDE=/usr/include/infiniband
export MSG_COMMS=MPI
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LARGE_FILES=TRUE
export MPI_LIB=/opt/ibmhpc/ppe.poe/lib64/
export MPI_INCLUDE='/opt/ibmhpc/ppe.poe/include/ibmmpi/ -I/opt/ibmhpc/ppe.poe/include/ibmmpi/thread64'
export LIBMPI="-L${MPI_LIB} -lmpi_ibm"
export NWCHEM_TARGET=LINUX64
export NWCHEM_TARGET_CPU=ppc64
export NWCHEM_MODULES=all
export CFLAGS='-qtune=pwr7 -qarch=pwr7 -q64 -qhot'
export FFLAGS=$CFLAGS
export CFLAGS_FORGA=$CFLAGS
export FFLAGS_FORGA=$CFLAGS
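Before starting the build, it may be worth confirming that the directories these variables point to actually exist on the build node. A minimal sketch; check_dirs is a helper name I made up for illustration, not part of NWChem:

```shell
# check_dirs: hypothetical helper (not part of NWChem).
# Prints every argument that is not an existing directory and
# returns non-zero if at least one is missing.
check_dirs() {
    missing=0
    for d in "$@"; do
        if [ ! -d "$d" ]; then
            echo "missing directory: $d"
            missing=1
        fi
    done
    return $missing
}

# Usage with the variables exported above:
#   check_dirs "$NWCHEM_TOP" "$MPI_LIB" "$IB_INCLUDE"
```

This only checks that the paths exist; it says nothing about whether the libraries inside them are the right ones.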
And I took the following steps.
0. make nwchem_config
1. Edit $NWCHEM_TOP/src/config/makefile.h such that
(line 1924) CC=xlc
(line 1940) FOPTIMIZE= -O2 -qstrict -qarch=auto -qtune=auto -qcache=auto -qfloat=fltint (-O3 hangs at some stage)
2. Open $NWCHEM_TOP/src/nwpw/nwpwlib/Parallel/Parallel-tcgmsg.F and replace /* ... */ with **
(line 825) * *determine psr - should be made w/o using tmp array! */
(line 964) * *determine psr - should be made w/o using tmp array! */
3. Back to $NWCHEM_TOP/src and enter "make FC=xlf CC=xlc"
Sometimes it will get stuck at:
checking for fork... yes
Open another terminal and run "ps -aux | grep conftest".
Kill the process listed as ./conftest (not the one listed as poe ./conftest) with "kill -KILL <pid>"; this wakes up the build process.
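Picking the right PID by eye is easy to get wrong, so the selection can be scripted. A rough sketch, assuming the stuck process shows up with the command exactly "./conftest"; the function name stuck_conftest_pids is hypothetical:

```shell
# stuck_conftest_pids: hypothetical helper.
# Reads "PID COMMAND ARGS..." lines on stdin and prints the PIDs whose
# command is exactly ./conftest, which skips the "poe ./conftest" wrapper.
stuck_conftest_pids() {
    awk '$2 == "./conftest" { print $1 }'
}

# Usage (look at the list first, then kill):
#   ps -eo pid,args | stuck_conftest_pids
#   ps -eo pid,args | stuck_conftest_pids | xargs -r kill -KILL
```

The -r option of xargs is a GNU extension that skips the kill when no PID is found.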
If it fails and complains about *.fh files, do the following:
cp $NWCHEM_TOP/src/util/*.fh $NWCHEM_TOP/src/include
4. Back to $NWCHEM_TOP/src and enter "make FC=xlf CC=xlc" to carry on the build.
5. Towards the end of the build process, it builds the "nwchem" executable with:
xlf -q64 -qextname -qfixed -NQ40000 -NT80000 -qmaxmem=8192 -qxlf77=leadzero -qintsize=8 -O2 -g -L/hpc/home/seb56/pkg/nwchem-6.1.1-p7linux//lib/LINUX64_ppc64 -L/hpc/home/seb56/pkg/nwchem-6.1.1-p7linux//src/tools/install/lib -o /hpc/home/seb56/pkg/nwchem-6.1.1-p7linux//bin/LINUX64_ppc64/nwchem nwchem.o stubs.o -lnwctask -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lcons -lperfm -ldntmc -lccca -lnwcutil -lga -lpeigs -lperfm -lcons -lbq -lnwcutil -llapack -lblas -L/opt/ibmhpc/ppe.poe/lib64/ -libverbs
This fails because the link references MPI calls. I replaced "xlf" in the line above with "mpfort" and re-ran the command:
mpfort -q64 -qextname -qfixed -NQ40000 -NT80000 -qmaxmem=8192 -qxlf77=leadzero -qintsize=8 -O2 -g -L/hpc/home/seb56/pkg/nwchem-6.1.1-p7linux//lib/LINUX64_ppc64 -L/hpc/home/seb56/pkg/nwchem-6.1.1-p7linux//src/tools/install/lib -o /hpc/home/seb56/pkg/nwchem-6.1.1-p7linux//bin/LINUX64_ppc64/nwchem nwchem.o stubs.o -lnwctask -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lcons -lperfm -ldntmc -lccca -lnwcutil -lga -lpeigs -lperfm -lcons -lbq -lnwcutil -llapack -lblas -L/opt/ibmhpc/ppe.poe/lib64/ -libverbs
With mpfort it completed correctly.
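Retyping a link line of that length is error-prone. A small sketch of doing the substitution mechanically, assuming the failing link line was copied into a file (link.cmd is a name I picked for illustration); swap_linker is likewise a hypothetical helper:

```shell
# swap_linker: hypothetical helper.
# Reads a link command on stdin and replaces a leading "xlf" with "mpfort",
# leaving every other argument untouched.
swap_linker() {
    sed 's/^xlf /mpfort /'
}

# Usage, run from the directory where make issued the link step:
#   swap_linker < link.cmd > relink.sh && sh relink.sh
```

The pattern is anchored at the start of the line so that "xlf" appearing later in the command is not rewritten.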
Now, I have nwchem built in
$NWCHEM_TOP/bin/LINUX64_ppc64/nwchem
I followed the General site installation. The output of ldd indicates that the executable is correctly linked against libmpi_ibm, libpoe, libibverbs, etc.
$ldd /usr/local/bin/nwchem
linux-vdso64.so.1 => (0x0000040000040000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00000400000a0000)
libmpi_ibm.so => /usr/lib64/libmpi_ibm.so (0x00000400000d0000)
libpoe.so => /usr/lib64/libpoe.so (0x0000040000380000)
liblapi.so => /usr/lib64/liblapi.so (0x00000400003e0000)
libxlf90_r.so.1 => /opt/ibmcmp/lib64/libxlf90_r.so.1 (0x0000040000620000)
libxlomp_ser.so.1 => /opt/ibmcmp/lib64/libxlomp_ser.so.1 (0x0000040000d50000)
libxlfmath.so.1 => /opt/ibmcmp/lib64/libxlfmath.so.1 (0x0000040000d70000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000040000d90000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000040000dc0000)
librt.so.1 => /lib64/power7/librt.so.1 (0x0000040000de0000)
libpthread.so.0 => /lib64/power7/libpthread.so.0 (0x0000040000e00000)
libm.so.6 => /lib64/power7/libm.so.6 (0x0000040000e40000)
libc.so.6 => /lib64/power7/libc.so.6 (0x0000040000f10000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000400010f0000)
/lib64/ld64.so.1 (0x0000040000000000)
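The same check can be repeated on each compute node without reading the whole listing. A minimal sketch; missing_libs is a hypothetical helper and the library names passed to it are just the ones visible in the output above:

```shell
# missing_libs: hypothetical helper.
# Reads ldd output on stdin and prints each library name given as an argument
# that is absent from the output or that ldd reports as "not found".
missing_libs() {
    ldd_out=$(cat)
    for lib in "$@"; do
        line=$(printf '%s\n' "$ldd_out" | grep -F "$lib" || true)
        case "$line" in
            ""|*"not found"*) echo "$lib" ;;
        esac
    done
}

# Usage:
#   ldd /usr/local/bin/nwchem | missing_libs libmpi_ibm libpoe liblapi libibverbs
```

No output means every listed library was resolved on that node.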
If I run a test script
$poe nwchem test.nw
This works as expected and computes the test correctly.
If I run 2 processes
$poe nwchem test.nw -procs 2
It instantly fails as shown below.
$poe nwchem test.nw -procs 2
argument 1 = test.nw
0:Segmentation Violation error, status=: 11
(rank:0 hostname:p1n12-c pid:116227):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
nwchem[0x12472618]
nwchem[0x124c83e8]
[0x40000040418]
nwchem[0x124861c4]
nwchem[0x1252ec6c]
nwchem[0x124c4c18]
nwchem[0x12481174]
nwchem[0x12481f88]
nwchem[0x1247f2f0]
nwchem[0x124a27f0]
nwchem[0x124735bc]
nwchem[0x1247f230]
nwchem[0x124f9de8]
nwchem[0x123a0f00]
nwchem[0x10008790]
/lib64/power7/libc.so.6(+0x4f05c)[0x40000f5f05c]
/lib64/power7/libc.so.6(__libc_start_main-0x16ea7c)[0x40000f5f27c]
Last System Error Message from Task 0:: No such file or directory
ERROR: 0031-250 task 0: Terminated
(a while later...)
ERROR: 0032-171 Communication subsystem error: 2660-413 Communication timeout has occurred. in MPI_Recv, task 1
ERROR: 0032-171 Communication subsystem error: 2660-413 Communication timeout has occurred. in routine unknown, task 1
The backtrace lines of the form nwchem[0x.......] were enabled by editing /src/tools/ga-5-1/armci/src/common/armci.c
(line 29) #define PRINT_BT
(line 986)
#if defined(PRINT_BT)
void *bt[100];
backtrace_symbols_fd(bt, backtrace(bt, 100), 2);
#endif
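Since the raw addresses are hard to read, they could be mapped back to function names with addr2line from binutils, provided the binary still carries symbol or debug information (the link line above does use -g). A sketch under those assumptions; bt_addresses is a hypothetical helper and crash.log is a file name I made up:

```shell
# bt_addresses: hypothetical helper.
# Reads a crash log on stdin and prints the hexadecimal addresses from
# frames of the form "nwchem[0x...]", ignoring libc and anonymous frames.
bt_addresses() {
    sed -n 's/^nwchem\[\(0x[0-9a-fA-F]*\)\]$/\1/p'
}

# Usage, assuming the output above was saved as crash.log:
#   bt_addresses < crash.log | xargs addr2line -f -e /usr/local/bin/nwchem
```

The -f option makes addr2line print the function name along with the file and line for each address.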
Can anyone help please?
Thanks
Sung
--
Sung Eun Bae, Ph.D
Supercomputing Services and Support Consultant
BlueFern
University of Canterbury
Private Bag 4800
Christchurch 8140
New Zealand
http://www.bluefern.canterbury.ac.nz
Tel: +64 3 364 2987 ext 43070
Mobile: +64 21 238 1420
Fax: +64 3 364 3002