Hi,
I recently tried upgrading to NWChem 6.6 on a BG/Q system and comparing performance across various implementations of ARMCI, MPICH and BLAS.
There are no major problems when a 64-bit-integer BLAS is used, whether with the standard MPI-TS port on MPICH2, the unsupported MPI-TS port on MPICH3, or even ARMCI-MPI3 on MPICH3.
However, I ran into several problems when building against ESSL (which requires 64_to_32), none of which I had seen with previous versions on the same machine.
(1) Some header files included by src/util/util_getppn.c are not in /bgsys/drivers/ppcfloor/comm/xl/include, e.g.
#if defined(__bgq__)
#include <process.h>
#include <location.h>
#include <personality.h>
...
so MPI_INCLUDE has to be modified as follows to pick up these header files:
================================================
export MPI_INCLUDE="-I/bgsys/drivers/ppcfloor \
-I/bgsys/drivers/ppcfloor/firmware/include \
-I/bgsys/drivers/ppcfloor/spi/include \
-I/bgsys/drivers/ppcfloor/spi/include/kernel \
-I/bgsys/drivers/ppcfloor/spi/include/kernel/cnk \
-I/bgsys/drivers/ppcfloor/comm/xl/include "
================================================
(2) In src/util/util_getppn.c, the block starting at line 35 assigns *ppn_out on BG/Q, but after preprocessing for BG/Q there is an extra "}":
================================================
#if defined(__bgq__)
*ppn_out = Kernel_ProcessCount();
#elif MPI_VERSION >= 3
...
#endif
errlab:
GA_Error(" util_getppn failure", 0);
return;
}
}
================================================
(3) Following the standard MPI-TS procedure, I ran make 64_to_32 and set USE_64TO32=yes, but then got an MA error:
================================================
MA fatal error: MA_sizeof: invalid datatype: 4350801871857
================================================
I noticed that this error seems to be caused by some of the Fortran code in GA (version 5.4) being built with -qintsize=8. So I edited src/tools/GNUmakefile to pass --enable-i4 explicitly when configuring GA. With that change the program starts, but it crashes right after detecting the SYMMETRY directive:
================================================
NWChem Input Module
-------------------
Scaling coordinates for geometry "geometry" by 1.889725989
(inverse scale = 0.529177249)
Turning off AUTOSYM since
SYMMETRY directive was detected!
2016-03-09 15:51:34.170 (WARN ) [0xfff78848b10] 80299:ibm.runjob.client.Job: terminated by signal 11
2016-03-09 15:51:34.171 (WARN ) [0xfff78848b10] 80299:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 0
================================================
I have no clue what is causing this crash. Can anyone suggest how to fix it? Thanks!
These are the settings I used to build with MPI-TS:
================================================
export NWCHEM_TOP=/scratch/home/chiensh/nwchem/nwchem-6.6
export NWCHEM_TARGET=BGQ
export ARMCI_NETWORK=MPI-TS
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export USE_XLF=y
export USE_I4FLAGS=y
export USE_64TO32=y
export FFLAG_INT="-qintsize=4"
export DISABLE_GAMIRROR=y
export NWCHEM_MODULES="all "
# export MRCC_METHODS=TRUE
# export NWCHEM_MODULES=smallqm
export MPI_INCLUDE="-I/bgsys/drivers/ppcfloor \
-I/bgsys/drivers/ppcfloor/firmware/include \
-I/bgsys/drivers/ppcfloor/spi/include \
-I/bgsys/drivers/ppcfloor/spi/include/kernel \
-I/bgsys/drivers/ppcfloor/spi/include/kernel/cnk \
-I/bgsys/drivers/ppcfloor/comm/xl/include "
export BLAS_SIZE=4
export BLASOPT=" /bgsys/ibm_essl/prod/opt/ibmmath/essl/5.1/lib64/libesslbg.a \
/apps/libs/lapack/essl/lapack_BGQ_XL_ESSL_INT4_.a \
/apps/libs/blas/xl/blas_BGQ_XL_INT4_.a -Wl,-zmuldefs"
export BLAS_LIB=" /bgsys/ibm_essl/prod/opt/ibmmath/essl/5.1/lib64/libesslbg.a \
/apps/libs/lapack/essl/lapack_BGQ_XL_ESSL_INT4_.a \
/apps/libs/blas/xl/blas_BGQ_XL_INT4_.a -zmuldefs "
================================================
Regards,
Dominic Chien