NWChem with GotoBLAS2 but poor performance


Hi,

The cluster I'm working on is a 720-core/1440-GB 64-bit Xeon cluster from Dell (60 nodes, each with two six-core CPUs and 24 GB of RAM), connected by an InfiniBand fabric, so each node has 12 cores.

I managed to compile and install NWChem 6.3 successfully with GotoBLAS2 and ScaLAPACK (using MVAPICH2). However, I have run into two major problems.

1) I've run a test case for a simple NEB calculation. With the internal BLAS and LAPACK libraries, compiled as in [1], the job finishes in roughly 2 minutes or less on 24 cores. With GotoBLAS2 and ScaLAPACK, however, it takes much longer (more than 5 minutes); see the threading note after the error output below.

2) With GotoBLAS2 and ScaLAPACK, I can only run on up to ~20 cores (not even 2 full nodes); if I use more cores I get the following message in the error file (a sketch of such a launch follows the output):

[proxy:0:1@compute-0-12.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:1@compute-0-12.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@compute-0-12.local] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@compute-0-55.local] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec@compute-0-55.local] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@compute-0-55.local] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec@compute-0-55.local] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
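
To be concrete about what "2 full nodes" would mean here, a launch along the following lines is the kind that fails for me. This is only a sketch: -ppn is the ranks-per-node option of Hydra's mpiexec in MVAPICH2 1.7, and the input file name is just a placeholder.

mpiexec -np 24 -ppn 12 \
    /home/tpirojsi/nwchem-mvapich2-6.3/bin/LINUX64/nwchem neb.nw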

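Regarding problem 1: as far as I understand, GotoBLAS2 can also spawn its own threads inside each MPI rank, so with 12 ranks per node the cores might end up oversubscribed. Below is a minimal sketch of forcing one BLAS thread per rank in the job script (GOTO_NUM_THREADS is GotoBLAS2's own variable, OMP_NUM_THREADS the generic one); I have not confirmed that this is what is happening.

export GOTO_NUM_THREADS=1
export OMP_NUM_THREADS=1
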

The CPUs of this cluster are Intel(R) Xeon(R) X5650 @ 2.67 GHz. I did some research on compiling GotoBLAS2 and, according to [2], believed that libgoto2.a and libgoto2_nehalemp-r1.13.a were the right libraries to build. I linked these two libraries when compiling ScaLAPACK as well, to obtain libscalapack.a (see the SLmake.inc sketch after the build configuration below). After that, I compiled NWChem 6.3 with the following configuration:

export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/home/tpirojsi/nwchem-mvapich2-6.3
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=OPENIB
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include/infiniband
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-lrdmacm -libumad -libverbs -lpthread -lrt"
export MSG_COMMS=MPI
export CC=gcc
export FC=gfortran
export NWCHEM_MODULES="all"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/mpi/gcc/mvapich2-1.7
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpich -lopa -lmpl -lrt -lpthread"
export BLASOPT="-L/home/tpirojsi/GotoBLAS2-install-1.13 -lgoto2 -lgoto2_nehalemp-r1.13 -llapack"
export USE_SCALAPACK=y
export SCALAPACK="-L/home/tpirojsi/scalapack-newtest/lib -lscalapack"
export USE_64TO32=y
cd $NWCHEM_TOP/src
make realclean
make nwchem_config
make 64_to_32
make >& make.log &
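
For reference, the BLAS/LAPACK lines in ScaLAPACK's SLmake.inc looked roughly like this. This is a sketch from memory: the BLASLIB/LAPACKLIB/LIBS names follow SLmake.inc.example and may differ between ScaLAPACK versions.

BLASLIB   = -L/home/tpirojsi/GotoBLAS2-install-1.13 -lgoto2 -lgoto2_nehalemp-r1.13
LAPACKLIB = -llapack
LIBS      = $(LAPACKLIB) $(BLASLIB)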


The nwchem binary was built successfully, but I ran into the two problems described above.
Also, [3] suggests that "it may be necessary to take additional steps to propagate the LD_LIBRARY_PATH variable to each MPI process in the job". I wasn't sure how to do this, but I gave it a try by exporting the path to the GotoBLAS2 libraries in .bash_profile and in the SGE job script (bash), and the performance didn't seem to improve.
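
Concretely, the export in the SGE script looked roughly like the first line below; the mpiexec line is only a sketch of how the variable could instead be pushed to all remote ranks (-genvall exports the whole environment and is a Hydra mpiexec option in MVAPICH2 1.7; $NSLOTS is SGE's slot count and the input name is a placeholder).

export LD_LIBRARY_PATH=/home/tpirojsi/GotoBLAS2-install-1.13:$LD_LIBRARY_PATH
mpiexec -genvall -np $NSLOTS \
    /home/tpirojsi/nwchem-mvapich2-6.3/bin/LINUX64/nwchem neb.nw > neb.out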

Any ideas on how to get around these issues, please?

Tee