NWChem with GotoBLAS2 but poor performance


Clicked A Few Times
Hi,

The cluster I'm working on is a 720-core/1440-GB 64-bit Xeon cluster from Dell (60 nodes, each with two six-core CPUs and 24 GB of memory), with an InfiniBand network fabric. So each node provides 12 cores.

I managed to successfully compile and install NWChem 6.3 with GotoBLAS2 and ScaLAPACK included (using MVAPICH2). However, I have run into 2 major problems.

1) I've run a test case for a simple NEB calculation. With the internal BLAS and LAPACK libraries as compiled in [1], the job finishes within approximately 2 minutes or less (with 24 CPUs). With GotoBLAS2 and ScaLAPACK, however, it takes much longer (more than 5 minutes).

2) With GotoBLAS2 and ScaLAPACK, I can run on up to ~20 CPUs (less than 2 full nodes), and I get the following message in the error file if I use more CPUs:

[proxy:0:1@compute-0-12.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:1@compute-0-12.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@compute-0-12.local] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@compute-0-55.local] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec@compute-0-55.local] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@compute-0-55.local] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec@compute-0-55.local] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion


The CPUs of this cluster are Intel(R) Xeon(R) X5650 @ 2.67GHz. I have done some research on compiling GotoBLAS2 and, according to [2], believed that libgoto2.a and libgoto2_nehalemp-r1.13.a were the right libraries to build. I linked these two libraries when compiling ScaLAPACK as well, to obtain scalapack.a (a sketch of the ScaLAPACK link settings appears after the NWChem configuration below). After that, to compile NWChem 6.3 I used the following configuration:

export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/home/tpirojsi/nwchem-mvapich2-6.3
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=OPENIB
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include/infiniband
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-lrdmacm -libumad -libverbs -lpthread -lrt"
export MSG_COMMS=MPI
export CC=gcc
export FC=gfortran
export NWCHEM_MODULES="all"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/mpi/gcc/mvapich2-1.7
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpich -lopa -lmpl -lrt -lpthread"
export BLASOPT="-L/home/tpirojsi/GotoBLAS2-install-1.13 -lgoto2 -lgoto2_nehalemp-r1.13 -llapack"
export USE_SCALAPACK=y
export SCALAPACK="-L/home/tpirojsi/scalapack-newtest/lib -lscalapack"
export USE_64TO32=y
cd $NWCHEM_TOP/src
make realclean
make nwchem_config
make 64_to_32
make >& make.log &
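
For reference, the GotoBLAS2 linking on the ScaLAPACK side looked roughly like the sketch below. This assumes the usual SLmake.inc-based ScaLAPACK build; variable names differ between ScaLAPACK versions, so take it as an illustration rather than the exact file I used:

# Sketch of the relevant lines in ScaLAPACK's SLmake.inc
BLASLIB   = -L/home/tpirojsi/GotoBLAS2-install-1.13 -lgoto2 -lgoto2_nehalemp-r1.13
LAPACKLIB = -llapack
LIBS      = $(LAPACKLIB) $(BLASLIB)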


The nwchem binary was built successfully, but I have experienced the 2 problems described above.
Also, there is a suggestion in [3] that "it may be necessary to take additional steps to propagate the LD_LIBRARY_PATH variable to each MPI process in the job". I wasn't sure how to do this, but I gave it a try by adding the path to the GotoBLAS2 libraries in .bash_profile and in the SGE-based submit script using the 'export' command (in bash), but the performance didn't seem to improve.
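
What I tried looked roughly like the lines below in the SGE submit script. The -genv option is what I understand Hydra's mpiexec accepts for passing an environment variable to every MPI rank, and the process count and input file name are just placeholders, so treat this as a sketch rather than a verified recipe:

# In the SGE submit script (sketch)
export LD_LIBRARY_PATH=/home/tpirojsi/GotoBLAS2-install-1.13:$LD_LIBRARY_PATH
mpiexec -genv LD_LIBRARY_PATH "$LD_LIBRARY_PATH" -np 24 /home/tpirojsi/nwchem-mvapich2-6.3/bin/LINUX64/nwchem input.nw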

Any ideas on how to get around these issues, please?

Tee

Forum Vet
set number of threads to 1
Have you set the env. variables

OMP_NUM_THREADS

and

GOTO_NUM_THREADS

to 1?
http://www.tacc.utexas.edu/tacc-projects/gotoblas2/faq
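
For example, something along these lines in the job script before launching NWChem (the mpiexec line is only a placeholder for however you actually start the job):

export OMP_NUM_THREADS=1
export GOTO_NUM_THREADS=1
mpiexec -np 24 nwchem input.nw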

Clicked A Few Times
Quote:Edoapra Oct 8th 4:48 pm
Have you set the env. variables

OMP_NUM_THREADS

and

GOTO_NUM_THREADS

to 1?
http://www.tacc.utexas.edu/tacc-projects/gotoblas2/faq



Edo, I didn't do anything about these 2 env variables. Do they need to be set in the submit script, in the compilation step, or in .bash_profile? Also, the number of threads to be set needs to be the same as the number of cores used to run a job, right?

Thank you,
Tee

Clicked A Few Times
Quote:Tpirojsi Oct 8th 6:04 pm
Quote:Edoapra Oct 8th 4:48 pm
Have you set the env. variables

OMP_NUM_THREADS

and

GOTO_NUM_THREADS

to 1?
http://www.tacc.utexas.edu/tacc-projects/gotoblas2/faq



Edo, I didn't do anything about these 2 env variables. Do they need to be set in the submit script, in the compilation step, or in .bash_profile? Also, the number of threads to be set needs to be the same as the number of cores used to run a job, right?

Thank you,
Tee


Edo, never mind the questions above. I set one of those environment variables (GOTO_NUM_THREADS) to 1 in the submit script and tried running a job. It seems to work perfectly now.

However, the time to run a test case with 12 and 24 CPUs (or up to 36 CPUs) is approximately the same. This might be because of the small size of the problem?

Forum Vet
Quote:Tpirojsi Oct 8th 12:43 pm
The time for 12 and 24 cpus (or up to 36 cpus) is approximately the same. This might be because of the small size of the problem?


Yes, indeed

