NWChem with GotoBLAS2 but poor performance


Clicked A Few Times
Hi,

The cluster I'm working on is a 720-core/1440-GB 64-bit Xeon cluster from Dell (60 nodes, each with two six-core CPUs and 24 GB of memory), with an InfiniBand network fabric. So each node provides 12 cores.

I managed to successfully compile and install NWChem 6.3 with GotoBLAS2 and ScaLAPACK included (using MVAPICH2). However, I have run into 2 major problems.

1) I've run a test case for a simple NEB calculation. With the internal BLAS and LAPACK libraries as compiled in [1], the job finishes within approximately 2 minutes or less (with 24 CPUs). With GotoBLAS2 and ScaLAPACK, however, it takes much longer (more than 5 minutes).

2) With GotoBLAS2 and ScaLAPACK, I can run on up to ~20 CPUs (less than 2 full nodes), and I get the following message in the error file if I use more CPUs:

[proxy:0:1@compute-0-12.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:1@compute-0-12.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@compute-0-12.local] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@compute-0-55.local] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec@compute-0-55.local] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@compute-0-55.local] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec@compute-0-55.local] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion


The CPUs of this cluster are Intel(R) Xeon(R) X5650 @ 2.67GHz. I have done some research on compiling GotoBLAS2 and, according to [2], believed that libgoto2.a and libgoto2_nehalemp-r1.13.a were the right libraries to build. I linked these two libraries when compiling ScaLAPACK as well, to obtain scalapack.a (a sketch of the ScaLAPACK link settings appears after the NWChem configuration below). After that, to compile NWChem 6.3 I used the following configuration:

export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/home/tpirojsi/nwchem-mvapich2-6.3
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=OPENIB
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include/infiniband
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-lrdmacm -libumad -libverbs -lpthread -lrt"
export MSG_COMMS=MPI
export CC=gcc
export FC=gfortran
export NWCHEM_MODULES="all"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/mpi/gcc/mvapich2-1.7
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpich -lopa -lmpl -lrt -lpthread"
export BLASOPT="-L/home/tpirojsi/GotoBLAS2-install-1.13 -lgoto2 -lgoto2_nehalemp-r1.13 -llapack"
export USE_SCALAPACK=y
export SCALAPACK="-L/home/tpirojsi/scalapack-newtest/lib -lscalapack"
export USE_64TO32=y
cd $NWCHEM_TOP/src
make realclean
make nwchem_config
make 64_to_32
make >& make.log &
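
For reference, the GotoBLAS2 linking on the ScaLAPACK side looked roughly like the sketch below. This assumes the usual SLmake.inc-based ScaLAPACK build; variable names differ between ScaLAPACK versions, so take it as an illustration rather than the exact file I used:

# Sketch of the relevant lines in ScaLAPACK's SLmake.inc
BLASLIB   = -L/home/tpirojsi/GotoBLAS2-install-1.13 -lgoto2 -lgoto2_nehalemp-r1.13
LAPACKLIB = -llapack
LIBS      = $(LAPACKLIB) $(BLASLIB)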


The nwchem binary was built successfully, but I have experienced the 2 problems described above.
Also, there is a suggestion in [3] that "it may be necessary to take additional steps to propagate the LD_LIBRARY_PATH variable to each MPI process in the job". I wasn't sure how to do this, but I gave it a try by adding the path to the GotoBLAS2 libraries in .bash_profile and in the SGE-based submit script using the 'export' command (in bash), but the performance didn't seem to improve.
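
What I tried looked roughly like the lines below in the SGE submit script. The -genv option is what I understand Hydra's mpiexec accepts for passing an environment variable to every MPI rank, and the process count and input file name are just placeholders, so treat this as a sketch rather than a verified recipe:

# In the SGE submit script (sketch)
export LD_LIBRARY_PATH=/home/tpirojsi/GotoBLAS2-install-1.13:$LD_LIBRARY_PATH
mpiexec -genv LD_LIBRARY_PATH "$LD_LIBRARY_PATH" -np 24 /home/tpirojsi/nwchem-mvapich2-6.3/bin/LINUX64/nwchem input.nw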

Any ideas on how to get around these issues, please?

Tee

Forum Vet
set number of threads to 1
Have you set the env. variables

OMP_NUM_THREADS

and

GOTO_NUM_THREADS

to 1?
http://www.tacc.utexas.edu/tacc-projects/gotoblas2/faq
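
For example, something along these lines in the job script before launching NWChem (the mpiexec line is only a placeholder for however you actually start the job):

export OMP_NUM_THREADS=1
export GOTO_NUM_THREADS=1
mpiexec -np 24 nwchem input.nw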

Clicked A Few Times
Quote:Edoapra Oct 8th 4:48 pm
Have you set the env. variables

OMP_NUM_THREADS

and

GOTO_NUM_THREADS

to 1?
http://www.tacc.utexas.edu/tacc-projects/gotoblas2/faq



Edo, I didn't do anything about these 2 env variables. Do they need to be set in the submit script, in the compilation step, or in .bash_profile? Also, the number of threads to be set needs to be the same as the number of cores used to run a job, right?

Thank you,
Tee

Clicked A Few Times
Quote:Tpirojsi Oct 8th 6:04 pm
Quote:Edoapra Oct 8th 4:48 pm
Have you set the env. variables

OMP_NUM_THREADS

and

GOTO_NUM_THREADS

to 1?
http://www.tacc.utexas.edu/tacc-projects/gotoblas2/faq



Edo, I didn't do anything about these 2 env variables. Do they need to be set in the submit script, in the compilation step, or in .bash_profile? Also, the number of threads to be set needs to be the same as the number of cores used to run a job, right?

Thank you,
Tee


Edo, never mind the questions above. I set one of those environment variables (GOTO_NUM_THREADS) to 1 in the submit script and tried running a job. It seems to work perfectly now.

However, the time to run a test case with 12 and 24 CPUs (or up to 36 CPUs) is approximately the same. This might be because of the small size of the problem?

Forum Vet
Quote:Tpirojsi Oct 8th 12:43 pm
The time for 12 and 24 cpus (or up to 36 cpus) is approximately the same. This might be because of the small size of the problem?


Yes, indeed

