Compiling NWChem 6.3 on a contemporary HPC system with Xeon Phi


Clicked A Few Times
Hello all,

This is work in progress, and I would very much appreciate any information that can speed it up.

I am trying to compile NWChem so that it can benefit from the newer features of recent Xeons (e.g. fused multiply-add, which can improve performance by ~40% in some cases) and from automatic offloading of Intel MKL calls to the Phi accelerators.

Requirements:
  • A recent Intel MKL + compiler suite
  • MPI libraries (for ScaLAPACK)


What I have done so far:
  • The environment variables SCALAPACK_LIB, LAPACK_LIB, BLAS_LIB (but not OPTBLAS), FOPTIONS, FFLAGS, FFLAGS_FORGA, COPTIONS, CXXOPTIONS, CXXFLAGS, CFLAGS, CFLAGS_FORGA, CXXFLAGS_FORGA, FOPTIMIZE and COPTIMIZE, which are used in numerous makefiles, have to be set according to the Intel link line advisor for your setup.
  • The environment variable USE_OPENMP has to be set, and CC, CXX, FC, F77, MPIF77, MPI_CXX, MPI_CC and MPI_F90 (again, various makefiles have different ideas of what is what) have to be set to the corresponding MPI wrappers.
  • The LINUX64 target in src/config/makefile.h has so many workarounds for so many (long deprecated) systems that I ended up adding a new target (INTEL-CLUSTER) with appropriate settings for the system at hand. (Don't forget to edit src/peigs/DEFS so that INTEL-CLUSTER passes the correct flag, -DLINUX64.)
  • I handle the underscore problem by passing -assume nounderscore to the Fortran compiler and removing the underscores from the Global Arrays macros by hand. I would really appreciate a better solution for this.
  • The rest is handled well enough by contrib/distro-tools/build_nwchem. NWChem does not like absolute paths longer than 65 characters, yet the top directory name cannot be anything shorter than nwchem-6.3. Since I cannot control my home directory path as a user, I had to resort to a dirty symlink trick to work around this (see the sketch after this list).
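
A minimal sketch of that symlink workaround, assuming a short, writable location is available (both paths below are only examples, not my actual directories):

# give the deeply nested source tree a short alias and build through it
ln -s /home/very/long/path/to/nwchem-6.3 /scratch/obm/nwchem-6.3
setenv NWCHEM_TOP /scratch/obm/nwchem-6.3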


Best,
Baris
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Dr. O. Baris Malcioglu,
University of Liege,
Bât. B5 Physique de la matière condensée
allée du 6 Août 17
4000 Liège 1
Belgique

Clicked A Few Times
Update:
I have learned the hard way that handling the underscore issues in NWChem is a full-time job and is better avoided. At the moment I have a set of "compatibility mode" settings that compiles. I am currently running tests to see whether the result is sensible.

A very interesting point is that with my previous mpif90-based settings the Fortran wrappers (i.e. the symbols ending in _, generated by the Global Arrays wrapper macros) were not created, without a single indication of why. You can only see whether they were created by inspecting the library with nm (nm -Ca libga.a).
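
For example, a check along these lines (assuming the library ends up under $NWCHEM_TOP/lib/$NWCHEM_TARGET in your build) lists the defined symbols that carry the trailing underscore:

cd $NWCHEM_TOP/lib/$NWCHEM_TARGET
# defined text (T) symbols ending in an underscore are the Fortran wrappers
nm -Ca libga.a | grep ' T ' | grep '_$'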

The "compatibility mode" ifort environment settings, which produced an as-yet-untested binary:
setenv MKLROOT /apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl
setenv NWCHEM_TARGET LINUX64
setenv BLASOPT "-L/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm"
setenv BLAS_LIB "-L/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm"
setenv BLAS_SIZE 8
setenv ARMCI_NETWORK OPENIB
setenv FC ifort
setenv CC icc
setenv CXX icpc
setenv FXX ifort
setenv MPI_CXX icpc
setenv MPI_CC icc
setenv MPI_F90 ifort
setenv FOPTIMIZE "-O3 -xavx -no-prec-div -funroll-loops"
setenv COPTIMIZE "-O3 -xavx -no-prec-div -funroll-loops"
setenv FFLAGS " -i8 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv FOPTIONS " -i8 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CFLAGS " -DMKL_ILP64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv COPTIONS " -DMKL_ILP64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CXXFLAGS " -DMKL_ILP64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CXXOPTIONS " -DMKL_ILP64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"


The remaining variables necessary for compilation are detected successfully by the build_nwchem script:
contrib/distro-tools/build_nwchem >& build.log

Clicked A Few Times
Update
Yesterday's tests were successful, so today I tried enabling threading in MKL (OpenMP):
setenv MKLROOT /apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl
setenv NWCHEM_TARGET LINUX64
setenv BLASOPT "-L/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/lib/intel64  -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -lpthread -lm"
setenv BLAS_LIB "-L/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/lib/intel64  -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -lpthread -lm"
setenv BLAS_SIZE 8
setenv SCALAPACK_SIZE 8
setenv ARMCI_NETWORK OPENIB
setenv FC ifort
setenv CC icc
setenv CXX icpc
setenv FXX ifort
setenv MPI_CXX icpc
setenv MPI_CC icc
setenv MPI_F90 ifort
setenv FOPTIMIZE "-O3  -no-prec-div -funroll-loops"
setenv COPTIMIZE "-O3  -no-prec-div -funroll-loops"
setenv FFLAGS " -i8 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv FOPTIONS " -i8 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CFLAGS " -DMKL_ILP64 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv COPTIONS " -DMKL_ILP64 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CXXFLAGS " -DMKL_ILP64 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CXXOPTIONS " -DMKL_ILP64 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv USE_OPENMP y


However, this resulted in a binary that gives segmentation faults. I'll try more conservative optimisations.

For the record, these are the full environment settings reported by build_nwchem:

NWCHEM_TOP     = /home/hpc/mptf/mptf21/Prog/nwchem-6.3-lima
NWCHEM_TARGET  = LINUX64
NWCHEM_MODULES = all
USE_MPI        = y
USE_MPIF       = y
USE_MPIF4      = y
MPI_INCLUDE    = -I/apps/OpenMPI/1.7.2-intel13.1/include -I/apps/OpenMPI/1.7.2-intel13.1/lib
MPI_LIB        = -L/apps/OpenMPI/1.7.2-intel13.1/lib
LIBMPI         = -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi
MPI_F90        = ifort
MPI_CC         = icc
MPI_CXX        = icpc
ARMCI_NETWORK  = OPENIB
MSG_COMMS      = MPI
IB_INCLUDE     = /usr/include
IB_LIB         = /usr/lib64
IB_LIB_NAME    = -libverbs
BLASOPT        = -L/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -lpthread -lm

Clicked A Few Times
Comparing ARMCI, ComEx, threading and Phi
After the QA tests (which I recommend comparing by hand; sometimes a "Failure" is a false alarm) I ran a small, straightforward benchmark on a single node, and I find the results surprising.


Average timings for the "armci", "armci+openmp" and "comex" builds. Exactly the same job script: an SCF + DFT energy job on a medium-sized system. 1 node (40 processors, 64 GB RAM), 10 processors used, OMP_NUM_THREADS=4 where applicable.
Timings:
armci                   : Total times  cpu:  6401.5s   wall:  6414.5s
armci + openmp          : Total times  cpu:  6649.3s   wall:  6382.4s
comex                   : Total times  cpu: 31301.5s   wall: 31351.1s
armci + openmp with phi : Total times  cpu:  6610.3s   wall:  6352.4s


1. Nothing is offloaded to the Phi card.
2. Sampling the process for 15 minutes showed no threading (a quick way to check is sketched below); at least dgemm should have gone threaded within that period.
3. The comex binary on a single node is extremely slow (~5x).
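
A simple way to take such a sample, assuming the rank processes show up under the name nwchem (the pgrep pattern is an assumption about your binary name):

# show the per-thread view of one running nwchem rank
top -H -p `pgrep -n nwchem`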

I really don't know what went wrong. The binary seems to contain the correct references, but are they ever called?

Is it normal for COMEX to be this slow?


Environment settings for Intel MIC automatic offload
OFFLOAD_REPORT=2
MKL_MIC_ENABLE=1
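
For completeness, in csh these would be set in the job script before mpirun, roughly as below. Note that MKL automatic offload only triggers for sufficiently large GEMM calls, so a small test case may legitimately show no offload; the MKL_MIC_WORKDIVISION line is optional and assumes one MIC card per node.

# request MKL automatic offload and a per-call offload report
setenv MKL_MIC_ENABLE 1
setenv OFFLOAD_REPORT 2
# optionally force a share of the work onto the coprocessor
setenv MKL_MIC_WORKDIVISION 0.5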

Clicked A Few Times
Environment settings used in compiling the "comex" binary:
setenv MKLROOT /apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl
setenv NWCHEM_TARGET LINUX64
setenv BLASOPT "-L/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm"
setenv BLAS_LIB "-L/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm"
setenv BLAS_SIZE 8
setenv SCALAPACK_SIZE 8
setenv ARMCI_NETWORK MPI-TS
setenv FC ifort
setenv CC icc
setenv CXX icpc
setenv FXX ifort
setenv MPI_CXX icpc
setenv MPI_CC icc
setenv MPI_F90 ifort
setenv FOPTIMIZE "-O3 -xavx -no-prec-div -funroll-loops"
setenv COPTIMIZE "-O3 -xavx -no-prec-div -funroll-loops"
setenv FFLAGS " -i8 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv FOPTIONS " -i8 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CFLAGS " -DMKL_ILP64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv COPTIONS " -DMKL_ILP64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CXXFLAGS " -DMKL_ILP64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CXXOPTIONS " -DMKL_ILP64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv NWCHEM_MODULES all

Clicked A Few Times
Environment settings used in compiling the "armci+openmp" binary (OpenIB drivers are used):
setenv MKLROOT /apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl
setenv NWCHEM_TARGET LINUX64
setenv BLAS_LIB "-L/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -lpthread -lm"
setenv LAPACK_LIB "-L/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/lib/intel64 -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -lpthread -lm"
setenv BLAS_SIZE 8
setenv LAPACK_SIZE 8
setenv SCALAPACK_SIZE 8
setenv ARMCI_NETWORK OPENIB
setenv FC ifort
setenv CC icc
setenv CXX icpc
setenv FXX ifort
setenv MPI_CXX icpc
setenv MPI_CC icc
setenv MPI_F90 ifort
setenv FOPTIMIZE "-O3 -openmp -xavx -no-prec-div -funroll-loops"
setenv COPTIMIZE "-O3 -openmp -xavx -no-prec-div -funroll-loops"
setenv FFLAGS " -i8 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv FOPTIONS " -i8 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CFLAGS " -DMKL_ILP64 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv COPTIONS " -DMKL_ILP64 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CXXFLAGS " -DMKL_ILP64 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv CXXOPTIONS " -DMKL_ILP64 -openmp -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include/intel64/ilp64 -I/apps/intel/ComposerXE2013/composer_xe_2013.5.192/mkl/include"
setenv USE_OPENMP y
setenv NWCHEM_MODULES all
setenv USE_MPI y
setenv USE_MPIF y
setenv USE_MPIF4 y
setenv MPI_INCLUDE "-I/apps/OpenMPI/1.6.5-intel13.1/include -I/apps/OpenMPI/1.6.5-intel13.1/lib"
setenv MPI_LIB -L/apps/OpenMPI/1.6.5-intel13.1/lib
setenv LIBMPI "-lmpi_f90 -lmpi_f77 -lmpi -ldl -lm -lrt -lnsl -lutil"
setenv ARMCI_NETWORK OPENIB
setenv MSG_COMMS MPI
setenv IB_INCLUDE /usr/include
setenv IB_LIB /usr/lib64
setenv IB_LIB_NAME -libverbs

Clicked A Few Times
Single-node straightforward benchmarks for comex-ofa, mpi-pr and mpi-spawn
The same settings as above, with only ARMCI_NETWORK changed.

armci-openmp : Total times  cpu:  6649.3s   wall:  6382.4s
comex-ofa    : Total times  cpu:  6801.7s   wall:  6445.6s
mpi-spawn    : Total times  cpu:  6718.9s   wall:  6483.8s
mpi-pr       : Total times  cpu:  6667.2s   wall:  6428.9s


These communication methods seem to be a viable alternative performance-wise if you are having issues with ARMCI_DEFAULT_SHMMAX (an example of raising that limit is sketched below).
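
For reference, ARMCI_DEFAULT_SHMMAX is set at run time and its value is in MB; it also has to stay within the kernel's shared-memory limit. The value below is only an example:

# allow larger ARMCI shared-memory segments (in MB); check /proc/sys/kernel/shmmax first
setenv ARMCI_DEFAULT_SHMMAX 4096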

Clicked A Few Times
Successful ScaLAPACK compilation
This is a corner-cutting method; perhaps a better way exists.
  • The ScaLAPACK in Intel MKL 11 or newer causes strange segmentation faults with OpenMPI 1.6-1.8. These segmentation faults do not appear with Intel MPI.
  • The build_nwchem script overrides my settings, so I will not use it anymore.
  • I edited config/makefile.h so that the Intel compiler settings are applied to mpif90.
  • The single-CPU QA reference outputs are still 10 times faster than my fastest binary, even though my binary is much faster than, for example, the binary supplied with Ubuntu. How do they achieve that performance?
  • To do next: in order to compile NWChem for the MIC card itself, I have to somehow cross-compile GA, or make NWChem accept a cross-compiled one.

The following settings work and pass the QA tests when compiled with:
source settings.csh;cd src;make nwchem_config;make CC=mpicc FC=mpif90 CXX=mpicxx
settings.csh:
module purge
module load intelmpi/4.1.3.048-intel intel64/13.1up03

setenv NWCHEM_TARGET LINUX64
setenv NWCHEM_TOP /wherever nwchem sits, but name shortened from default/
setenv MPI_ROOT /wherever intel sits/intel/mpi/4.1.3.048/include64/

setenv LAPACK_LIB "-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
setenv BLAS_LIB "-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
setenv SCALAPACK_LIB "-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
setenv BLASOPT "-L$MKLROOT/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_core -lmkl_intel_thread -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
setenv USE_SCALAPACK_I8 y
setenv BLAS_SIZE 8
setenv LAPACK_SIZE 8
setenv SCALAPACK_SIZE 8
setenv FC ifort
setenv CC icc
setenv CXX icpc
setenv FXX ifort
setenv MPI_CXX icpc
setenv MPI_CC icc
setenv MPI_F90 ifort
setenv FOPTIMIZE "-O3 -openmp -static-intel  -no-prec-div -funroll-loops"
setenv COPTIMIZE "-O3 -openmp -static-intel  -no-prec-div -funroll-loops"
setenv FFLAGS " -i8 -openmp -static-intel -I$MKLROOT/include/intel64/ilp64 -I$MKLROOT/include"
setenv FOPTIONS " -i8 -openmp -static-intel -I$MKLROOT/include/intel64/ilp64 -I$MKLROOT/include"
setenv CFLAGS " -DMKL_ILP64 -openmp -static-intel -ftz -I$MKLROOT/include/intel64/ilp64 -I$MKLROOT/include"
setenv CFLAGS_FORGA " -DMKL_ILP64 -ftz -openmp -static-intel -I$MKLROOT/include/intel64/ilp64 -I$MKLROOT/include"
setenv COPTIONS " -DMKL_ILP64 -ftz -openmp -static-intel -I$MKLROOT/include/intel64/ilp64 -I$MKLROOT/include"
setenv CXXFLAGS " -DMKL_ILP64 -ftz -openmp -static-intel -I$MKLROOT/include/intel64/ilp64 -I$MKLROOT/include"
setenv CXXOPTIONS " -DMKL_ILP64 -ftz -openmp -static-intel -I$MKLROOT/include/intel64/ilp64 -I$MKLROOT/include"
setenv USE_OPENMP y
setenv NWCHEM_MODULES all
setenv USE_MPI y
setenv USE_MPIF y
setenv USE_MPIF4 y
setenv MPI_INCLUDE "-I$MPI_ROOT/include64 -I$MPI_ROOT/lib64"
setenv MPI_LIB -L$MPI_ROOT/lib
setenv LIBMPI "-lmpi_f90 -lmpi_f77 -lmpi -ldl -lm -lrt -lnsl -lutil"
setenv ARMCI_NETWORK OFA
setenv MSG_COMMS MPI
# The following settings are often correct for InfiniBand Linux clusters
setenv IB_INCLUDE /usr/include
setenv IB_LIB /usr/lib64
setenv IB_LIB_NAME -libverbs

Gets Around
Quote:Obm

The single-CPU QA reference outputs are still 10 times faster than my fastest binary, even though my binary is much faster than, for example, the binary supplied with Ubuntu. How do they achieve that performance?

Which QA tests? If they are tests with significant disk I/O, that can dominate the wall-clock time. It's possible that the single-processor QA outputs were generated on systems with a fast scratch disk, such as a RAID array, an SSD, or (even faster) something like Lustre running over a fast network.

Clicked A Few Times
details
Quote:Mernst Jun 22nd 1:57 am


Which QA tests? If they are tests with significant disk I/O, that can dominate the wall-clock time. It's possible that the single-processor QA outputs were generated on systems with a fast scratch disk, such as a RAID array, an SSD, or (even faster) something like Lustre running over a fast network.


Hi Mernst,

All of them.

For example h2o_cg_to_diag_ub3lyp: the reference output quotes 1.0 seconds, while mine took 14 seconds. That is a very serious difference.

I am working on a hardware Lustre filesystem. As far as I can see, system wait is negligible in most cases. The CPU is an Ivy Bridge Xeon.

I think it is Global Arrays, but I don't understand why it is so slow.

Gets Around
Hi Obm,

That performance is pretty dire. h2o_cg_to_diag_ub3lyp finishes in 0.5 seconds for me, using the June 2014 dev snapshot on a 2.4 GHz Ivy Bridge i7. I'm running Ubuntu 14.04, and I built NWChem using the GNU compilers and OpenBLAS 0.2.8.

You said that your binary runs faster than the one supplied by Ubuntu? I'd expect even the unoptimized Ubuntu package to take a second or less on that test on a modern Xeon. In my experience the default Ubuntu package performs close to a good build with the GNU compilers unless you're running post-HF calculations, where the BLAS really starts to matter and the performance gap widens. A build with the Intel compilers and MKL and no special changes is faster, but by less than a factor of two.

Clicked A Few Times
Quote:Mernst Jun 22nd 5:11 am

I'd expect even the unoptimized Ubuntu package to take a second or less on that test on a modern Xeon. In my experience the default Ubuntu package performs close to a good build with the GNU compilers unless you're running post-HF calculations, where the BLAS really starts to matter and the performance gap widens. A build with the Intel compilers and MKL and no special changes is faster, but by less than a factor of two.


Good morning Mernst,
May I ask whether you are using 64-bit integers (INTERFACE64 = 1 in OpenBLAS)? Which Global Arrays version and message-passing interface are you using? I am curious, and I am now setting up some timings comparing Intel MPI/MKL/compilers with OpenMPI/GCC/OpenBLAS. I will use the 64-bit interface in both builds, with ARMCI_NETWORK set to MPI-SPAWN. (Sadly, I cannot set this to OPENIB.)

Clicked A Few Times
Test input and NWChem timings
Test input:
START test2
TITLE "test 2"
memory global 64000 mb stack 2000 mb heap 1500 mb
GEOMETRY "large" noautoz
N                    -1.18815305     1.64295245    -0.19797254
N                    -1.71429898    -1.23789933     0.05187672
N                     1.71919840     1.24204116    -0.04004607
N                     1.19329270    -1.63905228     0.20931043
C                    -3.39234141     0.55140590    -0.18307121
C                    -2.55116767     1.67114843    -0.25656255
C                    -3.00650182    -0.78302079    -0.04065810
C                    -0.78925710     2.94324537    -0.30799030
C                    -1.68501065    -2.60423235     0.18366171
C                     0.53998837     3.39037732    -0.29831220
C                    -0.53479978    -3.38669954     0.30417660
C                    -3.04194847     3.03938524    -0.40944075
C                    -3.85584599    -1.93945436     0.03744348
H                    -4.46487217     0.73318080    -0.24273597
C                     1.69012467     2.60775608    -0.17760069
C                     0.79430625    -2.93942407     0.31589108
C                    -1.93904510     3.83567875    -0.44172139
C                    -3.05192907    -3.04745551     0.17375667
C                     3.39754200    -0.54765268     0.18864429
C                     3.01155038     0.78677082     0.04679689
C                     2.55630718    -1.66722017     0.26450348
H                     0.70250333     4.46305355    -0.39769395
H                    -0.69671819    -4.45989306     0.39874007
C                     3.05714314     3.05037064    -0.17603592
C                     1.94447052    -3.83227883     0.44539478
C                     3.86123572     1.94232936    -0.03987541
C                     3.04726942    -3.03598779     0.41290904
H                     4.47027977    -0.72966109     0.24346725
H                    -4.08659494     3.33147621    -0.48103833
H                    -4.94025155    -1.90798531    -0.00573877
H                    -1.89101055     4.91670010    -0.54524164
H                    -3.36116456    -4.08450190     0.26140222
H                     3.36594506     4.08689492    -0.27078111
H                     1.89662799    -4.91366014     0.54552638
H                     4.94587903     1.90997603    -0.00360067
H                     4.09206136    -3.32832389     0.48106937
H                    -0.89115074    -0.64366149     0.02720520
H                     0.89618219     0.64803626    -0.01343957
END
BASIS "medium" SPHERICAL
 * library 6-31+g*
END


echo
###################Production run
set "ao basis" "medium"
set geometry "large"
SCF
 vectors output test2.movecs
END

DFT
 iterations 100
 decomp
 direct
 XC b3lyp
 vectors input  test2.movecs
 vectors output test2.movecs
END

TASK SCF ENERGY
TASK DFT ENERGY

Run:
mpirun -np 40 (all processors in the node)
Timings (as reported by NWChem):
Intel ScaLAPACK : Total times  cpu:  432.4s   wall:  452.3s
GCC + OpenBLAS  : Total times  cpu:  833.7s   wall:  929.8s

Gets Around
Obm,

I am using the 64-bit integer interface to OpenBLAS, and I compiled OpenBLAS without threading (USE_THREAD=0) for NWChem.
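
A build along those lines, assuming a recent OpenBLAS source tree and the GNU compilers (the install prefix is only an example), would look like:

# 64-bit integer interface, no OpenBLAS-level threading
make INTERFACE64=1 USE_THREAD=0
make PREFIX=/opt/openblas-int64-serial install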

Among the several BLAS-linked programs I use, some prefer the 32-bit integer interface, some prefer 64-bit, and some prefer threading on or off. I actually have all four permutations of OpenBLAS threading and integer size installed, and I use wrapper scripts that set LD_LIBRARY_PATH so that each program finds its preferred OpenBLAS variant.
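
Such a wrapper can be as small as the csh sketch below (the install path is hypothetical and matches the example prefix above):

#!/bin/csh -f
# prepend the 64-bit-integer, non-threaded OpenBLAS build, then run nwchem
if ($?LD_LIBRARY_PATH) then
    setenv LD_LIBRARY_PATH /opt/openblas-int64-serial/lib:${LD_LIBRARY_PATH}
else
    setenv LD_LIBRARY_PATH /opt/openblas-int64-serial/lib
endif
exec nwchem $*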

I am using the default OpenMPI 1.6.5 packaged for Ubuntu 14.04.

I am using the bundled Global Arrays that came with the June 2014 NWChem snapshot: http://nwchemgit.github.io/download.php?f=Nwchem-dev.revision25716-src.2014-06-09.tar.gz

The bundled GA is svn revision 10496 of Global Arrays, but I don't know what official version number that corresponds to because I can't view the svn repository for Global Arrays. My ARMCI_NETWORK=SOCKETS.

I only have 4 physical cores -- I run NWChem on a laptop -- and this was the timing I saw for your test input above running on 4 cores:

Total times cpu: 3428.8s wall: 3442.3s

I don't know what sort of scaling to expect for this job on 40 cores, but your speedup seems to be substantial, unlike what you observed for tiny QA calculations.

Maybe you are only having speed problems with small jobs? I have seen mysterious slowdowns in single-node execution on my laptop depending on the ARMCI_NETWORK setting. See for example the later posts in this thread: http://nwchemgit.github.io/Special_AWCforum/st/id1303/compile_nwchem-6-3_with_open...

