SCF Performance for Different ARMCI Network on Socket-based KNL Cluster


Thanks Jeff!
However, I believe the main reason my copy of ARMCI-MPI+CASPER is much slower is that it runs with only a single thread, while the other builds use at least 3-4 threads on each MPI rank. I don't know why the OpenMP and MKL threads are not running in this case.
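One quick check for the single-thread symptom is to print the threading variables that a process actually sees. The sketch below assumes the thread count is controlled through the standard OMP_NUM_THREADS and MKL_NUM_THREADS environment variables. The helper name report_threads is hypothetical, and nothing here inspects ARMCI-MPI or Casper itself.

```shell
#!/bin/sh
# Print the thread-related variables visible to this process.
# report_threads is a hypothetical helper, not part of NWChem or Casper.
report_threads() {
    for var in OMP_NUM_THREADS MKL_NUM_THREADS; do
        # eval gives indirect expansion in plain POSIX sh
        eval "val=\${$var:-not-set}"
        echo "$var=$val"
    done
}

report_threads
```

Running it through the same MPI launcher and job script used for NWChem should show whether the settings reach every rank.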

Thanks to Edo's help, I managed to complete the (T) calculation using the source code downloaded from the developer trunk (built with USE_F90_ALLOCATABLE=T and USE_KNL=y, which adds -DINTEL_64ALIGN to the compile flags).
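For reference, the two build settings mentioned above can be exported before building. This is only a sketch of the environment step: it assumes a POSIX shell, uses only the variable names quoted in this post, and leaves the rest of the build configuration to your existing setup.

```shell
# Settings quoted in this post for the developer-trunk build.
export USE_F90_ALLOCATABLE=T
export USE_KNL=y

# Confirm what the build will see.
echo "USE_F90_ALLOCATABLE=$USE_F90_ALLOCATABLE USE_KNL=$USE_KNL"
```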

CCSD(T)
Using plain CCSD(T) code
Using sliced CCSD(T) code

CCSD[T] correction energy / hartree = -0.150973709276513
CCSD[T] correlation energy / hartree = -3.067895958443244
CCSD[T] total energy / hartree = -844.403379199049482
CCSD(T) correction energy / hartree = -0.147995713401607
CCSD(T) correlation energy / hartree = -3.064917962568339
CCSD(T) total energy / hartree = -844.400401203174624
Cpu & wall time / sec 66318.6 20408.5

Parallel integral file used 12303 records with 0 large values


Task times cpu: 68936.9s wall: 22442.4s



NWChem Input Module
-------------------


Summary of allocated global arrays
-----------------------------------
No active global arrays
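
As a rough sanity check on threading, the CPU and wall times in the output above can be compared. The sketch below just divides the numbers copied from that output. Reading the ratio as the average number of busy threads per rank is an assumption, since it depends on how the CPU time is accumulated.

```shell
# Ratio of CPU time to wall time for the two timing lines above.
# Interpreting the ratio as "threads kept busy" is an assumption.
awk 'BEGIN {
    printf "CCSD(T) step: %.2f\n", 66318.6 / 20408.5
    printf "whole task:   %.2f\n", 68936.9 / 22442.4
}'
```

Both ratios come out a little above 3, which would be consistent with more than one thread doing work in this build.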



On the other hand, in the developer trunk, the updated subroutine grad_v_lr_loca in src/nwpw/pspw/lib/psp/psp.F fails to compile with ifort Version 17.0.1.132 when -O2/-O3 and -qopenmp are used together, because of an unknown compiler error:

ifort  -c -i8 -align -fpp -qopt-report-file=stderr -qopenmp -qopt-report-phase=openmp -qno-openmp-offload -fimf-arch-consistency=true -finline-limit=250 -O3  -unroll  -ip -xMIC-AVX512  -I.  -I/home/users/astar/ihpc/chiensh/nwchem-dev/src/include -I/home/users/astar/ihpc/chiensh/nwchem-dev/src/tools/install/include -DEXT_INT -DLINUX -DLINUX64 -DUSE_OPENMP  -DIFCV8 -DIFCLINUX -DINTEL_64ALIGN -DINTEL_64ALIGN -DSCALAPACK -DPARALLEL_DIAG -DUSE_F90_ALLOCATABLE psp.F
Intel(R) Advisor can now assist with vectorization and show optimization
report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


Begin optimization report for: PSP_PRJ_INDX_ALLOC_SW1A_SW2A

Report from: OpenMP optimizations [openmp]

psp.F(684:7-684:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_: OpenMP multithreaded code generation for MASTER was successful
psp.F(687:7-687:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_: OpenMP multithreaded code generation for BARRIER was successful
psp.F(700:7-700:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_: OpenMP multithreaded code generation for MASTER was successful
psp.F(704:7-704:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_: OpenMP multithreaded code generation for BARRIER was successful
===========================================================================
ifort: error #10105: /home/users/app/intel/compilers_and_libraries_2017.1.132/linux/bin/intel64/fortcom: core dumped
ifort: warning #10102: unknown signal(-326317584)
Segmentation fault (core dumped)
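
To narrow down which optimization levels trigger the crash, one option is to recompile the single file at each level and record whether the compiler exits cleanly. This is a minimal sketch: try_opt_levels, FC and EXTRA_FLAGS are hypothetical names, and EXTRA_FLAGS would need to carry the remaining flags and include paths from your own build line.

```shell
# Hypothetical helper: compile one file at several optimization levels with
# -qopenmp and report which levels make the compiler fail.
FC=${FC:-ifort}
EXTRA_FLAGS=${EXTRA_FLAGS:-}

try_opt_levels() {
    src=$1
    for opt in -O1 -O2 -O3; do
        # $EXTRA_FLAGS is left unquoted on purpose so it splits into flags
        if $FC -c -qopenmp $opt $EXTRA_FLAGS "$src" >/dev/null 2>&1; then
            echo "$opt: ok"
        else
            echo "$opt: compiler failed"
        fi
    done
}
```

Calling `try_opt_levels psp.F` from the directory containing the file prints one line per level.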


A small modification, shown below, seems to fix the problem:

c     psp.F, around line 1236: declare the extra index variable
c       integer ftmp(2)
      integer ftmp(2),ftemp
...
c     psp.F, around line 1436: compute the offset once per iteration
!$OMP DO
      do j=1,nion
        ftemp=ftmp(1)+3*(j-1)
        fion(1,j) = fion(1,j) + dbl_mb(ftemp)
        fion(2,j) = fion(2,j) + dbl_mb(ftemp+1)
        fion(3,j) = fion(3,j) + dbl_mb(ftemp+2)

c       fion(1,j) = fion(1,j) + dbl_mb(ftmp(1)+3*(j-1))
c       fion(2,j) = fion(2,j) + dbl_mb(ftmp(1)+3*(j-1)+1)
c       fion(3,j) = fion(3,j) + dbl_mb(ftmp(1)+3*(j-1)+2)
      end do
!$OMP END DO