SCF Performance for Different ARMCI Network on Socket-based KNL Cluster

9:44:45 PM PDT - Fri, Apr 7th 2017

Thanks Jeff!
However, I believe the main reason my build of ARMCI-MPI+CASPER is much slower is that it runs with only a single thread per MPI rank, while the other builds use at least 3-4 threads per rank. I don't know why the OpenMP and MKL threads are not active in this case.
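One way to narrow down the single-thread behaviour is to print the thread-related environment each rank actually inherits. The following is only a minimal sketch: `report_threads` is a hypothetical helper name, and it assumes a Linux node where `/proc/self/status` is readable. It does not explain why the threads are missing; it only shows what the OpenMP and MKL runtimes would see.

```shell
#!/bin/sh
# Hypothetical helper: print the thread settings this process inherits.
report_threads() {
    for var in OMP_NUM_THREADS MKL_NUM_THREADS MKL_DYNAMIC; do
        # Indirect lookup of $var; prints "unset" when the variable is missing.
        eval "val=\${$var:-unset}"
        echo "$var=$val"
    done
    # On Linux, show which cores the process may run on. A launcher that pins
    # a rank to one core leaves all of its threads sharing that core.
    if [ -r /proc/self/status ]; then
        grep Cpus_allowed_list /proc/self/status || true
    fi
}

report_threads
```

Saving this as a script (for example a hypothetical `report_threads.sh`) and starting it with the same launcher and options as the NWChem job shows, rank by rank, whether the variables and the core mask differ between the ARMCI builds.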

Thanks to Edo's help, I managed to complete the (T) calculation using source code downloaded from the developer trunk, built with USE_F90_ALLOCATABLE=T and USE_KNL=y set, which adds -DINTEL_64ALIGN to the compile flags:

CCSD(T)
Using plain CCSD(T) code
Using sliced CCSD(T) code

CCSD[T]  correction energy / hartree =        -0.150973709276513
CCSD[T] correlation energy / hartree =        -3.067895958443244
CCSD[T] total energy / hartree       =      -844.403379199049482
CCSD(T)  correction energy / hartree =        -0.147995713401607
CCSD(T) correlation energy / hartree =        -3.064917962568339
CCSD(T) total energy / hartree       =      -844.400401203174624
Cpu & wall time / sec        66318.6        20408.5

Parallel integral file used   12303 records with       0 large values

Task  times  cpu:    68936.9s     wall:    22442.4s

                               NWChem Input Module
                                -------------------

 Summary of allocated global arrays
-----------------------------------
  No active global arrays
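For anyone repeating this build, the settings mentioned above can be written as environment variables before running make. This is a sketch, not my exact build script: USE_KNL and USE_F90_ALLOCATABLE are the values quoted above, while USE_OPENMP=1 and NWCHEM_TARGET=LINUX64 are assumptions inferred from the -DUSE_OPENMP and -DLINUX64 flags in the ifort command line.

```shell
#!/bin/sh
# Settings quoted in this post.
export USE_KNL=y
export USE_F90_ALLOCATABLE=T

# Assumptions inferred from the compile flags, not stated explicitly above.
export USE_OPENMP=1
export NWCHEM_TARGET=LINUX64

# The usual NWChem build would then be started from the src directory, e.g.:
#   make FC=ifort
echo "USE_KNL=$USE_KNL USE_F90_ALLOCATABLE=$USE_F90_ALLOCATABLE USE_OPENMP=$USE_OPENMP"
```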

On the other hand, in the developer trunk, the updated subroutine grad_v_lr_loca in src/nwpw/pspw/lib/psp/psp.F fails to compile with ifort Version 17.0.1.132 when -O2/-O3 and -qopenmp are used together. The compiler itself crashes with an internal error:

ifort  -c -i8 -align -fpp -qopt-report-file=stderr -qopenmp -qopt-report-phase=openmp -qno-openmp-offload -fimf-arch-consistency=true -finline-limit=250 -O3  -unroll  -ip -xMIC-AVX512  -I.  -I/home/users/astar/ihpc/chiensh/nwchem-dev/src/include -I/home/users/astar/ihpc/chiensh/nwchem-dev/src/tools/install/include -DEXT_INT -DLINUX -DLINUX64 -DUSE_OPENMP  -DIFCV8 -DIFCLINUX -DINTEL_64ALIGN -DINTEL_64ALIGN -DSCALAPACK -DPARALLEL_DIAG -DUSE_F90_ALLOCATABLE psp.F

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Begin optimization report for: PSP_PRJ_INDX_ALLOC_SW1A_SW2A

   Report from: OpenMP optimizations [openmp]

psp.F(684:7-684:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_:  OpenMP multithreaded code generation for MASTER was successful
psp.F(687:7-687:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_:  OpenMP multithreaded code generation for BARRIER was successful
psp.F(700:7-700:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_:  OpenMP multithreaded code generation for MASTER was successful
psp.F(704:7-704:7):OMP:psp_prj_indx_alloc_sw1a_sw2a_:  OpenMP multithreaded code generation for BARRIER was successful
===========================================================================
ifort: error #10105: /home/users/app/intel/compilers_and_libraries_2017.1.132/linux/bin/intel64/fortcom: core dumped
ifort: warning #10102: unknown signal(-326317584)
Segmentation fault (core dumped)

The small modification shown below (a scalar ftemp holds the index instead of evaluating ftmp(1)+3*(j-1) inside each dbl_mb reference) seems to fix the problem:

1236c       integer ftmp(2)
1237       integer ftmp(2),ftemp
...
1436 !$OMP DO
1437       do j=1,nion
1438          ftemp=ftmp(1)+3*(j-1)
1439          fion(1,j) = fion(1,j) + dbl_mb(ftemp)
1440          fion(2,j) = fion(2,j) + dbl_mb(ftemp+1)
1441          fion(3,j) = fion(3,j) + dbl_mb(ftemp+2)
1442
1443 c         fion(1,j) = fion(1,j) + dbl_mb(ftmp(1)+3*(j-1))
1444 c         fion(2,j) = fion(2,j) + dbl_mb(ftmp(1)+3*(j-1)+1)
1445 c         fion(3,j) = fion(3,j) + dbl_mb(ftmp(1)+3*(j-1)+2)
1446       end do
1447 !$OMP END DO