CCSD(T) OpenMP Threads


I have been studying the performance of CCSD(T)/aug-cc-pVQZ calculations on a (H2O)6 cluster (~1000 basis functions)
using both the TCE and conventional algorithms on a KNL cluster, with a copy of the source code obtained from the developer trunk.

OpenMP has been implemented and enabled ("NWChem w/ OpenMP: maximum threads = 4" is printed at the top of the output).

I run multiple MPI ranks and OpenMP threads on each node. However, for the conventional CCSD algorithm, I found that no matter what OMP_NUM_THREADS is set to, only 1 OpenMP thread is used in the CCSD iterations ("Using 1 OpenMP thread(s) in CCSD" is printed in the output). In addition, "top" shows that only 1 MPI rank is running 4 threads and 1 rank is using 2 threads, while all other ranks are single-threaded. Is this the expected behavior, or is there a load-balancing problem?
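For reference, the thread count one would expect if every rank used its full OpenMP allotment can be worked out from the launch settings quoted in this post (OMP_NUM_THREADS=4, -perhost 11, -np 220). The sketch below is only arithmetic on those three numbers; the function name `expected_layout` is made up for illustration and nothing here queries NWChem or the cluster.

```python
def expected_layout(ranks_total, ranks_per_node, omp_threads):
    """Return (nodes, threads per node, threads in total) for a hybrid
    MPI+OpenMP launch, assuming every rank runs all of its threads."""
    nodes = ranks_total // ranks_per_node
    threads_per_node = ranks_per_node * omp_threads
    return nodes, threads_per_node, nodes * threads_per_node


# Values copied from the post: mpirun -perhost 11 -np 220, OMP_NUM_THREADS=4.
nodes, per_node, total = expected_layout(220, 11, 4)
print(f"nodes={nodes} threads/node={per_node} threads total={total}")
```

With these settings the job spans 20 nodes, and a fully threaded run would show 44 busy threads per node.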

Thanks!

 >echo $OMP_NUM_THREADS
4
>mpirun -perhost 11 -np 220 $EXE0 w6cage_ccsd.nw

 argument  1 = w6cage_ccsd.nw
NWChem w/ OpenMP: maximum threads = 4
============================== echo of input deck ==============================
echo

start w6cage_ccsd

memory stack 8000 mb heap 100 mb global 10000 mb noverify
...
***** ccsd parameters *****
iprt = 0
convi = 0.100E-03
maxit = 20
mxvec = 5
memory 1060598348
Using 1 OpenMP thread(s) in CCSD
IO offset 20.0000000000000
IO error message >End of File
file_read_ga: failing reading from ./w6cage_ccsd.t2
Failed reading restart vector from ./w6cage_ccsd.t2
Using MP2 initial guess vector


-------------------------------------------------------------------------
iter   correlation     delta        rms        T2      Non-T2    Main
       energy          energy       error      ampl    ampl      Block
                                               time    time      time
-------------------------------------------------------------------------
1      -1.7186198644   -1.719D+00   5.469D-01  4530.96 0.11      4443.36
2      -1.7539631587   -3.534D-02   2.744D-01  4524.76 0.11      4445.22

Top:
top - 10:53:41 up 2 days, 19:42,  1 user,  load average: 13.91, 13.81, 13.89
Tasks: 2169 total, 588 running, 1581 sleeping, 0 stopped, 0 zombie
%Cpu(s): 21.7 us, 0.8 sy, 0.0 ni, 77.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 19767712+total, 14988190+free, 41493892 used, 6301332 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 14941336+avail Mem

PID USER PR NI VIRT RES SHR S  %CPU %MEM TIME+ COMMAND
21044 chiensh 20 0 5022244 2.483g 2.222g R 383.1 1.3 625:55.13 nwchem
21039 chiensh 20 0 12.614g 3.846g 239088 R 145.3 2.0 253:17.89 nwchem
21036 chiensh 20 0 12.626g 3.883g 270192 R 100.0 2.1 261:12.81 nwchem
21037 chiensh 20 0 12.626g 4.077g 269976 R 100.0 2.2 256:56.85 nwchem
21038 chiensh 20 0 12.611g 3.848g 239156 R 100.0 2.0 253:11.01 nwchem
21043 chiensh 20 0 12.611g 3.801g 181656 R 100.0 2.0 255:17.57 nwchem
21034 chiensh 20 0 12.644g 4.031g 302296 R 99.7 2.1 257:55.96 nwchem
21035 chiensh 20 0 12.625g 3.979g 272528 R 99.7 2.1 259:58.57 nwchem
21040 chiensh 20 0 12.614g 3.809g 239108 R 99.7 2.0 254:07.24 nwchem
21041 chiensh 20 0 12.610g 3.798g 238896 R 99.7 2.0 253:51.30 nwchem
21042 chiensh 20 0 12.611g 3.839g 190128 R 99.7 2.0 254:17.43 nwchem


Input
 echo
start w6cage_ccsd
memory stack 8000 mb heap 100 mb global 10000 mb noverify
geometry units angstrom noautoz noprint
...
end

basis "ao basis" spherical noprint
* library aug-cc-pvqz
end

scf
vectors input w6cage_ccsd.movecs
semidirect memsize 100000000 filesize 0
singlet
rhf
thresh 1e-7
tol2e 1e-14
end

ccsd
freeze atomic
NODISK
thresh 1e-4
end

task ccsd(t) energy

set ccsd:use_trpdrv_nb T
set ccsd:use_ccsd_omp T
set ccsd:use_trpdrv_omp T