Parallelization causes a steep increase in user time

3:07:12 AM PST - Sat, Jan 19th 2019

I'm running this nwchem.nw input on a single machine with 4 CPUs (8 virtual CPUs):

geometry nocenter noautosym
  C 0.2265688 -0.56580271 0.37053473
  N 1.53104275 -1.25352933 0.37698414
  C 0.24487845 0.94565579 0.82884627
  C -1.18278502 1.53035594 0.91007832
  C 0.94733822 1.06970643 2.19799655
  C -0.22962089 -0.60972299 -1.08178558
  O 0.51218343 -0.53807897 -2.07032273
  O -1.61832134 -0.67908207 -1.18057738
  H -0.50313117 -1.10973073 0.9958513
  H 2.01037607 -1.19295784 1.28583059
  H 2.11635497 -0.8912782 -0.39447838
  H 0.83331353 1.49436672 0.06509178
  H -1.741318 1.35179101 -0.02207033
  H -1.14539339 2.61527111 1.11156942
  H -1.73810741 1.04465878 1.7344188
  H 0.9927738 2.12810382 2.50783434
  H 1.98011451 0.68259268 2.15005599
  H 0.385448 0.50413825 2.96567565
  H -1.83911536 -0.65210679 -2.16786412
end
start
basis
  * library 3-21G
end

dft
  xc xpbe96 cpbe96
  mult 1
end

task dft gradient

Parallelizing speeds it up by less than 2x, even with 8 processes:

$ time nwchem nwchem.nw
# skip...
real    0m56.131s
user    0m53.700s
sys     0m1.297s

$ time mpirun -n 2 nwchem nwchem.nw
# skip...
real    0m45.799s
user    1m15.131s
sys     0m14.534s

$ time mpirun -n 4 nwchem nwchem.nw
# skip...
real    0m36.546s
user    1m48.988s
sys     0m33.324s

$ time mpirun -n 8 nwchem nwchem.nw
# skip...
real    0m32.027s
user    2m52.518s
sys     1m2.363s
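
To put numbers on this, here is a small Python sketch that computes the speedup, the parallel efficiency, and the growth in total CPU time from the timings above. The function names and the `runs` table are my own, made up for illustration; the values are just the `time` output converted to seconds.

```python
# Timings in seconds, copied from the `time` output above.
# Keys are the number of MPI processes (1 = plain `nwchem nwchem.nw`).
runs = {
    1: {"real": 56.131, "user": 53.700, "sys": 1.297},
    2: {"real": 45.799, "user": 75.131, "sys": 14.534},
    4: {"real": 36.546, "user": 108.988, "sys": 33.324},
    8: {"real": 32.027, "user": 172.518, "sys": 62.363},
}


def speedup(runs, n):
    """Wall-clock speedup of the n-process run relative to the 1-process run."""
    return runs[1]["real"] / runs[n]["real"]


def efficiency(runs, n):
    """Parallel efficiency: speedup divided by the number of processes."""
    return speedup(runs, n) / n


def cpu_growth(runs, n):
    """Total CPU time (user + sys) of the n-process run relative to 1 process."""
    total_n = runs[n]["user"] + runs[n]["sys"]
    total_1 = runs[1]["user"] + runs[1]["sys"]
    return total_n / total_1


if __name__ == "__main__":
    print(f"{'procs':>5} {'speedup':>8} {'efficiency':>11} {'cpu growth':>11}")
    for n in sorted(runs):
        print(f"{n:>5} {speedup(runs, n):>8.2f} "
              f"{efficiency(runs, n):>11.2f} {cpu_growth(runs, n):>11.2f}")
```

With ideal scaling the speedup would equal the number of processes and the CPU growth would stay near 1, so the table makes it easy to see how far each run is from that.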

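Increasing the number of MPI processes causes a steep increase in user time (from 53.7s with one process to 2m52.5s with eight), while the wall-clock time only drops from 56.1s to 32.0s.
Is something wrong? Why doesn't it speed up more?
I'm using mpich-3.2.1 on FreeBSD.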

Would using OpenMPI improve the performance in this case?