CCSD(T) Calculation with Quadruple Zeta Basis Set -- Memory Issue

Hello NWChem Developer,

I am trying to run CCSD(T) energy calculations on a 5-atom system with a quadruple-zeta basis set, but the 2-e integral file is far larger than I expected (> 100 GB). The input file reads:


memory stack 1300 mb heap 200 mb global 1500 mb

start im

title "im"
charge 1

geometry units angstroms print xyz noautosym noautoz
C                  -2.23423902     0.59425408    -0.03224283
O -1.12129315 1.09129114 -0.09445519
O -3.30588587 0.19083810 0.02028232
Br 1.41553615 -0.39477191 0.02227492
H -0.18608027 0.45084374 -0.04234683

C  library aug-cc-pvqz
H library aug-cc-pvqz
O library aug-cc-pvqz
# BASIS SET: (15s,12p,13d,3f,2g) -> [7s,6p,5d,3f,2g]
Br S
 78967.5000000              0.0000280             -0.0000110
11809.7000000 0.0002140 -0.0000860
2687.1400000 0.0010560 -0.0004350
760.0360000 0.0036880 -0.0014570
241.8110000 0.0079340 -0.0033810
38.4914000 0.1528680 -0.0576580
24.0586000 -0.2786020 0.1123250
14.3587000 -0.2188500 0.0756730
... (to keep it short)

Br nelec 10
Br ul
2 1.0000000 0.0000000
Br S

THRESH 1.0e-5
TOL2E 1e-7

FREEZE atomic
thresh 1e-6
maxiter 100

task tce

The error message reads

2-e (intermediate) file size =    106977507300
2-e (intermediate) file name = ./im.v2i
available GA memory 1572841816 bytes
available GA memory 1572841816 bytes
available GA memory 1572841816 bytes
available GA memory 1572841816 bytes
available GA memory 1572841816 bytes
available GA memory 1572841816 bytes
available GA memory 1572841816 bytes
available GA memory 1572841816 bytes
available GA memory 1572841816 bytes
available GA memory 1572841816 bytes
createfile: failed ga_create size/nproc bytes 5348875365

I could change the memory options at the beginning of the file but it just seems a little unrealistic to have GA as large as 100G for the nodes that I am using (two Intel Xeon E5-2680v2 “Ivy Bridge” 10-core, 2.8GHz processors, which is 20 cores total, and 128 GB of memory, 6.8GB per core).

I have also tried different "IO" and "2emet" options, for example,

FREEZE atomic
thresh 1e-6
maxiter 100
2emet 13
tilesize 10
attilesize 40
set tce:xmem 100

 tilesize 2
io ga
2EMET 15
idiskx 1
FREEZE atomic
thresh 1e-6
maxiter 100

but the job seems to hang there after printing out "v2 file size = "

Any insight on this issue is greatly appreciated!

Thank you in advance,

createfile: failed ga_create size/nproc bytes          5348875365

5348875365 bytes = 5348875365/1024/1024/1024 GB = 4.98 GB

Please change the memory line to

memory stack 1300 mb heap 200 mb global 6000 mb
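
The conversion above can be checked with a few lines of Python. This is only a sketch: the two byte counts are copied from the error output earlier in the thread, and it assumes that an NWChem "mb" is 2**20 bytes (which is consistent with the reported 1572841816 bytes for "global 1500 mb"). The helper name needed_global_mb is made up for illustration.

```python
# Figures copied from the error output in the first post.
requested_bytes = 5348875365   # "createfile: failed ga_create size/nproc bytes"
available_bytes = 1572841816   # "available GA memory ... bytes"

MIB = 1024 * 1024              # assumption: one NWChem "mb" is 2**20 bytes
GIB = 1024 * MIB


def needed_global_mb(nbytes):
    """Smallest whole number of mb that can hold nbytes (ceiling division)."""
    return -(-nbytes // MIB)


requested_gb = requested_bytes / GIB
print(f"requested per process: {requested_gb:.2f} GB")
print(f"available per process: {available_bytes / MIB:.0f} mb")
print(f"minimum 'global' value: {needed_global_mb(requested_bytes)} mb")
```

The request is a little over 5100 mb per process, so the original "global 1500 mb" cannot satisfy it, while "global 6000 mb" leaves some headroom.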

Thank you for the prompt response, Edoapra.

I had to adjust the memory to
memory stack 1000 mb heap 100 mb global 5300 mb

so it does not exceed the memory of the core (6.8 GB)

but now run into an error like the following:
slurmstepd: error: Step 3840722.0 exceeded memory limit (123363455 > 122880000), being killed
slurmstepd: error: Step 3840722.0 exceeded memory limit (123618673 > 122880000), being killed
slurmstepd: error: Step 3840722.0 exceeded memory limit (123451708 > 122880000), being killed
slurmstepd: error: *** STEP 3840722.0 ON prod2-0143 CANCELLED AT 2018-06-20T04:05:00 ***
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: Exceeded job memory limit
srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
srun: error: prod2-0148: tasks 100-119: Killed
srun: error: prod2-0150: tasks 140-159: Killed
srun: error: prod2-0149: tasks 120-139: Killed
slurmstepd: error: _get_pss: ferror() indicates error on file /proc/156552/smaps
slurmstepd: error: _get_pss: ferror() indicates error on file /proc/135960/smaps
srun: error: prod2-0145: tasks 41,43,45,47,49,51,53,55,57,59: Killed
srun: error: prod2-0146: tasks 63,65,69,71,75,77,79: Killed
slurmstepd: error: _get_pss: ferror() indicates error on file /proc/234980/smaps
srun: error: prod2-0145: tasks 40,42,44,46,48,50,52,54,56,58: Killed
srun: error: prod2-0146: tasks 61,67,73: Killed
srun: error: prod2-0143: tasks 0-19: Killed
slurmstepd: error: _get_pss: ferror() indicates error on file /proc/77821/smaps
srun: error: prod2-0146: tasks 60,62,64,66,68,70,72,74,76,78: Killed
srun: error: prod2-0144: tasks 20-39: Killed
slurmstepd: error: _get_pss: ferror() indicates error on file /proc/17624/smaps
srun: error: prod2-0147: tasks 80-99: Killed
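
A rough way to see why Slurm killed the step is to add up what every process on a node is allowed to allocate. The sketch below assumes the limit in the slurmstepd message is in KB, that an NWChem "mb" is 2**20 bytes, and that 20 tasks run on each node (the log shows tasks 0-19 on one node). The function name node_demand_kb is hypothetical.

```python
def node_demand_kb(stack_mb, heap_mb, global_mb, tasks_per_node):
    """Upper bound on the memory one node must provide, in KB."""
    per_process_mb = stack_mb + heap_mb + global_mb
    return per_process_mb * tasks_per_node * 1024


SLURM_LIMIT_KB = 122880000  # from "exceeded memory limit (... > 122880000)"

demand = node_demand_kb(1000, 100, 5300, tasks_per_node=20)
print("per-node demand (KB):", demand)
print("per-node limit  (KB):", SLURM_LIMIT_KB)
print("over the limit:", demand > SLURM_LIMIT_KB)
```

With 6400 mb per process and 20 processes per node, the total that NWChem may claim is larger than the per-node limit, so the step can be killed once enough of that memory is actually touched.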

I have also tried another memory allocation
memory stack 400 mb heap 100 mb global 6000 mb

and it yielded a different error
2-e (intermediate) file size = 107432197225
2-e (intermediate) file name = ./vim.v2i
tce_ao2e: MA problem k_ijkl 18
current input line :
For more information see the NWChem manual at

For further details see manual section:

Currently I am using 160 cores -- Do you think I should try to use more cores so the GA allocation on each core is less?

Thank you very much,

More CPUs, still failed
In the hope of reducing the memory requirement on each core, I tested the job with 200 cores (up from 160). However, it seems the computer could not allocate the requested amount of memory for MA. For example, the memory line reads:
memory stack 900 mb heap 200 mb global 4300 mb

but the error message shows:
tce_ao2e: fast2e=1
half-transformed integrals in memory

2-e (intermediate) file size =    107432197225
2-e (intermediate) file name = ./vim.v2i
Cpu & wall time / sec 214.7 266.1
available GA memory 211394680 bytes
createfile: failed ga_create size/nproc bytes 3079838825
current input line :
129: task tce

even though clearly the input file was trying to allocate 4300 mb for GA.

Would you please let me know how to fix this?

Thank you,

Please report the tce input block you are currently using and the number of processors.

The TCE input block reads:

FREEZE atomic
thresh 1e-6
maxiter 100

I am currently using 10 nodes with 20 cores per node. The memory on each core is 6GB. The job script reads:
#SBATCH --job-name=vim
#SBATCH --partition=kill.q
#SBATCH --exclusive
#SBATCH --nodes=10
#SBATCH --tasks-per-node=20
#SBATCH --cpus-per-task=1
#SBATCH --error=%A.err
#SBATCH --time=0-10:59:59 ## time format is DD-HH:MM:SS
#SBATCH --output=%A.out

export I_MPI_FABRICS=shm:tmi
export I_MPI_PMI_LIBRARY=/opt/local/slurm/default/lib64/

source /global/opt/intel_2016/mkl/bin/ intel64

module load intel_2016/ics intel_2016/impi

export MPIRUN_PATH="srun"
export MPIRUN_NPOPT="-n"
export INPUT="vim"


Thank you!

Please try the following input
permanent_dir /global/cscratch1/sd/apra/arar
memory stack 1300 mb heap 200 mb global 7000 mb

start im

title "im"
charge 1

geometry #units angstroms print xyz noautosym noautoz
 C                  -2.23423902     0.59425408    -0.03224283
 O                  -1.12129315     1.09129114    -0.09445519
 O                  -3.30588587     0.19083810     0.02028232
 Br                  1.41553615    -0.39477191     0.02227492
 H                  -0.18608027     0.45084374    -0.04234683

basis spherical
 C  library aug-cc-pvqz
 H  library aug-cc-pvqz
 O  library aug-cc-pvqz
 Br  library aug-cc-pvqz-pp

 Br  library aug-cc-pvqz-pp

  THRESH 1.0e-5
  TOL2E 1e-12

  FREEZE atomic
  tilesize 8
  attilesize 12
  thresh 1e-6
  maxiter 100

task tce

Thank you, Edoapra.

Just want to make sure I understand this correctly.

I should try to use 200 cores and each core should allocate the following amount of memory?
memory stack 1300 mb heap 200 mb global 7000 mb

You should use only 10 tasks-per-node for a total of 100 cores since you mentioned that you have 6GB/core
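
The reasoning behind running fewer tasks per node can be sketched as follows. The per-node limit is taken from the earlier slurmstepd message (assumed to be in KB), an NWChem "mb" is assumed to be 2**20 bytes, and max_tasks_per_node is an illustrative helper, not an NWChem or Slurm function.

```python
def max_tasks_per_node(node_limit_mb, stack_mb, heap_mb, global_mb):
    """Largest number of processes whose combined request fits on one node."""
    per_process_mb = stack_mb + heap_mb + global_mb
    return node_limit_mb // per_process_mb


node_limit_mb = 122880000 // 1024   # Slurm limit from the earlier error, in mb

# memory stack 1300 mb heap 200 mb global 7000 mb
upper_bound = max_tasks_per_node(node_limit_mb, 1300, 200, 7000)
print("per-process request (mb):", 1300 + 200 + 7000)
print("upper bound on tasks per node:", upper_bound)
```

The value printed is only an upper bound with no room left for the operating system or MPI buffers, so a smaller count such as the suggested 10 tasks per node is the safer choice.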

Thank you, Edoapra.

I managed to get more cores (20 nodes, 400 cores), so I was able to run without the MA allocation issue using the following memory line:
memory stack 1000 mb heap 200 mb global 4400 mb

Everything else in the input file is identical to your previous comment. It took about 5 minutes for the calculation to reach
tce_ao2e: fast2e=1
half-transformed integrals in memory

2-e (intermediate) file size = 105684005025
2-e (intermediate) file name = ./vim.v2i
Cpu & wall time / sec 144.8 184.8

tce_mo2e: fast2e=1
2-e integrals stored in memory

but the calculation has been hanging there for over eight hours; nothing has been written to the folder or the output file. I also noticed some vim.aoints.x files that do not seem to have been cleaned up properly. Is this behavior normal for a calculation of this size, or is this QZ calculation pushing the limits of NWChem?

Thanks again.

Unstable CCSD iterations
The test run in the previous comment actually went to the CCSD iterations part (each iteration takes about 1 hour wall time) but the iterations seem unstable. Please see below:

t2 file handle = -995

CCSD iterations
Iter Residuum Correlation Cpu Wall V2*C2
1 0.3745619466040 -1.0830661992146 1975.7 3034.0 759.3
2 0.3338130779425 -1.0377329617715 1992.9 3058.0 760.8
3 7.2614902105214 -1.0607684520852 1991.8 3049.5 762.0
4 60.1400573985661 -1.0597624893767 1986.2 3038.5 759.7
5 1384.5956104600380 -1.0695691959406 1993.2 3050.9 765.8

The geometry for this calculation was optimized at the CCSD(T)/aug-cc-pVTZ level, so the instability should not come from a bad geometry.

Thank you!

Did you use a spherical or cartesian basis?

I had a Cartesian basis and have now changed to spherical. I will update you on how this test goes.
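
To give a feel for how much the Cartesian/spherical choice changes the size of the problem, here is a small counting sketch. It assumes the standard aug-cc-pVQZ contractions [6s5p4d3f2g] for C and O and [5s4p3d2f] for H, and uses the [7s,6p,5d,3f,2g] contraction quoted in the input comment for Br. The names SHELLS, ATOMS and n_functions are made up for this illustration.

```python
# Angular momentum of each shell type.
L_OF = {"s": 0, "p": 1, "d": 2, "f": 3, "g": 4}

# Contracted shells per atom (assumed contractions, see the text above).
SHELLS = {
    "C": {"s": 6, "p": 5, "d": 4, "f": 3, "g": 2},
    "O": {"s": 6, "p": 5, "d": 4, "f": 3, "g": 2},
    "H": {"s": 5, "p": 4, "d": 3, "f": 2},
    "Br": {"s": 7, "p": 6, "d": 5, "f": 3, "g": 2},
}

ATOMS = ["C", "O", "O", "Br", "H"]  # the five atoms in the geometry block


def n_functions(atoms, spherical):
    """Total number of basis functions for the given list of atoms."""
    total = 0
    for atom in atoms:
        for shell, count in SHELLS[atom].items():
            l = L_OF[shell]
            # 2l+1 spherical components, (l+1)(l+2)/2 Cartesian components
            per_shell = 2 * l + 1 if spherical else (l + 1) * (l + 2) // 2
            total += count * per_shell
    return total


n_sph = n_functions(ATOMS, spherical=True)
n_cart = n_functions(ATOMS, spherical=False)
print("spherical functions:", n_sph)
print("Cartesian functions:", n_cart)
print("ratio of N**4 (rough integral count):", (n_cart / n_sph) ** 4)
```

Under these assumptions the Cartesian basis has noticeably more functions, and since the number of two-electron integrals grows roughly as the fourth power of the basis size, the storage demand grows much faster than the function count itself. The extra Cartesian components can also introduce near linear dependencies, which may be related to the unstable iterations, although the thread does not confirm this.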

Another problem just happened:

[25] Received an Error in Communication: (-991) 25:nga_get_common:cannot locate region: ./vim.r1.d1 [18591:18511 ,1:1 ]:
[212] Received an Error in Communication: (-991) 212:nga_get_common:cannot locate region: ./vim.r1.d1 [18526:18511 ,1:1 ]:
application called MPI_Abort(comm=0x84000000, -991) - process 212
[173] Received an Error in Communication: (-991) 173:nga_get_common:cannot locate region: ./vim.r1.d1 [18721:18511 ,1:1 ]:
application called MPI_Abort(comm=0x84000000, -991) - process 173
application called MPI_Abort(comm=0x84000000, -991) - process 25
[179] Received an Error in Communication: (-991) 179:nga_get_common:cannot locate region: ./vim.r1.d1 [18656:18511 ,1:1 ]:
application called MPI_Abort(comm=0x84000000, -991) - process 179
srun: error: prod2-0101: task 212: Exited with exit code 33
srun: error: prod2-0029: task 25: Exited with exit code 33
srun: error: prod2-0096: tasks 173,179: Exited with exit code 33

From what I can find online, this also seems to be memory-related, even though the MA 'test' passed and the CCSD iterations started. Does DIIS require additional memory?

Thank you

This calculation is very hardware-demanding. I have tried NWChem 6.8 on a Mac using aug-cc-pvdz.

Iterations converged
CCSD correlation energy / hartree = ...       
CCSD total energy / hartree = ...

Singles contributions

Doubles contributions
CCSD[T]  correction energy / hartree =       ...
CCSD[T] correlation energy / hartree = ...
CCSD(T) correction energy / hartree = ...
CCSD(T) correlation energy / hartree = ...
CCSD(T) total energy / hartree = ...

I have also tried aug-cc-pvtz with "ROHF" (and the other keywords added to the appropriate input blocks), which I think is adequate for many practical purposes.
I am afraid the original aug-cc-pvqz calculation may only succeed on a high-performance supercomputer with an officially supported NWChem installation, such as those at a US national laboratory.

NWChem 6.8 on the Mac gave

Iterations converged
CCSD correlation energy / hartree =  ...     
CCSD total energy / hartree = ...

Singles contributions


