Memory Problem in TDDFT Calculation using NWChem 6.1.1


Just Got Here
Dear All, I tried to perform a TDDFT calculation on a molecule of 124 atoms with 488 electrons at the B3LYP/6-31G* level (1158 AOs). The job was run on a workstation with 4 CPUs (48 cores in total), 128 GB of RAM and 5 HDDs of 2 TB each. This is the input deck I used:

echo
START PB_166
TITLE "TDDFT B3LYP/6-31G* PB_166"
memory heap 100 mb stack 1000 mb global 1000 mb
geometry units angstroms
CARBON         3.2593623488  -1.8892593764  -0.0894787160
CARBON -3.2593623488 1.8892593764 -0.0894787160
CARBON -2.7926481739 0.5667699406 -0.2553721756
CARBON 2.7926481739 -0.5667699406 -0.2553721756
CARBON -1.3944736870 0.2917225452 -0.1570899000
CARBON 1.3944736870 -0.2917225452 -0.1570899000
CARBON 0.4695910920 -1.3813690147 -0.0063436639
CARBON -0.4695910920 1.3813690147 -0.0063436639
CARBON 0.9773255072 -2.6733511057 0.2570667557
CARBON -0.9773255072 2.6733511057 0.2570667557
CARBON 2.3692061833 -2.8995829680 0.1993350845
CARBON -2.3692061833 2.8995829680 0.1993350845
CARBON -0.9589992126 -1.0752484664 -0.2194695571
CARBON 0.9589992126 1.0752484664 -0.2194695571
CARBON -3.7127238835 -0.4700449582 -0.5234397154
CARBON 3.7127238835 0.4700449582 -0.5234397154
CARBON -3.2563500645 -1.7634899097 -0.6915626710
CARBON 3.2563500645 1.7634899097 -0.6915626710
CARBON -1.9002769128 -2.0634516737 -0.5278175573
CARBON 1.9002769128 2.0634516737 -0.5278175573
CARBON -4.6985971253 2.2158772668 -0.1773315252
CARBON 4.6985971253 -2.2158772668 -0.1773315252
NITROGEN -5.5640030722 1.1437720602 -0.4611081692
NITROGEN 5.5640030722 -1.1437720602 -0.4611081692
CARBON -5.1616118913 -0.2042474749 -0.6488670978
CARBON 5.1616118913 0.2042474749 -0.6488670978
CARBON -6.9901907994 1.4291743262 -0.5696580305
CARBON 6.9901907994 -1.4291743262 -0.5696580305
CARBON -7.5226614334 1.7756191846 -1.8232994341
CARBON 7.5226614334 -1.7756191846 -1.8232994341
CARBON -8.8970755269 2.0263340006 -1.9099279667
CARBON 8.8970755269 -2.0263340006 -1.9099279667
CARBON -9.7109450781 1.9358613719 -0.7835536110
CARBON 9.7109450781 -1.9358613719 -0.7835536110
CARBON -9.1611152029 1.6059619342 0.4527283456
CARBON 9.1611152029 -1.6059619342 0.4527283456
CARBON -7.7918634839 1.3459009154 0.5818634389
CARBON 7.7918634839 -1.3459009154 0.5818634389
CARBON 0.1428367894 -3.8409708502 0.6659093569
CARBON -0.1428367894 3.8409708502 0.6659093569
CARBON 0.1747853200 -5.0403770944 -0.0687302078
CARBON -0.1747853200 5.0403770944 -0.0687302078
CARBON -0.5658948364 -6.1416651146 0.3379799031
CARBON 0.5658948364 6.1416651146 0.3379799031
CARBON -1.3510424808 -6.0815073593 1.4994059874
CARBON 1.3510424808 6.0815073593 1.4994059874
CARBON -1.3754632525 -4.8970247768 2.2456454811
CARBON 1.3754632525 4.8970247768 2.2456454811
CARBON -0.6353726211 -3.7916743849 1.8292129681
CARBON 0.6353726211 3.7916743849 1.8292129681
CARBON -7.1802943259 1.0744724459 1.9550249719
CARBON 7.1802943259 -1.0744724459 1.9550249719
CARBON -6.9922187839 2.4291238870 2.6916012476
CARBON 6.9922187839 -2.4291238870 2.6916012476
CARBON -8.0153384136 0.0888412047 2.8027367733
CARBON 8.0153384136 -0.0888412047 2.8027367733
CARBON -6.6213059790 1.9719338033 -3.0408348761
CARBON 6.6213059790 -1.9719338033 -3.0408348761
CARBON -7.2188988461 1.3782266105 -4.3361458258
CARBON 7.2188988461 -1.3782266105 -4.3361458258
CARBON -6.3321915977 3.4897769504 -3.2029782515
CARBON 6.3321915977 -3.4897769504 -3.2029782515
OXYGEN -2.0487336823 -7.2306317435 1.8148895163
OXYGEN 2.0487336823 7.2306317435 1.8148895163
CARBON -2.8862683489 -7.2244397281 3.0125202682
CARBON 2.8862683489 7.2244397281 3.0125202682
OXYGEN -5.9868713061 -1.0894713451 -0.8999664913
OXYGEN 5.9868713061 1.0894713451 -0.8999664913
OXYGEN 5.1306971980 -3.3698489594 -0.0115046054
OXYGEN -5.1306971980 3.3698489594 -0.0115046054
HYDROGEN 2.7610226683 -3.8910556460 0.3887768618
HYDROGEN -2.7610226683 3.8910556460 0.3887768618
HYDROGEN -3.9739725269 -2.5377333729 -0.9313730981
HYDROGEN 3.9739725269 2.5377333729 -0.9313730981
HYDROGEN -1.5777173701 -3.0864013999 -0.6291200936
HYDROGEN 1.5777173701 3.0864013999 -0.6291200936
HYDROGEN -9.3303033301 2.2988264405 -2.8642220154
HYDROGEN 9.3303033301 -2.2988264405 -2.8642220154
HYDROGEN -10.7738923321 2.1319777109 -0.8674824826
HYDROGEN 10.7738923321 -2.1319777109 -0.8674824826
HYDROGEN -9.7989108611 1.5547033918 1.3263264928
HYDROGEN 9.7989108611 -1.5547033918 1.3263264928
HYDROGEN 0.7737081962 -5.0979816565 -0.9705805093
HYDROGEN -0.7737081962 5.0979816565 -0.9705805093
HYDROGEN -0.5581945138 -7.0665670931 -0.2230100521
HYDROGEN 0.5581945138 7.0665670931 -0.2230100521
HYDROGEN -1.9611691692 -4.8275038497 3.1521817279
HYDROGEN 1.9611691692 4.8275038497 3.1521817279
HYDROGEN -0.6615299883 -2.8791033556 2.4126597947
HYDROGEN 0.6615299883 2.8791033556 2.4126597947
HYDROGEN -6.1919261949 0.6310951637 1.8034301693
HYDROGEN 6.1919261949 -0.6310951637 1.8034301693
HYDROGEN -6.4820413902 2.2767312976 3.6503269121
HYDROGEN 6.4820413902 -2.2767312976 3.6503269121
HYDROGEN -6.4043222673 3.1117586012 2.0713120330
HYDROGEN 6.4043222673 -3.1117586012 2.0713120330
HYDROGEN -7.9694465516 2.8860288974 2.8843527574
HYDROGEN 7.9694465516 -2.8860288974 2.8843527574
HYDROGEN -7.4954764107 -0.1188951546 3.7450872602
HYDROGEN 7.4954764107 0.1188951546 3.7450872602
HYDROGEN -8.9960342771 0.5102250480 3.0481768046
HYDROGEN 8.9960342771 -0.5102250480 3.0481768046
HYDROGEN -8.1638216525 -0.8546151965 2.2673842319
HYDROGEN 8.1638216525 0.8546151965 2.2673842319
HYDROGEN -5.6722477685 1.4646077386 -2.8454093509
HYDROGEN 5.6722477685 -1.4646077386 -2.8454093509
HYDROGEN -6.4988451482 1.4822194466 -5.1558116842
HYDROGEN 6.4988451482 -1.4822194466 -5.1558116842
HYDROGEN -7.4520691544 0.3164879704 -4.2060273377
HYDROGEN 7.4520691544 -0.3164879704 -4.2060273377
HYDROGEN -8.1339532424 1.9056419745 -4.6259165433
HYDROGEN 8.1339532424 -1.9056419745 -4.6259165433
HYDROGEN -5.6227350238 3.6592272990 -4.0218459547
HYDROGEN 5.6227350238 -3.6592272990 -4.0218459547
HYDROGEN -7.2627913122 4.0217761026 -3.4307826487
HYDROGEN 7.2627913122 -4.0217761026 -3.4307826487
HYDROGEN -5.9187339480 3.8925177666 -2.2738657114
HYDROGEN 5.9187339480 -3.8925177666 -2.2738657114
HYDROGEN -3.3307439871 -8.2188446571 3.0482985312
HYDROGEN 3.3307439871 8.2188446571 3.0482985312
HYDROGEN -2.2922203656 -7.0516634523 3.9177694867
HYDROGEN 2.2922203656 7.0516634523 3.9177694867
HYDROGEN -3.6767177395 -6.4675081919 2.9461738576
HYDROGEN 3.6767177395 6.4675081919 2.9461738576
end
BASIS
* library 6-31G*
END
DFT
direct
iterations 100
XC B3LYP
vectors input PB_166.movecs
END
TDDFT
NROOTS 30
notriplet
thresh 1.0d-05
maxiter 200
END
task tddft energy
permanent_dir /media/DSK_1/pernwc
scratch_dir /media/DSK_2/scrnwc

After the SCF cycles complete, as soon as the code enters the TDDFT procedure, the following error occurs:

ARMCI INFO
The application attempted to allocate a shared memory segment of 5045551104 bytes in size. This might be in addition to segments that were allocated succesfully previously. The current system configuration does not allow enough shared memory to be allocated to the application.
This is most often caused by:
1) system parameter SHMMAX (largest shared memory segment) being too small or
2) insufficient swap space.
Please ask your system administrator to verify if SHMMAX matches the amount of memory needed by your application and the system has sufficient amount of swap space. Most UNIX systems can be easily reconfigured to allow larger shared memory segments,
see http://www.emsl.pnl.gov/docs/global/support.html
In some cases, the problem might be caused by insufficient swap space.

0:allocate: failed to create shared region : -1
(rank:0 hostname:gundam pid:10579):ARMCI DASSERT fail. ../../ga-5-1/armci/src/memory/shmem.c:armci_allocate():1117 cond:0

I was running the task on 24 cores with kernel.shmmax = 8589934592 (8192 MB)
and ARMCI_DEFAULT_SHMMAX=4096 (I also tried ARMCI_DEFAULT_SHMMAX=8192, but nothing changed).


If someone could point me in the right direction, I would really appreciate it.
Thanks a lot,
Davide

Clicked A Few Times
I have the same problem, so I thought I would post it here instead of creating a new thread.

I'm trying to do a TDDFT (B3LYP/6-31G**) calculation on a relatively large molecule (134) but I keep getting the error shown above. I'm running the job on 24 nodes each with 8 cores and 12 GB memory. Here is my input file:
start oct 
scratch_dir /scr/NWCHEM
title "DFT OPT / TDDFT"
memory heap 100 mb stack 500 mb global 500 mb
geometry units angstrom noautoz noautosym nocenter
 C                 15.71888000   -0.68451000   -0.31755000
 C                 26.58441000   -0.66861000    0.22900000
.
.
.
 H                 75.99113296    0.45964825   -0.44180755
 H                 75.41213659    1.77885775   -1.43050230
 H                 75.50769908    1.96384452    0.30435051
end
basis
 * library 6-31G**
end
dft
 direct
 iterations 300
 xc B3LYP
end
driver
maxiter 2000
end
#task dft optimize
 
TDDFT
 NROOTS 20
 SINGLET
 TRIPLET
END
task TDDFT ENERGY


I have even tried 50 nodes but still get the error. I believe this calculation should be possible...

Forum Vet
Davide,

The code clearly tries to allocate more than 8 Gbyte of shared memory, and in your NWChem input you ask for up to 24 x 1 Gbyte of shared memory to be available.

One thing you can try is to see whether the calculation runs with a maximum of 8 cores per node, which would force the calculation to stay within the shmmax limit you have set.
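
A rough check of the arithmetic behind this suggestion. It assumes that every process on the node may claim up to the `global` value from the `memory` line as shared memory, and that NWChem's "mb" means 1024 x 1024 bytes. The function name is purely illustrative and is not part of NWChem or ARMCI.

```python
MB = 1024 * 1024  # assumption: NWChem "mb" is 1024*1024 bytes


def global_request_bytes(procs_per_node, global_mb):
    """Upper bound on shared memory requested on one node (hypothetical helper)."""
    return procs_per_node * global_mb * MB


# Values taken from the first post: "global 1000 mb", kernel.shmmax = 8589934592
kernel_shmmax = 8589934592

for procs in (24, 8):
    need = global_request_bytes(procs, 1000)
    verdict = "fits within" if need <= kernel_shmmax else "exceeds"
    print(f"{procs} processes: {need} bytes {verdict} kernel.shmmax")
```

With 24 processes the request is roughly three times the 8192 MB limit, while with 8 processes it stays just below it.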

Alternatively, you could increase kernel.shmmax to 24 Gbyte. This will probably lead to the code generating multiple shared memory segments of size ARMCI_DEFAULT_SHMMAX. The development version addresses this issue, as well as another issue that I will come back to with Mef262.

Mef262,

What are your kernel.shmmax and ARMCI_DEFAULT_SHMMAX settings? In your case, you should have the kernel.shmmax set to over 4 Gbyte and have ARMCI_DEFAULT_SHMMAX=4096.

Both,

We have seen some large memory usage in the TDDFT in NWChem 6.1 versions. We have addressed this in the TDDFT module and the global array parallel infrastructure of the development version that is being prepared for release as we speak.

Bert

Clicked A Few Times
I got the same error as Davide on a system with the following settings

16 cores with 32 GB memory per node
kernel.shmmax = 1073741824
ARMCI_DEFAULT_SHMMAX=8092

I tried even using 100 nodes.
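
A small unit check on these settings, assuming (as the numbers earlier in this thread suggest) that kernel.shmmax is given in bytes while ARMCI_DEFAULT_SHMMAX is given in megabytes. The helper name is hypothetical.

```python
MB = 1024 * 1024


def shmmax_in_mb(shmmax_bytes):
    """Convert a kernel.shmmax value (bytes) to whole megabytes."""
    return shmmax_bytes // MB


# Settings quoted in this post for the two clusters
armci_default_shmmax_mb = 8092
for kernel_shmmax in (1073741824, 68719476736):
    limit_mb = shmmax_in_mb(kernel_shmmax)
    ok = limit_mb >= armci_default_shmmax_mb
    print(f"kernel.shmmax = {limit_mb} MB, covers ARMCI_DEFAULT_SHMMAX: {ok}")
```

Under that assumption, the first cluster's kernel limit is far smaller than the segment size ARMCI is told it may use, whereas the second cluster's limit is not the constraint.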


I tried on a different cluster with the following settings:

8 cores with 12 GB memory per node
kernel.shmmax = 68719476736
ARMCI_DEFAULT_SHMMAX=8092

and get the following error:

  Entering Davidson iterations
  Restricted singlet excited states
 
  Iter   NTrls   NConv    DeltaV     DeltaE      Time   
  ----  ------  ------  ---------  ---------  --------- 
    1     20       0     0.21E+00   0.10+100      354.2
    2     60       2     0.81E-01   0.82E-02      654.3
    3     94       0     0.77E-01   0.66E-02      604.7
    4    134       0     0.39E-01   0.90E-02      763.2
    5    174       0     0.89E-01   0.42E-02      746.7
    6    214       0     0.90E-01   0.10E-01      760.4
    7    254       0     0.86E-01   0.11E-01      776.5
0:Terminate signal was sent, status=: 15
(rank:0 hostname:rs538 pid:32755):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0


I also get lots of the following errors in my slurm file:

                  1510
                   142  rep  failed on Work product                     ndim 
                     3  dims                     37                  1510
                  1510
                   136  rep  failed on Work product                     ndim 
                     3  dims                     37                  1510
                  1510
                    76  rep  failed on Work product                     ndim 
                     3  dims                     37                  1510
                  1510
                   140  rep  failed on Work product                     ndim 
                     3  dims                     37                  1510
                  1510

Last System Error Message from Task 68:: Numerical result out of range
Last System Error Message from Task 71:: Numerical result out of range
Last System Error Message from Task 66:: Numerical result out of range
Last System Error Message from Task 69:: Invalid argument
Last System Error Message from Task 67:: Numerical result out of range
Last System Error Message from Task 70:: Numerical result out of range
Last System Error Message from Task 64:: Numerical result out of range
Last System Error Message from Task 80:: Numerical result out of range
.
.
.
Last System Error Message from Task 106:: Numerical result out of range

Clicked A Few Times
When do you think the new version will be available?

Also, I'm willing to be a Beta tester if needed; mainly because I hear the new version should work on Qlogic IB.

Just Got Here
Thank you very much, Bert, for your kind reply.
I will try both of your suggestions (running on only 8 cores, and increasing kernel.shmmax to 24 Gbyte or more).
If neither helps, I will wait for the new release, which addresses this issue.
All the best,

     Davide

Quote: Bert, Apr 17th 2:47 pm (reply quoted in full; see Bert's post above)

Just Got Here
Dear All, I have been experimenting with different values of kernel.shmmax
and ARMCI_DEFAULT_SHMMAX. I used the same input deck as in the first post,
but all tests were run on only 8 cores (as suggested by Bert).

Test 1): kernel.shmmax = 8589934592 (8192 MB), ARMCI_DEFAULT_SHMMAX=8192

The error obtained (after the dft-scf procedure) is:
************************ ARMCI INFO ************************
The application attempted to allocate a shared memory segment of 4294967296 bytes in size. This might be in addition to segments that were allocated succesfully previously. The current system configuration does not allow enough shared memory to be allocated to the application.
This is most often caused by:
1) system parameter SHMMAX (largest shared memory segment) being too small or
2) insufficient swap space.
Please ask your system administrator to verify if SHMMAX matches the amount of memory needed by your application and the system has sufficient amount of swap space. Most UNIX systems can be easily reconfigured to allow larger shared memory segments,
see http://www.emsl.pnl.gov/docs/global/support.html
In some cases, the problem might be caused by insufficient swap space.

0:allocate: failed to create shared region : -1
(rank:0 hostname:gundam pid:12926):ARMCI DASSERT fail. ../../ga-5-1/armci/src/memory/shmem.c:armci_allocate():1117 cond:0

Test 2): kernel.shmmax = 25769803776 (24576 MB), ARMCI_DEFAULT_SHMMAX=8192

The error obtained (after the dft-scf procedure) is:

************************ ARMCI INFO ************************
The application attempted to allocate a shared memory segment of 4294967296 bytes in size. This might be in addition to segments that were allocated succesfully previously. The current system configuration does not allow enough shared memory to be allocated to the application.
This is most often caused by:
1) system parameter SHMMAX (largest shared memory segment) being too small or
2) insufficient swap space.
Please ask your system administrator to verify if SHMMAX matches the amount of memory needed by your application and the system has sufficient amount of swap space. Most UNIX systems can be easily reconfigured to allow larger shared memory segments,
see http://www.emsl.pnl.gov/docs/global/support.html
In some cases, the problem might be caused by insufficient swap space.

0:allocate: failed to create shared region : -1
(rank:0 hostname:gundam pid:12981):ARMCI DASSERT fail. ../../ga-5-1/armci/src/memory/shmem.c:armci_allocate():1117 cond:0

Test 3): kernel.shmmax = 25769803776 (24576 MB), ARMCI_DEFAULT_SHMMAX=24576

The error obtained (after the 1st Davidson Iteration in the TDDFT procedure) is:

 Iter   NTrls   NConv    DeltaV     DeltaE      Time   
---- ------ ------ --------- --------- ---------
1 30 0 0.30E+00 0.10+100 3627.0
ga_create_atom_blocked: gdens1
------------------------------------------------------------------------
ga_create_atom_blocked: ga_create_irreg 1158

 current input line : 
154: task tddft energy
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
0:0:ga_create_atom_blocked: ga_create_irreg:: 1158
(rank:0 hostname:gundam pid:13005):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
ga_create_atom_blocked: ga_create_irreg      1158

 current input line : 
0:
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
2:2:ga_create_atom_blocked: ga_create_irreg:: 1158
ga_create_atom_blocked: ga_create_irreg      1158
current input line :
0:
For more information see the NWChem manual at
ga_create_atom_blocked: ga_create_irreg 1158
current input line :
0:
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section
ga_create_atom_blocked: ga_create_irreg 1158
current input line :
0:
For more information see the NWChem manual at
ga_create_atom_blocked: ga_create_irreg 1158

(rank:2 hostname:gundam pid:13007):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
3:3:ga_create_atom_blocked: ga_create_irreg:: 1158
(rank:3 hostname:gundam pid:13008):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
4:4:ga_create_atom_blocked: ga_create_irreg:: 1158
(rank:4 hostname:gundam pid:13009):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
6:6:ga_create_atom_blocked: ga_create_irreg:: 1158
 current input line : 
0:
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
1:1:ga_create_atom_blocked: ga_create_irreg:: 1158
ga_create_atom_blocked: ga_create_irreg      1158
current input line :
0:
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
(rank:6 hostname:gundam pid:13011):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
(rank:1 hostname:gundam pid:13006):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
5:5:ga_create_atom_blocked: ga_create_irreg:: 1158
ga_create_atom_blocked: ga_create_irreg      1158
current input line :
(rank:5 hostname:gundam pid:13010):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
    0: 
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
7:7:ga_create_atom_blocked: ga_create_irreg:: 1158
(rank:7 hostname:gundam pid:13012):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0

All the best,
                   Davide

Forum Vet
The error in the last message indicates that you are running out of global memory. You could now try increasing the number of processors (maybe doubling it) and see what happens.
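
A back-of-the-envelope sketch of why more processes help here. It assumes the global-array pool is simply the number of processes times the per-process `global` setting, and it uses the 1158 x 1158 dimension from the failing `ga_create_atom_blocked` call only as an example of a double-precision matrix size. How many such arrays TDDFT actually allocates is not known from this thread, and the helper names are hypothetical.

```python
MB = 1024 * 1024  # assumption: NWChem "mb" is 1024*1024 bytes


def aggregate_global_mb(nprocs, global_mb_per_proc):
    """Total global-array pool across all processes, in MB."""
    return nprocs * global_mb_per_proc


def matrix_bytes(nbf):
    """Size of one nbf x nbf double-precision matrix in bytes."""
    return nbf * nbf * 8


one_matrix = matrix_bytes(1158)  # 1158 AOs, from the first post
for nprocs in (8, 16):
    pool_mb = aggregate_global_mb(nprocs, 1000)  # "global 1000 mb"
    fits = (pool_mb * MB) // one_matrix
    print(f"{nprocs} processes: pool {pool_mb} MB, room for about {fits} such matrices")
```

Doubling the process count doubles the pool, provided the shared memory limits on the node allow each process to obtain its share.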

Bert

Just Got Here
Dear Bert, I tried to use 16 cores (instead of 8) but the job ended just after the scf-dft cycles with the "first type" of error:
************************ ARMCI INFO ************************
The application attempted to allocate a shared memory segment of 3157917696 bytes in size. This might be in addition to segments that were allocated succesfully previously. The current system configuration does not allow enough shared memory to be allocated to the application.
This is most often caused by:
1) system parameter SHMMAX (largest shared memory segment) being too small or
2) insufficient swap space.
Please ask your system administrator to verify if SHMMAX matches the amount of memory needed by your application and the system has sufficient amount of swap space. Most UNIX systems can be easily reconfigured to allow larger shared memory segments,
see http://www.emsl.pnl.gov/docs/global/support.html
In some cases, the problem might be caused by insufficient swap space.

0:allocate: failed to create shared region : -1
(rank:0 hostname:gundam pid:13850):ARMCI DASSERT fail. ../../ga-5-1/armci/src/memory/shmem.c:armci_allocate():1117 cond:0

Thank you very much for your support.

    Davide

Quote: Bert, Apr 18th 6:08 pm (reply quoted in full; see Bert's post above)

