EOM-CCSDT right-hand side iterations ga create failure


Clicked A Few Times
Dear Bert/all,

even though one of my problems has been resolved (multi-CPU convergence), another one has arisen. ;-(

A user of our infrastructure has reported that the following computation crashes with a "failed ga_create" error (see details below).

The computation:

start cis1

memory 10000 mb

geometry units bohr
S 0.000000000 0.073468702 -1.957631386
N 0.000000000 -1.017192265 1.304155528
O 0.000000000 0.582796470 2.868174306
H 0.000000000 2.547479474 -1.383337299
end

basis
S library aug-cc-pVDZ
N library aug-cc-pVDZ
O library aug-cc-pVDZ
H library aug-cc-pVDZ
end

scf
singlet
rhf
end

tce
scf
ccsdt
freeze core atomic
nroots 1
targetsym a'
symmetry
thresh 1.0d-5
dipole
end

task tce energy


The computation was run on a single node with 4 dedicated CPUs and 100 GB of memory. The CPU was an Intel Xeon E7-2860 @ 2.27GHz. During the "EOM-CCSDT right-hand side iterations", the computation failed with the following error:


Iteration 5 using 5 trial vectors
 available GA memory             299908168  bytes
createfile: failed ga_create size=162410843
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
...
Last System Error Message from Task 0:: Illegal seek
Last System Error Message from Task 3:: Illegal seek
Last System Error Message from Task 1:: Illegal seek
Last System Error Message from Task 2:: Illegal seek
3:3:createfile: failed ga_create size=:: 162410843
(rank:3 hostname:zewura1.cerit-sc.cz pid:21865):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
1:1:createfile: failed ga_create size=:: 162410843
(rank:1 hostname:zewura1.cerit-sc.cz pid:21863):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
2:2:createfile: failed ga_create size=:: 162410843
(rank:2 hostname:zewura1.cerit-sc.cz pid:21864):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
0:0:createfile: failed ga_create size=:: 162410843
(rank:0 hostname:zewura1.cerit-sc.cz pid:21862):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0


The full computation output is available here...

Please, is there anybody who can give me a hint on how to resolve the issue?
Thanks a lot for any advice...

--best
Tom, Czech Republic.

PS: If one needs our compilation options, see them in this thread...

Forum Vet
Tom,

These coupled cluster response calculations are large, and challenging for only 4 processors. What the error message tells you is that the code ran out of memory. There are two suggestions I can give:

1. Play with the memory configuration. 10000 mb puts 2500mb in stack, 2500mb in heap and 5000mb in global. Options would be:

    memory stack 2000 mb heap 100 mb global 8000 mb

    memory stack 2500 mb heap 100 mb global 10000 mb

    etc.

2. To reduce stack size requirements, you can use the "tilesize" keyword in the tce block


   tilesize 10    (or smaller)


Bert
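
To make the arithmetic behind the memory suggestion concrete, here is a rough sketch in Python. It reproduces the default split quoted above (25% stack, 25% heap, 50% global) and compares the failing ga_create request with the "available GA memory" line from the error output. Two things are assumptions rather than facts from this thread: that the reported size counts 8-byte double-precision elements, and that the array is spread evenly over the 4 processes. The function names are made up for illustration.

```python
def split_memory(total_mb):
    """Default split described above: 25% stack, 25% heap, 50% global."""
    return {
        "stack": total_mb // 4,
        "heap": total_mb // 4,
        "global": total_mb // 2,
    }


def ga_fits(n_elements, n_procs, avail_bytes_per_proc, bytes_per_element=8):
    """Check whether one process's share of a global array fits.

    Assumptions (not confirmed in the thread): elements are 8-byte
    doubles and the array is distributed evenly over all processes.
    """
    total_bytes = n_elements * bytes_per_element
    per_proc = -(-total_bytes // n_procs)  # ceiling division
    return per_proc <= avail_bytes_per_proc, per_proc


if __name__ == "__main__":
    # "memory 10000 mb" from the original input
    print(split_memory(10000))
    # size and available GA memory taken from the error output above
    fits, per_proc = ga_fits(162410843, 4, 299908168)
    print(fits, per_proc)
```

Under those assumptions, each process would need about 325 MB for the new array while only about 300 MB of global memory is still free, which would be consistent with the ga_create failure and with the advice to shift memory from heap to global.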




Clicked A Few Times
Dear Bert/all

Quote:Bert Aug 15th 10:16 am

Tom,

These coupled cluster response calculations are large, and challenging for only 4 processors. What the error message tells you is that the code ran out of memory. There are two suggestions I can give:

1. Play with the memory configuration. 10000 mb puts 2500mb in stack, 2500mb in heap and 5000mb in global. Options would be:

    etc.

2. To reduce stack size requirements, you can use the "tilesize" keyword in the tce block

   tilesize 10    (or smaller)


thanks a lot for your advice. However, I have tried varying the memory as well as the tilesize values and have not managed to complete the above computation. I tested on a machine with 80 CPUs and 512 GB of memory: no matter which values I set (I tried around 20 combinations), it fails in one of the iterations (a different one for different values).

Would somebody be so kind as to try running the above computation and let me know whether it succeeded (and which values he/she used)?

I would appreciate it a lot...

Thanks in advance.
Tom.

Forum Vet
Tom,
I was able to run your input.
Here is the input and snippets of the output.
Cheers, Edo

start cis1
memory stack 1500 mb heap 100 mb global 2000 mb noverify

geometry units bohr
S 0.000000000 0.073468702 -1.957631386
N 0.000000000 -1.017192265 1.304155528
O 0.000000000 0.582796470 2.868174306
H 0.000000000 2.547479474 -1.383337299
end

basis spherical
S library aug-cc-pVDZ
N library aug-cc-pVDZ
O library aug-cc-pVDZ
H library aug-cc-pVDZ
end

scf
singlet
rhf
end

tce
tilesize 8
scf
ccsdt
freeze core atomic
nroots 1
targetsym a'
symmetry
thresh 1.0d-5
dipole
end
task tce energy


  date          = Wed Sep 12 16:12:45 2012
  compiled      = Wed_May_30_15:23:08_2012
  source        = /pic/people/edo/nwchem-6.1
  nwchem branch = 6.1
  input         = nwchem.nw
  prefix        = cis1.
  data base     = ./cis1.db
  status        = startup
  nproc         = 256




No. of initial right vectors 1

EOM-CCSDT right-hand side iterations
--------------------------------------------------------------
Residuum Omega / hartree Omega / eV Cpu Wall
--------------------------------------------------------------

Iteration   1 using    1 trial vectors
0.6192561515519 0.3965867791628 10.79168 95.5 99.9

Iteration   2 using    2 trial vectors
0.3624777496309 0.2629666980694 7.15569 97.3 101.6

Iteration   3 using    3 trial vectors
0.1561541438865 0.2582011459083 7.02601 98.4 102.7

Iteration   4 using    4 trial vectors
0.1276825777206 0.2513184634665 6.83873 99.5 103.8

Iteration   5 using    5 trial vectors
0.1186184203433 0.2376177392172 6.46591 100.6 105.2

Iteration   6 using    6 trial vectors
0.1188635472437 0.2290289554531 6.23220 101.6 106.1

Iteration   7 using    7 trial vectors
0.1291611505929 0.2197235394909 5.97898 102.9 107.4

Iteration   8 using    8 trial vectors
0.1468880320698 0.2060810964832 5.60775 104.4 108.9

Iteration   9 using    9 trial vectors
0.1251686314517 0.1930716802431 5.25375 105.4 109.9

Iteration  10 using   10 trial vectors
0.1100535263580 0.1861806499752 5.06624 106.4 110.9

Iteration  11 using   11 trial vectors
0.1218304773119 0.1778242760306 4.83885 107.4 112.0

Iteration  12 using   12 trial vectors
0.1497078933265 0.1651713555468 4.49454 109.5 114.1

Iteration  13 using   13 trial vectors
0.1210019112423 0.1521438861275 4.14005 110.0 114.7

Iteration  14 using   14 trial vectors
0.0736948137376 0.1468017646392 3.99468 111.2 115.9

Iteration  15 using   15 trial vectors
0.0406155889687 0.1450461088342 3.94691 112.9 117.5

Iteration  16 using   16 trial vectors

Clicked A Few Times
Dear Edo,

Quote:Edoapra Sep 12th 5:43 pm

I was able to run your input.

...
No. of initial right vectors 1

EOM-CCSDT right-hand side iterations
--------------------------------------------------------------
Residuum Omega / hartree Omega / eV Cpu Wall
--------------------------------------------------------------

Iteration   1 using    1 trial vectors
...
Iteration  16 using   16 trial vectors


Did the computation finish? I tried your memory settings on a node with 80 CPUs (512 GB of memory); the computation reached iteration 16 and then failed (in the same fashion as described above). ;-( I tried both NWChem 6.0 and 6.1.1, and both failed at the same point of the computation. ;-(

So, were there any further iterations or a summary in the output of your computation?

Thank you VERY MUCH for your time to help me -- I really appreciate it.

--best
Tom Rebok.

Forum Vet
Tom
The calculation ran to completion for me.
What values of ARMCI_DEFAULT_SHMMAX are you using?
Could you please post (or put on a reachable website) your output and error files?
Edo
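
For anyone collecting the values asked about here, the following Python sketch reports the ARMCI_DEFAULT_SHMMAX setting next to the Linux kernel's shared-memory segment limit. It assumes that ARMCI_DEFAULT_SHMMAX is given in megabytes and that the kernel limit can be read from /proc/sys/kernel/shmmax (in bytes); neither is stated in this thread, and the helper names are invented for illustration. It only gathers the numbers and does not judge whether they are adequate.

```python
import os


def read_kernel_shmmax(path="/proc/sys/kernel/shmmax"):
    """Return the kernel shmmax in bytes, or None if it cannot be read."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def shm_report(env, kernel_shmmax_bytes):
    """Collect both limits in bytes.

    Assumption: ARMCI_DEFAULT_SHMMAX is expressed in megabytes.
    """
    raw = env.get("ARMCI_DEFAULT_SHMMAX")
    armci_bytes = int(raw) * 1024 * 1024 if raw is not None else None
    return {
        "ARMCI_DEFAULT_SHMMAX_bytes": armci_bytes,
        "kernel_shmmax_bytes": kernel_shmmax_bytes,
    }


if __name__ == "__main__":
    # Run in the same environment (same shell or job script) as NWChem
    print(shm_report(os.environ, read_kernel_shmmax()))
```

Running it inside the batch job rather than on a login node matters, since the environment and kernel settings may differ between the two.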

