ROHF CCSD(T) memory problems in NWChem 6.3


Just Got Here
Hi,

I've been trying to run an open-shell (quintet) CCSD(T) calculation with an ROHF reference.

memory stack 1500 mb heap 100 mb global 5600 mb
echo
charge 2
geometry noautoz noautosym nocenter units angstrom
Fe       0.640891400     -0.091179700      0.459135400
O       -0.479888200     -1.229287500      0.264496600
N        0.450425300     -0.271719000      2.659391900
N       -0.942399400      1.441011600      0.380142000
N        0.889432200      0.044805500     -1.738012800
N        2.216942800     -1.650999800      0.495864000
N        2.122883200      1.462358400      0.730912200
H        1.293533800     -0.537836300      3.166108200
H        0.083470500      0.541015900      3.152989200
H       -0.726314700      2.369378300      0.019886000
H       -1.678812200      1.056483800     -0.211842200
H        1.832779500      0.116843100     -2.116630900
H        0.488247300     -0.814044200     -2.114534000
H        2.821755300     -1.690239400      1.314985400
H        1.715050300     -2.538712300      0.471196400
H        2.200921300      2.090459400     -0.068261300
H        1.930808300      2.065821800      1.529813200
H       -0.223218800     -1.022213800      2.812946500
H       -1.388183600      1.589415300      1.284661200
H        3.065616400      1.105888700      0.883979400
H        0.361202800      0.801804300     -2.170662100
H        2.843606000     -1.663079500     -0.307700500
C       -3.576064400     -3.467814800     -0.743564300
H24       -2.711758300     -2.864056000     -0.457486600
H25       -3.925529100     -3.168379800     -1.730479300
H26       -4.376115100     -3.326033100     -0.018840500
H27       -3.292576400     -4.518696300     -0.765326900
end

BASIS SPHERICAL 
Fe library def2-TZVPP file /hpc/sw/nwchem-6.3-intel-impi/data/libraries/
O library def2-TZVPP file /hpc/sw/nwchem-6.3-intel-impi/data/libraries/
C library def2-TZVPP file /hpc/sw/nwchem-6.3-intel-impi/data/libraries/
H24 library def2-TZVPP file /hpc/sw/nwchem-6.3-intel-impi/data/libraries/ 
H25 library def2-TZVP file /hpc/sw/nwchem-6.3-intel-impi/data/libraries/ 
H26 library def2-TZVP file /hpc/sw/nwchem-6.3-intel-impi/data/libraries/ 
H27 library def2-TZVP file /hpc/sw/nwchem-6.3-intel-impi/data/libraries/ 
H library def2-SVP file /hpc/sw/nwchem-6.3-intel-impi/data/libraries/ 
N library def2-SVP file /hpc/sw/nwchem-6.3-intel-impi/data/libraries/
END

scf
 thresh 1.0e-9
 tol2e 1.0e-9
 QUINTET
 ROHF
 maxiter 900
 vectors output rc.movecs
end

TCE
 SCF
 FREEZE CORE ATOMIC
 CCSD(T) 
 PRINT t1
 tilesize 35
 attilesize 40
END

TASK TCE ENERGY


Getting the following message at the end:

(rank:0 hostname:fcn19 pid:23365):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))


and a system message

Last System Error Message from Task 0:: Cannot allocate memory


The calculation was performed on 2 nodes, each with four 8-core CPUs (64 MPI processes in total) and 256 GB of memory per node. The nodes are connected via InfiniBand.

Below I paste the output.

Any suggestions appreciated.
Nick


           Job information
           ---------------

    hostname        = fcn19
    program         = nwchem
    date            = Sun Jun 16 17:28:32 2013

    compiled        = Fri_May_24_16:47:23_2013
    source          = /scratch-local/tmp.RPlNAYnxqq/nwchem-6.3-src.2013-05-17
    nwchem branch   = 6.3
    nwchem revision = 24252
    ga revision     = N/A
    input           = rc.nw
    prefix          = rc.
    data base       = ./rc.db
    status          = startup
    nproc           =       64
    time left       =     -1s



           Memory information
           ------------------

    heap     =   13107201 doubles =    100.0 Mbytes
    stack    =  196608001 doubles =   1500.0 Mbytes
    global   =  734003200 doubles =   5600.0 Mbytes (distinct from heap & stack)
    total    =  943718402 doubles =   7200.0 Mbytes
    verify   = yes
    hardfail = no 



         Memory Information
            ------------------
          Available GA space size is    ********** doubles
          Available MA space size is     209690849 doubles


Global array virtual files algorithm will be used

 Parallel file system coherency ......... OK

 Integral file          = ./rc.aoints.00
 Record size in doubles =    65536    No. of integs per rec  =    32766
 Max. records in memory =      511    Max. records in file   =  2968749
 No. of bits per label  =       16    No. of bits per value  =       64


 #quartets = 1.903D+07 #integrals = 3.943D+08 #direct =  0.0% #cached =100.0%


File balance: exchanges=   384  moved=   447  time=   0.0

 
 Fock matrix recomputed
 1-e file size   =           164738
 1-e file name   = ./rc.f1             
 Cpu & wall time / sec            1.4            1.4
 
 tce_ao2e: fast2e=1
 half-transformed integrals in memory
 
 2-e (intermediate) file size =     15965585100
 2-e (intermediate) file name = ./rc.v2i
(rank:0 hostname:fcn19 pid:23365):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_pin_contig_hndl():1142 cond:(memhdl->memhndl!=((void *)0))

Just Got Here
I tried adding

io ga
2eorb
2emet 13

to the TCE block. In addition, I set tilesize 20 and attilesize 30. It didn't help.

However, playing with the number of nodes and cores revealed that only beyond 30 nodes does the program get past the crash point mentioned above and run the CCSD iterations.

I also noticed that the speed is highest when using 2 cores per node, but running on 30 nodes with 2 cores per node doesn't seem very sensible.

Any clue what could be wrong when running on fewer than 30 nodes?

By the way, for the many-node runs I used nodes with less memory: 4 GB per core, 64 GB per node at most.

Gets Around

You have a few issues here:

TCE
 SCF
 FREEZE CORE ATOMIC
 CCSD(T) 
 PRINT t1
 tilesize 35
 attilesize 40
END


First, the local memory requirement for (T) (stack, not global) scales as tilesize^6. With tilesize 35, a single work buffer is 35^6 × 8 bytes ≈ 13.7 GB per process, far beyond your 1500 MB stack, whereas tilesize 20 needs only about 0.5 GB. You should start with a tilesize of 20 or 24.

Second, TCE is memory-optimized for RHF and ROHF references via the 2eorb option, which has a dramatic effect on the memory requirements, especially in the four-index transformation.

Note that attilesize does not affect memory usage: it is only used for 2emet options greater than 1, which themselves only take effect when 2eorb is set.
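Putting the two points above together, a revised TCE block along these lines would be a reasonable starting point (this is a sketch: 2emet 13 is one common choice among the 2emet algorithms enabled by 2eorb, not the only valid one, and tilesize 20 is a conservative guess):

```
tce
 scf                # ROHF reference taken from the SCF block
 freeze atomic      # freeze the standard atomic core orbitals
 ccsd(t)
 2eorb              # RHF/ROHF-optimized 2-e integral storage
 2emet 13           # only meaningful together with 2eorb
 tilesize 20        # keeps the (T) local buffers around 0.5 GB
 attilesize 40
end
```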

By the way, freeze core atomic is just freeze atomic. You need to provide a non-zero number to the core argument for it to do anything. The manual has details.
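In other words, either of the following forms does something; the orbital count in the second form is just a placeholder, not a recommendation for this system:

```
tce
 freeze atomic      # equivalent to "freeze core atomic": per-atom cores
 # alternatively, freeze an explicit number of lowest orbitals:
 # freeze core 16
end
```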

Gets Around
You can use fewer MPI processes per node than there are cores available if you use OpenMP threads. Currently, threads are only used - at least in TCE - in BLAS, i.e. DGEMM, but this is ~half the wall time in most jobs.
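As a sketch, with Intel MPI (which your build path suggests) a hybrid launch on a 32-core node might look like the following; the rank and thread counts are illustrative, and a threaded BLAS must have been linked at build time for OMP_NUM_THREADS to have any effect:

```shell
# 4 MPI ranks per node x 8 OpenMP threads = 32 cores per node
export OMP_NUM_THREADS=8
mpirun -ppn 4 -np 8 nwchem rc.nw > rc.out
```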

I have threaded code for the other dominant kernels in TCE - TCE_SORT_4 and the bottleneck portions of (T) - but it isn't in version 6.3. I will try to make a patch in a month or two.

Just Got Here
Hi Jhammond,
Many thanks for all your explanations. Will fix the settings and test again.

Quote:

You can use fewer MPI processes per node than there are cores available if you use OpenMP threads. Currently, threads are only used - at least in TCE - in BLAS, i.e. DGEMM, but this is ~half the wall time in most jobs.


Have to try this out.

Quote:

I have threaded code for the other dominant kernels in TCE - TCE_SORT_4 and the bottleneck portions of (T) - but it isn't in version 6.3. I will try to make a patch in a month or two.


Looking forward to trying it.


Forum >> NWChem's corner >> Running NWChem