Memory problem in parallel runs: "ARMCI DASSERT fail"


I ran a series of coupled-cluster test calculations on FH2 using two nodes connected by InfiniBand, each with 12 cores and 48 GB of memory.
For fairly small basis sets and tasks, such as UCCSDT/avdz and UCCSD/vtz, everything works, but tasks requiring more than 500 MB of memory fail.

The input file is:

start fh2

scratch_dir ./tmp

memory heap 300 mb stack 300 mb global 3000 mb

geometry units au
    H       -0.466571969    0.000000000   -3.498280516
    H        0.624505061    0.000000000   -2.532671944
    F       -0.008378972    0.000000000    0.319965748
end

basis noprint
 * library cc-pvdz # or aug-cc-pvdz or others
end

SCF
  semidirect
  DOUBLET
  UHF
  THRESH 1.0e-10
  TOL2E  1.0e-10
END

TCE
  SCF
  CCSD # or CCSDT or CCSDTQ
END

TASK TCE ENERGY
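
As a rough sanity check on the memory directive above, here is a small Python sketch of what it adds up to. It assumes the heap/stack/global limits apply to each MPI process and that "mb" means 2**20 bytes; both are my assumptions, and the function names are only for illustration. With 12 processes per node on 2 nodes, the aggregate global memory comes out close to the "Available GA space size" that NWChem prints in the output.

```python
MB = 2**20  # assumption: "mb" in the memory directive means 2**20 bytes


def per_node_bytes(heap_mb, stack_mb, global_mb, procs_per_node):
    """Bytes requested on one node if every process gets the full limits."""
    return (heap_mb + stack_mb + global_mb) * MB * procs_per_node


def total_global_doubles(global_mb, n_procs):
    """Aggregate global memory over all processes, in 8-byte doubles."""
    return global_mb * MB // 8 * n_procs


# Values from the input file and the cluster description in this post.
node_bytes = per_node_bytes(300, 300, 3000, procs_per_node=12)
print(f"requested per node: {node_bytes / 2**30:.1f} GiB")
print(f"aggregate global memory: {total_global_doubles(3000, 24)} doubles")
```

If these assumptions hold, each node is asked for a bit over 42 GiB out of its 48 GB, so the request is large but should still fit in physical memory.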


For UCCSD/avtz calculations, the Hartree-Fock part completes normally, but the run terminates in the CC step, as shown below:
            Memory Information  
            ------------------   
          Available GA space size is    9437161950 doubles
          Available MA space size is      78639421 doubles

 Maximum block size        36 doubles

 tile_dim =     35

 Block   Spin    Irrep     Size     Offset   Alpha
 -------------------------------------------------
   1    alpha     a'     5 doubles       0       1
   2    alpha     a"     1 doubles       5       2
   3    beta      a'     4 doubles       6       3
   4    beta      a"     1 doubles      10       4
   5    alpha     a'    34 doubles      11       5
   6    alpha     a'    34 doubles      45       6
   7    alpha     a"    31 doubles      79       7
   8    beta      a'    34 doubles     110       8
   9    beta      a'    35 doubles     144       9
  10    beta      a"    31 doubles     179      10
   
 Global array virtual files algorithm will be used

 Parallel file system coherency ......... OK

 Integral file          = ./tmp/fh2.aoints.00
 Record size in doubles =  65536        No. of integs per rec  =  43688
 Max. records in memory =     15        Max. records in file   =   2287
 No. of bits per label  =      8        No. of bits per value  =     64

   
 #quartets = 1.396D+05 #integrals = 8.008D+06 #direct =  0.0% #cached =100.0%


File balance: exchanges=    12  moved=    15  time=   0.0
 
 
 Fock matrix recomputed
 1-e file size   =            12706
 1-e file name   = ./tmp/fh2.f1    
 Cpu & wall time / sec            0.2            1.1

 tce_ao2e: fast2e=1
 half-transformed integrals in memory

 2-e (intermediate) file size =       279803475
 2-e (intermediate) file name = ./tmp/fh2.v2i
 Cpu & wall time / sec            1.8            2.3

 tce_mo2e: fast2e=1
 2-e integrals stored in memory

 2-e file size   =        119972997
 2-e file name   = ./tmp/fh2.v2
 Cpu & wall time / sec           10.0           10.5
 do_pt =  F
 do_lam_pt =  F
 do_cr_pt =  F
 do_lcr_pt =  F
 do_2t_pt =  F
 T1-number-of-tasks                     6

 t1 file size   =              678
 t1 file name   = ./tmp/fh2.t1
 t1 file handle =       -998
 T2-number-of-boxes                    38

 t2 file size   =           368230
 t2 file name   = ./tmp/fh2.t2
 t2 file handle =       -995

 CCSD iterations
 -----------------------------------------------------------------
 Iter          Residuum       Correlation     Cpu    Wall    V2*C2
 -----------------------------------------------------------------
0: error ival=4
(rank:0 hostname:compute-10-15.local pid:19142):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
12: error ival=4
(rank:12 hostname:compute-10-1.local pid:9867):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
rank 0 in job 8  i10-15_52820   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9


For UCCSDTQ/vdz calculations, the run terminated at the third CC iteration:
 2-e file size   =           386290
 2-e file name   = ./tmp/fh2.v2
 Cpu & wall time / sec            0.4            0.4
 do_pt =  F
 do_lam_pt =  F   
 do_cr_pt =  F    
 do_lcr_pt =  F   
 do_2t_pt =  F    
 T1-number-of-tasks                     6
   
 t1 file size   =              140      
 t1 file name   = ./tmp/fh2.t1          
 t1 file handle =       -998
 T2-number-of-boxes                    38

 t2 file size   =            14660
 t2 file name   = ./tmp/fh2.t2
 t2 file handle =       -995

 t3 file size   =          1160539
 t3 file name   = ./tmp/fh2.t3
2: WARNING:armci_set_mem_offset: offset changed 794624 to 9244672
3: WARNING:armci_set_mem_offset: offset changed 0 to 8450048
6: WARNING:armci_set_mem_offset: offset changed 794624 to 8450048 
8: WARNING:armci_set_mem_offset: offset changed 794624 to 8450048  
13: WARNING:armci_set_mem_offset: offset changed 0 to -620834816     

 t4 file size   =         78188214
 t4 file name   = ./tmp/fh2.t4             

 CCSDTQ iterations
 --------------------------------------------------------
 Iter          Residuum       Correlation     Cpu    Wall
 --------------------------------------------------------
    1   0.2682660632262  -0.1813353786615    86.8    89.1
    2   0.0920127385001  -0.1943555090903    87.5    89.9
0: error ival=4
(rank:0 hostname:compute-10-15.local pid:19656):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
12: error ival=4
(rank:12 hostname:compute-10-1.local pid:10202):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193 cond:(pdscr->status==IBV_WC_SUCCESS)
application called MPI_Abort(comm=0x84000003, 1) - process 0
rank 0 in job 13  i10-15_52820   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9


If only one node is used, everything also works. It seems that the memory actually available becomes limited when running in parallel across multiple nodes.
I also checked the maximum shared memory segment size, which is about 35.4 GiB:
cat /proc/sys/kernel/shmmax
37976435712
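
To put that number next to the input, the following Python sketch compares the shmmax value with the global memory requested on one node. It assumes 12 processes per node, each allowed the full 3000 mb of global memory, with "mb" taken as 2**20 bytes. Whether shmmax limits a single segment or effectively the per-node total depends on how ARMCI allocates shared memory, which I do not know, so this is only an order-of-magnitude comparison.

```python
MB = 2**20  # assumption: "mb" in the memory directive means 2**20 bytes


def global_bytes_per_node(global_mb, procs_per_node):
    """Global memory requested on one node, summed over its processes."""
    return global_mb * MB * procs_per_node


shmmax = 37976435712  # value read from /proc/sys/kernel/shmmax above
demand = global_bytes_per_node(3000, procs_per_node=12)
margin = shmmax - demand

print(f"global memory per node: {demand} bytes")
print(f"shmmax:                 {shmmax} bytes")
print(f"margin:                 {margin / MB:.0f} MiB")
```

Under these assumptions the per-node global request sits only a couple of hundred MiB below shmmax, so the two are much closer than the 48 GB of physical memory would suggest.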


The environment variables used for compilation were:
setenv LARGE_FILES TRUE
setenv LIB_DEFINES "-DDFLT_TOT_MEM=16777216"
setenv NWCHEM_TOP /work2/nwchem-6.1.1
setenv NWCHEM_TARGET LINUX64
setenv ENABLE_COMPONENT yes
setenv TCGRSH /usr/bin/ssh
setenv USE_MPI "y"
setenv USE_MPIF "y"
setenv USE_MPIF4 "y"
setenv MPI_LOC /work2/intel/impi/4.1.0.024/intel64
setenv MPI_LIB ${MPI_LOC}/lib
setenv MPI_INCLUDE ${MPI_LOC}/include
setenv LIBMPI "-lmpigf -lmpigi -lmpi_ilp64 -lmpi"
setenv IB_HOME /usr
setenv IB_INCLUDE $IB_HOME/include
setenv IB_LIB $IB_HOME/lib64
setenv IB_LIB_NAME "-libverbs -libumad -lpthread -lrt"
setenv ARMCI_NETWORK OPENIB
setenv PYTHONHOME /usr
setenv PYTHONVERSION 2.4
setenv USE_PYTHON64 "y"
setenv CCSDTQ yes
setenv CCSDTLR yes
setenv NWCHEM_MODULES "all python"
setenv MKLROOT /work1/soft/intel/mkl/10.1.2.024
setenv BLASOPT "-L${MKLROOT}/lib/em64t -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm"
setenv FC "ifort -i8 -I${MKLROOT}/include"
setenv CC "icc -DMKL_ILP64 -I${MKLROOT}/include"
setenv MSG_COMMS MPI



How can I deal with this error? Has anyone run into a similar problem?
Any suggestions are welcome.