Hello All,
I am trying to run an MBPT4 calculation on 4 AMD Opteron(tm) Processor 6172 processors (Infiniband network, 24 GB system RAM, OS: Linux ip13 2.6.32-131.12.1.el6.x86_64).
Originally I thought it was an issue with the ARMCI shared-memory limit, so I increased shmmax from the default to:
cat /proc/sys/kernel/shmmax
68719476736
The input for the run is as follows:
start dma
scratch_dir ./SCRATCH
permanent_dir ./PERMAN
# dividing 1 gb of total ram per thread
# memory stack 200 mb heap 100 mb global 500 mb
memory stack 1500 mb heap 100 mb global 2000 mb
charge 0
geometry autosym
C1 0.03796436 0.18251755 0.21083836
C2 -1.14643461 -0.50977604 -0.10725952
H3 -2.07214432 0.02847624 -0.27406469
C4 -1.14958811 -1.90580585 -0.22440333
H5 -2.07916294 -2.41228569 -0.47104527
C6 0.02152990 -2.64143568 -0.03998740
H7 0.01534945 -3.72306377 -0.13518932
C8 1.20391074 -1.96152410 0.27921917
H9 2.12421717 -2.51557808 0.44441357
C10 1.21275941 -0.57360520 0.40922028
H11 2.13127414 -0.07443442 0.70080833
N12 0.07488015 1.58435483 0.39274777
C13 1.13741802 2.26805343 -0.34448643
H14 0.96254958 2.23446895 -1.43323786
H15 1.17013627 3.31302107 -0.02526109
H16 2.10797639 1.82038334 -0.13597013
C17 -1.19229534 2.27937311 0.23132106
H18 -1.57833748 2.23540223 -0.80227603
H19 -1.93886217 1.85673171 0.90697480
H20 -1.04316158 3.32826334 0.49846146
end
scf
semidirect memsize 1000000 filesize 1000
end
basis
* library 6-31G*
end
TCE
tilesize 10
attilesize 15
2eorb
2emet 14
MBPT4
end
task tce energy
Watching the resources as the job runs, I see that the memory used by each process grows over the course of the calculation until it exceeds the limits assigned in the input, and it keeps growing until all available RAM is consumed. It is not clear from the error output whether this is what causes the job to fail.
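For reference, here is a quick sketch of what the active memory line in the input asks for in total. It assumes that each of the 4 MPI ranks gets the full stack + heap + global allotment and that "24 GB" means 24 GiB; both are my assumptions, not something the output confirms.

```python
# Values from the active "memory" line in the input above, in MB per process.
STACK_MB, HEAP_MB, GLOBAL_MB = 1500, 100, 2000
N_PROCESSES = 4          # ranks 0-3 appear in the error output
NODE_RAM_MB = 24 * 1024  # assumption: "24 GB" is 24 GiB


def per_process_mb(stack_mb, heap_mb, global_mb):
    """Memory requested by one process: stack + heap + global."""
    return stack_mb + heap_mb + global_mb


def total_mb(stack_mb, heap_mb, global_mb, n_processes):
    """Memory requested by all processes together."""
    return per_process_mb(stack_mb, heap_mb, global_mb) * n_processes


per_proc = per_process_mb(STACK_MB, HEAP_MB, GLOBAL_MB)
total = total_mb(STACK_MB, HEAP_MB, GLOBAL_MB, N_PROCESSES)
fraction = total / NODE_RAM_MB

print(f"per process: {per_proc} MB")
print(f"all processes: {total} MB ({fraction:.0%} of node RAM)")
```

So on paper the request is 3600 MB per process and 14400 MB in total, which should fit in RAM if the limits were being respected.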
The error output is as follows:
WARNING: No HCA parameters were found for the HCA that Open MPI
detected:
Hostname: ip13
HCA vendor ID: 0x02c9
HCA vendor part ID: 26428
Default HCA parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_hca_param_files MCA parameter to set values for your HCA.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_hca_params_found to 0.
WARNING: No HCA parameters were found for the HCA that Open MPI
detected:
Hostname: ip13
HCA vendor ID: 0x02c9
HCA vendor part ID: 26428
Default HCA parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_hca_param_files MCA parameter to set values for your HCA.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_hca_params_found to 0.
WARNING: No HCA parameters were found for the HCA that Open MPI
detected:
Hostname: ip13
HCA vendor ID: 0x02c9
HCA vendor part ID: 26428
Default HCA parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_hca_param_files MCA parameter to set values for your HCA.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_hca_params_found to 0.
WARNING: No HCA parameters were found for the HCA that Open MPI
detected:
Hostname: ip13
HCA vendor ID: 0x02c9
HCA vendor part ID: 26428
Default HCA parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_hca_param_files MCA parameter to set values for your HCA.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_hca_params_found to 0.
3: WARNING:armci_set_mem_offset: offset changed -337894178816 to -337883676672
2: WARNING:armci_set_mem_offset: offset changed -705578352640 to -705567850496
Last System Error Message from Task 0:: No such file or directory
[ip13:31296] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0
Last System Error Message from Task 3:: No such file or directory
[ip13:31299] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD with errorcode 0
Last System Error Message from Task 1:: No such file or directory
[ip13:31297] MPI_ABORT invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 0
Last System Error Message from Task 2:: No such file or directory
[ip13:31298] MPI_ABORT invoked on rank 2 in communicator MPI_COMM_WORLD with errorcode 0
[ip13:31293] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[ip13:31293] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[ip13:31293] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[ip13:31293] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[ip13:31293] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
From the output, it looks as if it fails at the 4-electron integrals step:
#quartets = 3.098D+06 #integrals = 5.813D+07 #direct = 95.6% #cached = 4.4%
Fock matrix recomputed
1-e file size = 24649
1-e file name = ./SCRATCH/dma.f1
Cpu & wall time / sec 5.0 5.7
4-electron integrals stored in orbital form
v2 file size = 85767682
4-index algorithm nr. 14 is used
imaxsize = 15
imaxsize ichop = 0
next size_4af ====> 185395456
size_2e ===== 85767682
Cpu & wall time / sec 393.6 394.6
do_pt = F
do_lam_pt = F
do_cr_pt = F
do_lcr_pt = F
do_2t_pt = F
T1-number-of-tasks 52
t1 file size = 4092
t1 file name = ./SCRATCH/dma.t1
t1 file handle = -998
T2-number-of-boxes 3614
t2 file size = 22383825
t2 file name = ./SCRATCH/dma.t2
t2 file handle = -996
available GA memory 1880742256 bytes
createfile: failed ga_create size=*********
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
1:1:createfile: failed ga_create size=:: -942664722
(rank:1 hostname:ip13 pid:31297):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
available GA memory 1880742264 bytes
createfile: failed ga_create size=*********
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
2:2:createfile: failed ga_create size=:: -942664722
(rank:2 hostname:ip13 pid:31298):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
available GA memory 1880745304 bytes
createfile: failed ga_create size=*********
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
3:3:createfile: failed ga_create size=:: -942664722
(rank:3 hostname:ip13 pid:31299):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
available GA memory 1880739000 bytes
------------------------------------------------------------------------
createfile: failed ga_create size=*********
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
47: task tce energy
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation
For further details see manual section:
0:0:createfile: failed ga_create size=:: -942664722
(rank:0 hostname:ip13 pid:31296):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
It appears as if either the system is not correctly allocating space to GA, or the system is not following memory limits assigned per processor in the input. Any input or suggestions would be welcome.
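One thing I tried to check is the negative size in the ga_create error. The sketch below assumes the value wrapped around a signed 32-bit integer exactly once, which is only a guess on my part; I also do not know whether the size is counted in bytes or in 8-byte elements, so the second figure is hypothetical as well.

```python
INT32_MODULUS = 2**32


def unwrap_int32(value, wraps=1):
    """Undo `wraps` wraparounds of a signed 32-bit integer (assumed, not confirmed)."""
    return value + wraps * INT32_MODULUS


reported_size = -942664722       # from "failed ga_create size=:: -942664722"
available_ga_bytes = 1880742256  # from "available GA memory" on rank 1

candidate_size = unwrap_int32(reported_size)
print(candidate_size)  # 3352302574

# Under the single-wrap assumption the request exceeds the available GA memory
# even if the size is already in bytes.
print(candidate_size > available_ga_bytes)

# Hypothetical: if the size counts 8-byte double-precision elements instead.
print(f"{candidate_size * 8 / 2**30:.1f} GiB")
```

If that guess is right, the requested array is larger than the available GA memory on each rank regardless of the units.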