Problem with CCSD on multiple nodes


One of our users at SDSC was running NWChem 6.6 with the following input file:
________________________________________________________________________
start ebz_cs
scratch_dir /oasis/scratch/comet/jpg/temp_project/

       title "Ethyl Benzene Perpendicular Cs with NWChem"
print high

       geometry units angstroms print xyz autosym
C -1.9249432259 0.0000000000 -0.5957478699
C -0.4433456144 0.0000000000 -0.3471808528
C 0.2564703070 -1.2009091386 -0.1979588379
C 0.2564703070 1.2009091386 -0.1979588379
C 1.6197063300 -1.2049921394 0.0916889986
C 1.6197063300 1.2049921394 0.0916889986
C 2.3052419437 0.0000000000 0.2383389259
H -0.2723325992 -2.1405186828 -0.3149888186
H -0.2723325992 2.1405186828 -0.3149888186
H 2.1460344900 -2.1448708409 0.1988894664
H 2.1460344900 2.1448708409 0.1988894664
H 3.3643305934 0.0000000000 0.4602195673
H -2.1970685263 0.8781900301 -1.1844842813
H -2.1970685263 -0.8781900301 -1.1844842813
C -2.7066810912 0.0000000000 0.7196106585
H -3.7820558813 0.0000000000 0.5410146553
H -2.4551753627 -0.8807948291 1.3105739301
H -2.4551753627 0.8807948291 1.3105739301
end

basis
  * library 6-311G*
end

scf
maxiter 40
thresh 1.0e-8
profile
end

ccsd
maxiter 40
thresh 1.0e-7
freeze atomic
end

task ccsd(t)
________________________________________________________________________
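
Note that the deck above has no memory directive, so NWChem falls back to its compiled-in default allocation (the run below reports only 198 MB available when the ST2 array is allocated). For reference, a minimal sketch of what an explicit allocation sized for a 128 GB / 24-core Comet node could look like is shown here; the numbers are purely illustrative and are not what was actually run:
________________________________________________________________________
# illustrative only: roughly 5 GB per MPI rank on a 128 GB / 24-core node
memory stack 1200 mb heap 200 mb global 3600 mb
________________________________________________________________________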

When he tries to run on more than one node of the Comet cluster (comet.sdsc.edu; 24 cores/node; 128 GB/node; 2 nodes; http://www.sdsc.edu/support/user_guides/comet.html), the CCSD program gives the following error message:



 iter     correlation      delta        rms        T2      Non-T2     Main
            energy         energy      error      ampl      ampl      Block
                                                  time      time       time

Wrote g_t2 to ./ebz_cs.t2

g_st2 size:      56 MB
mem. avail 198 MB
Memory based method: ST2 is allocated
ST2 array is replicated      0.07s
1 -1.1389441747 -1.139D+00 1.182D-10 29.70 0.01 27.78

Wrote g_t2 to ./ebz_cs.t2

g_st2 size:      56 MB
mem. avail 198 MB
Memory based method: ST2 is allocated
ST2 array is replicated      0.05s
8: error ival=4
(rank:8 hostname:comet-14-10.sdsc.edu pid:20772):ARMCI DASSERT fail. ../../ga-5-4/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
[cli_8]: aborting job:
application called MPI_Abort(comm=0x84000001, 1) - process 8
[comet-14-10.sdsc.edu:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 9. MPI process died?
[comet-14-10.sdsc.edu:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[comet-14-10.sdsc.edu:mpispawn_1][child_handler] MPI process (rank: 8, pid: 20772) exited with status 1
Connection to comet-14-09 closed by remote host.
Connection to comet-14-10 closed by remote host.

Assuming it is a memory issue, I tried cutting down on the number of cores per node, but no luck.
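
For reference, the jobs are submitted with a Slurm script roughly like the sketch below; the account, module name, and output file are placeholders, and only the node/task counts reflect what was actually tried (2 nodes, fewer than the full 24 cores each):
________________________________________________________________________
#!/bin/bash
#SBATCH --job-name=ebz_cs
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12   # reduced from 24 to ease per-node memory pressure
#SBATCH --time=08:00:00
#SBATCH --account=XXXXXX       # placeholder

module load nwchem             # module name is a placeholder

cd /oasis/scratch/comet/jpg/temp_project/
ibrun nwchem ebz_cs.nw > ebz_cs.out
________________________________________________________________________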

Thanks,

Jerry

