Job getting killed


Clicked A Few Times
Hi users,

I am running an SCF calculation with B3LYP/cc-pVTZ. The job ends abruptly after 6 iterations.

convergence    iter        energy       DeltaE   RMS-Dens  Diis-err    time
---------------- ----- ----------------- --------- --------- --------- ------
d= 0,ls=0.0,diis 1 -2213.9715078291 -8.41D+03 3.90D-03 1.04D+01 2734.6
d= 0,ls=0.0,diis 2 -2213.5586821483 4.13D-01 2.10D-03 1.84D+01 4017.2
d= 0,ls=0.0,diis 3 -2215.1492051354 -1.59D+00 4.64D-04 5.20D-01 5300.0
d= 0,ls=0.0,diis 4 -2215.1882090985 -3.90D-02 1.32D-04 9.59D-02 6584.5
d= 0,ls=0.0,diis 5 -2215.1973602169 -9.15D-03 5.31D-05 5.11D-03 7889.0
Resetting Diis
d= 0,ls=0.0,diis 6 -2215.1977626980 -4.02D-04 2.39D-05 2.10D-03 9194.5
0: error ival=4
(rank:0 hostname:node19.local pid:12786):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_call_data_server():2189 cond:(pdscr->status==IBV_WC_SUCCESS)
rank 0 in job 1 node19.local_60513 caused collective abort of all ranks
 exit status of rank 0: killed by signal 9 



How should we fix this?

Thanks
Karteek Kumar

Forum Vet
Karteek,
The error message you got comes from the InfiniBand communication layer.
It might be due to a temporary connectivity problem on your cluster.
Did you get this same error more than once in a reproducible way?

Cheers, Edo

Clicked A Few Times
Thanks Edo,

Yes, this error occurs frequently, and not just on one cluster: it happens on two clusters.

Thanks
Karteek Kumar

Forum Vet
Does it occur in a reproducible way, that is, always at the same point in your calculation?
Edo
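
One way to check whether the failure is reproducible in this sense is to compare where each failed run stopped. Below is a minimal Python sketch, not part of NWChem: the function name `summarize_failure` and the regular expression are assumptions based only on the output format shown in the first post, so they may need adjusting for other NWChem versions or modules.

```python
import re

# Matches DFT convergence lines in the format shown in the first post, e.g.
#   d= 0,ls=0.0,diis 6 -2215.1977626980 -4.02D-04 2.39D-05 2.10D-03 9194.5
# (pattern is an assumption derived from that excerpt only)
ITER_RE = re.compile(
    r"^\s*d=\s*\d+,ls=\s*[\d.]+,diis\s+(\d+)\s+(-?\d+\.\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+([\d.]+)\s*$"
)


def summarize_failure(log_text):
    """Return (last_iteration, last_energy, last_time, saw_armci_error).

    The first three values are None if no convergence line was found.
    """
    last = None
    saw_armci = False
    for line in log_text.splitlines():
        m = ITER_RE.match(line)
        if m:
            # Remember the most recent completed SCF iteration.
            last = (int(m.group(1)), float(m.group(2)), float(m.group(3)))
        elif "ARMCI DASSERT fail" in line:
            saw_armci = True
    if last is None:
        return None, None, None, saw_armci
    return last[0], last[1], last[2], saw_armci


# Excerpt taken from the output posted above.
SAMPLE = """\
d= 0,ls=0.0,diis 5 -2215.1973602169 -9.15D-03 5.31D-05 5.11D-03 7889.0
Resetting Diis
d= 0,ls=0.0,diis 6 -2215.1977626980 -4.02D-04 2.39D-05 2.10D-03 9194.5
0: error ival=4
(rank:0 hostname:node19.local pid:12786):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_call_data_server():2189 cond:(pdscr->status==IBV_WC_SUCCESS)
"""

if __name__ == "__main__":
    print(summarize_failure(SAMPLE))
```

Running this on the output file of each failed job and comparing the reported iteration and time would show whether the abort always happens at the same point or at varying points in the calculation.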

Clicked A Few Times
I never restarted from the beginning.

I restarted the job from where it stopped. Since the SCF cycle then converged in less time, I have not seen the error message again for the same job.

Just Got Here
I also have the same problem.

