2:31:06 PM PDT - Mon, Apr 17th 2017 |
I have a long-running job (one I didn't expect to run so long, so I did not generate restart files). This job is using the Tensor Contraction Engine to do an IP-EOM computation. In the middle of this job, one of my nodes suffered a random reboot (no clue why yet). There are no recent messages in the log file; the last message is one about a broken pipe to the node that failed. I had expected the whole computation to shut down after that. Instead, all the remaining processes on the other nodes appear to still be running full bore.
My question is: are the other processes actually still OK, or should I shut down the computation and start over?
I'm hoping somebody knows about the fault tolerance off the top of their head, since it would take me a long time to dig through the code to figure it out.
Thanks,
Jonathan