Orphan processes in killed NWChem 6.1.1 job


Gets Around
We are running NWChem 6.1.1 (Jan 2012) on CentOS-6.3-x86_64 using openmpi 1.5.4 on a 16-core node. After a multi-processor job is killed, the system still shows CPU activity, and the %idle value reported by sar does not return to 100% as it does after a job that completes normally. These cores are then not available for subsequent multi-processor jobs. The only way I know to reclaim these cores is to restart the node.

Forum Vet
What happens if you try to kill the left-over processes? Are they in "D" state?

Gets Around
Running processes after job stopped
Quote:Edoapra Feb 14th 4:12 pm
What happens if you try to kill the left-over processes? Are they in "D" state?


They are in the "R" state. Here is the "ps aux" output (keller6 is the NWChem user.)

USER     PID  %CPU %MEM VSZ     RSS    TTY STAT START TIME  COMMAND
keller6  6001 99.8 0.2  1262432 140092 ?   R    19:25 29:12 /usr/local/nwchem/bin/nwchem input.inp
keller6  6002 99.9 0.1  1261552 125244 ?   R    19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
keller6  6003 99.9 0.1  1261540 128448 ?   R    19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
keller6  6004 99.9 0.1  1261540 122332 ?   R    19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp

I can kill the process with "kill -9 6001"

Gets Around
Quote:Jwkeller Feb 14th 10:59 am
We are running NWChem 6.1.1 (Jan 2012) on CentOS-6.3-x86_64 using openmpi 1.5.4 on a 16-core node. I notice that after a multi-processor job is killed, the system still claims the cpu activity and the %idle does not return to 100% as it does after a completed job, according to the output of a sar command. These cores are then not available for subsequent multi-processor jobs. The only way I know to reclaim these cores is to restart the node.


Sorry, it's the NWChem 6.1.1 (Jan 2013) version, not 2012. JK

Forum Vet
Quote:Jwkeller Feb 14th 9:07 pm
They are in the "R" state.
I can kill the process with "kill -9 6001"


Do you still need to reboot the cluster nodes after having manually killed the leftover processes?

Edo

Gets Around
Do you still need to reboot the cluster nodes after having manually killed the leftover processes?

No - I tried several runs, and I can indeed recover full functionality of the node by finding the process numbers and then issuing "kill -9 [process number]" once for each leftover process. Hopefully this can be done automatically, or prevented in the first place.
John K.

Forum Vet
Quote:Jwkeller Feb 15th 4:55 pm
No - I tried several runs, and I can indeed recover full functionality of the node by finding the process numbers and then issuing "kill -9 [process number]" once for each leftover process. Hopefully this can be done automatically, or prevented in the first place.
John K.

John,
Using killall, you should be able to kill all of the nwchem processes at once. The command is:
killall -9 nwchem


Another possibility is to use the Open MPI orte-clean tool (a.k.a. ompi-clean):
http://www.open-mpi.org/doc/v1.4/man1/ompi-clean.1.php
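
The manual steps discussed above (find the leftover process numbers, then kill each one) could be sketched as a small shell helper. This is only a sketch under assumptions: the function name leftover_pids is hypothetical, and it assumes the column layout of the "ps aux" listing shown earlier in the thread, where the command starts in the 11th field. The helper only prints PIDs; nothing is killed unless its output is passed on to kill.

```shell
#!/bin/sh
# leftover_pids USER EXE
# Hypothetical helper: reads "ps aux"-style text on stdin and prints the PID
# of every process owned by USER whose command (field 11) is exactly EXE.
# The first line is skipped because it is the column header.
leftover_pids() {
    awk -v user="$1" -v exe="$2" \
        'NR > 1 && $1 == user && $11 == exe { print $2 }'
}

# Example use (commented out so that running this file kills nothing).
# The -r option of GNU xargs skips the kill when no PIDs are found.
#   ps aux | leftover_pids keller6 /usr/local/nwchem/bin/nwchem | xargs -r kill -9
```

Matching on both the user and the full executable path is a deliberate choice here: it avoids touching another user's jobs on a shared node. Where pkill is available, "pkill -9 -u keller6 -x nwchem" should have a similar effect in one step.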

Gets Around
Thanks Edo - This is a problem in WebMO 12.1, which should insert a "scratch_dir" line when it creates the nwchem input file. Currently it dumps all of the aoints and grid files into one directory and tries to copy them back to the user's directory on the WebMO server, rather than putting them in a separate directory that is deleted after the job finishes. Apparently this is fixed in v 13 of WebMO Pro, but it is fairly easy to insert this line manually.
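
For anyone who wants to script the manual edit, a minimal sketch follows. The function name add_scratch_dir and the example path are hypothetical, and the sketch assumes that NWChem accepts the scratch_dir directive on the first line of the input file; check that against your own inputs before relying on it.

```shell
#!/bin/sh
# add_scratch_dir INPUT_FILE SCRATCH_DIRECTORY
# Hypothetical helper: prepends a "scratch_dir" line to an NWChem input file
# unless the file already contains one (matched case-insensitively).
add_scratch_dir() {
    if grep -qi '^[[:space:]]*scratch_dir' "$1"; then
        return 0    # already present, leave the file untouched
    fi
    # Write the new line plus the old contents to a temporary file,
    # then replace the original only if that succeeded.
    { printf 'scratch_dir %s\n' "$2"; cat "$1"; } > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example use (the path is an assumption, not taken from the thread):
#   add_scratch_dir input.inp /scratch/keller6/job1
```

Calling the helper a second time on the same file leaves it unchanged, so it can be run safely on inputs that WebMO has already fixed.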

