ECCE and SLURM batch system


Clicked A Few Times
Our supercomputer administrators have recently switched from PBS to SLURM. For now they are supporting PBS submissions to SLURM, but I do not know their long-term plans for it. How difficult is it to add a new queueing system? Where do I find the scripts to make it happen?

Matthew Asplund

Clicked A Few Times
Follow-up to my own post
I have edited the QueueManager file to create a new set of SLURM commands, but I am not sure whether I also have to edit something else to make parsing of the SLURM command output work.

Matthew Asplund

Gets Around
Matthew,
let me know how it goes. I'm (slowly) working on setting up SLURM on my cluster (Debian Jessie doesn't package SGE anymore) and will try to get ECCE working with it.

Gets Around
I've set up SLURM on my cluster and have configured ECCE to work with it. See here: [1]

It works, but can probably be improved upon.

Clicked A Few Times
I actually edited the submit.site file to add explicit support for SLURM by adding these lines to the file:


SLURM {
#SBATCH --time=$wallTime
#SBATCH --ntasks=$totalprocs
#SBATCH --nodes=$nodes
#SBATCH -C 'avx'
#SBATCH --mem-per-cpu=4096M
}

I am still having problems with job monitoring, so I will try applying your eccejobmonitor changes to my installation.
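
To make the placeholders in the SLURM block above concrete, here is a minimal sketch of how $wallTime, $totalprocs and $nodes might expand into a job header. How ECCE actually performs the substitution is not shown in this thread; the function render_header and the sample values are hypothetical and only illustrate the idea.

```python
from string import Template

# Directive template mirroring the SLURM block added to submit.site.
# The placeholder names come from the post; everything else is illustrative.
SLURM_HEADER = Template(
    "#SBATCH --time=$wallTime\n"
    "#SBATCH --ntasks=$totalprocs\n"
    "#SBATCH --nodes=$nodes\n"
    "#SBATCH -C 'avx'\n"
    "#SBATCH --mem-per-cpu=4096M\n"
)


def render_header(wall_time, total_procs, nodes):
    """Fill in the three placeholders (hypothetical helper, not ECCE code)."""
    return SLURM_HEADER.substitute(
        wallTime=wall_time, totalprocs=total_procs, nodes=nodes
    )


# Sample values are assumptions chosen for the example.
header = render_header("02:00:00", 16, 2)
print(header)
```

Printing the rendered header next to the top of a generated submit script is a quick way to check that every placeholder was actually replaced.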

Gets Around
Matt,
the key to getting the job monitoring to work is to edit apps/scripts/eccejobmonitor.
Beware that $q contains the name of the queue manager in lower case, regardless of how you've defined it in QueueManagers.

Other than that, it was pretty straightforward (setting up SLURM itself was a bigger challenge), and I've been using it for a day and a bit now without issue.
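
The lower-case caveat above can be illustrated with a short sketch. This is not the eccejobmonitor code; status_command is a hypothetical helper that only shows why a lookup keyed on "SLURM" would never match a value of $q that arrives as "slurm".

```python
def status_command(q, job_id):
    """Return a status command for queue manager q (hypothetical helper).

    q is assumed to arrive already lower-cased, as described in the post,
    so the table keys are lower case as well.
    """
    table = {
        "pbs": ["qstat", job_id],
        "slurm": ["squeue", "-h", "-j", job_id],
    }
    if q not in table:
        raise ValueError("no monitoring command defined for queue manager: " + q)
    return table[q]


# The name defined as "SLURM" in QueueManagers shows up here as "slurm".
print(status_command("slurm", "9488438"))
```

Failing loudly on an unknown name makes a case mismatch show up immediately instead of silently skipping the monitoring step.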

Clicked A Few Times
So, I stopped playing with this, but am getting back to it. My problem right now is that I am getting the error "Unable to parse job id. Cannot monitor job." when I submit things. When I run the sbatch command to submit a job by hand, it prints "Submitted batch job 9488438" (or whatever the job ID is). I tried writing a wrapper script to reduce the output to just the job ID, but that didn't help. Is there a way to track what is actually happening during the submit process? I tried setting ECCE_DEBUG and ECCE_RCOM_LOGMODE, but that just outputs the ssh communication.
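
For reference, pulling the job ID out of the sbatch message quoted above can be sketched as follows. This is only an illustration of the parsing step, not the parser ECCE uses (which is not shown in this thread), and parse_job_id is a hypothetical name. sbatch also has a --parsable option that prints just the job ID (followed by ";" and the cluster name when one is present); whether ECCE would accept that form is an open question here.

```python
import re

# Matches the default sbatch message, e.g. "Submitted batch job 9488438".
JOB_ID_RE = re.compile(r"Submitted batch job (\d+)")


def parse_job_id(sbatch_output):
    """Return the job ID as a string, or None if the message is not found."""
    match = JOB_ID_RE.search(sbatch_output)
    return match.group(1) if match else None


print(parse_job_id("Submitted batch job 9488438"))  # 9488438
print(parse_job_id("sbatch: error: invalid partition specified"))  # None
```

Using search rather than a full-line match means extra lines printed before or after the message (login banners, warnings) do not break the extraction.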

Gets Around
Add a
#SBATCH --output=slurm.out

line so that messages get logged.

I read this as the submission failing, i.e. the jobs never run?

Log onto the submit node and run the submit_xxxxxx file manually. See what happens and if it runs. You might be able to narrow it down to either communication issues or something to do with slurm.

Clicked A Few Times
Actually, the jobs submit and run just fine, but I get an error
ERROR: Unable to parse job id. Cannot monitor job.
WARNING: Launch aborted...

So, it is in the submit step that things are failing.

