ECCE launches five processes per job: hitting ulimit.


Gets Around
Hi,
I've been using ECCE to manage NWChem and Gaussian jobs on a couple of clusters and it's been a good experience. However, I recently got access to a new university cluster using SGE and found that I was being locked out quite frequently. It turns out that the number of concurrent processes allowed per user on the head node is limited to 32.

Just to make myself clear: obviously, the true underlying problem is an overly strict ulimit, and this isn't something I'm faulting ECCE for.

Looking at what happens when ECCE submits a job, five processes are created and put to sleep on the head node:
sshd: user [priv]
sshd: user@notty
bash -c echo +hi+ && csh -i
-sh -i
perl eccejobmonitor -configFile eccejobmonitor.conf -jobId X -bookmark Y
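To put numbers on this, here is a rough sketch of how many monitored jobs fit under a per-user process cap. It assumes the five processes listed above per job and the limit of 32 mentioned earlier; the function names are hypothetical and not part of ECCE, and it assumes the cap is the per-user process limit that `ulimit -u` reports, which may not be how this cluster enforces it.

```python
import resource

# Processes left on the head node per submitted job, taken from the
# listing above (ECCE 6.3). Treat this as an observed value, not a constant.
PROCS_PER_JOB = 5


def max_concurrent_jobs(proc_limit, procs_per_job=PROCS_PER_JOB, already_running=0):
    """Return how many jobs fit under a per-user process limit.

    already_running is the number of processes the user already has on the
    head node (login shell, editors, and so on).
    """
    if procs_per_job <= 0:
        raise ValueError("procs_per_job must be positive")
    free = max(proc_limit - already_running, 0)
    return free // procs_per_job


def current_soft_limit():
    """Return the soft per-user process limit, or None if it is unlimited.

    Assumption: the cluster enforces its cap through RLIMIT_NPROC.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return None if soft == resource.RLIM_INFINITY else soft


if __name__ == "__main__":
    # A limit of 32 with five processes per job, nothing else running.
    print(max_concurrent_jobs(32))
    # The same limit if each job needed only three processes.
    print(max_concurrent_jobs(32, procs_per_job=3))
```

In practice the usable number is lower, because the interactive login session and any other tools running on the head node count against the same limit; pass those in as `already_running`.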

I don't have a computing background and so may be a bit naïve, but here are my two questions:
Are all these processes necessary? If they aren't, is there an easy 'fix' by editing e.g. a submit script?

If all these procs aren't required, taking into account the (overly strict) limits that are imposed on the number of user-associated procs on some computational clusters may be something worth doing in future versions of ECCE. One head node proc per launched job would be ideal.

Again, thank you for all the great work!

EDIT: I'm using ECCE v6.3.

Gets Around
The number of processes created is definitely not something that is easily changed. It's part of the core design of ECCE remote communications. While there is some waste that could be reduced or streamlined, I think there will always be 3 or 4 processes per job launch. The part that could be streamlined is how ECCE creates a csh shell. That exists only because our remote communications logic uses csh syntax going back to the 1990s, and since csh has become much less popular over the years, a couple of days of software development could eliminate it. With a ulimit of 32 on processes, though, that's not going to be a guaranteed fix to your problem. Plus, there isn't the funding available to make the required code changes now.

As an aside, when ECCE goes open source later this summer we'd certainly love user contributions to the code base. Source code was already released with ECCE 6.3, so someone intent on cleaning up the remote communications could already dive into it. For now, though, I'll try to help with your other ssh port redirect issue instead.

Gary

Gets Around
Hi Gary,
I figured as much and I felt a bit guilty asking about it since it really is an IT issue on this side -- you easily use 3-5 procs in a simple cat/sed/gawk sequence so 32 is a tad bit low.
Cheers for your prompt answer though.

I'll post my replies to the ssh port redirect thread when I get the troubleshooting data.

