I apologise for starting two threads in a single day -- but I felt my two questions, while precipitated by the same process, were different enough to warrant two separate posts. I also apologise for a long and somewhat general post -- but at this point I'm not entirely sure what to do in terms of troubleshooting.
The situation is this: because of problems with a very low ulimit on the submit node at my university, coupled with the large number of procs launched by ecce (five per single job), I've been offered the use of a separate submit node. It is only accessible via the standard submit node, however, so I have to use port forwarding.
I've set up plenty of local nodes via ssh, without any problems, but this is the first time I can't use port 22. I've also got an SGE site set up, again via port 22. Getting stuff to work with port forwarding is tougher though. I'm running all of this from a debian workstation.
Now, I've managed to make it part of the way: I can set up access by editing /apps/siteconfig/remote_shells.site, e.g.
ssh_p5454: ssh -p 5454
ssh_p9999: ssh -p 9999
and I've added one node and one site to my machine list. I'm having problems actually getting a job going though. The node was added more as a test case. The site is obviously the real goal here, but sorting out the node might be sufficiently low-hanging fruit that it's worth picking first.
I guess I have two questions:
Is there anything obvious which I'm forgetting or doing wrong?
and
What are the corresponding defaults for the 'built-in' ssh protocol? Are scp, xterm etc. tagged on or not?
For simplicity, these are the situations:
The node:
I have access to a router, behind which sits a node with the IP address 192.168.1.106 (running debian).
I create a tunnel with
ssh remoterouter -L 9999:192.168.1.106:22
In the machine browser I've got everything set up -- I've added a machine with localhost as the IP address and selected ssh_p9999 as the protocol. I tested the setup by clicking on e.g. Machine Status and Disk Usage; these all confirm that 1) it's working and 2) I'm looking at the remote machine and not my actual localhost.
Next I go to organiser and try to submit a job.
Verifying remote login...
Validating local directory...
Validating job...
Generating job submission script...
Generating job monitoring configuration files...
Verifying remote directory...
Verifying scratch directory...
Verifying remote perl...
Transferring files...
Running submit script...
Starting eccejobstore...
ERROR: Unable to 'chmod u+x' submit__water-1
WARNING: Launch aborted...
Looking at the node, the directory structure (runtime directory) gets created, but it remains empty. The files don't get copied over.
My user account on the node is a standard single-user debian type account and so has few limitations.
If I click on "open" in the ECCE launcher to open a remote terminal window I get
ERROR: Opening remote shell failed: Unable to find xterm command in path
If I change to
"ssh_p9999: ssh -p 9999|xterm" that error disappears and is replaced by
Opening remote shell...
Use the command "df -k" to show disk usage.
but no shell opens. The shell ecce was launched from gives 'Failed to open remote shell on localhost (incorrect password?)' in spite of the password being correct and login via RSA key being set up.
I've also tried tagging on either scp or scp -r, but neither has any effect on whether the files get copied over.
scp and xterm work fine on the remote node.
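To take ECCE out of the picture, the tunnel can also be exercised by hand. The following is only a sketch: the run helper, the DRY_RUN switch and the test file name are made up for illustration, and by default it just prints the commands instead of running them. One detail that may matter is that scp takes the port as -P (upper case) whereas ssh takes it as -p; whether ECCE passes any port to scp on its own is something I don't know.

```shell
#!/bin/sh
# Sketch only: run, DRY_RUN, PORT and the file name are illustrative.
PORT=9999
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Which host answers on the forwarded port? Should name the remote node.
run ssh -p "$PORT" localhost hostname

# Copy a scratch file through the same tunnel. scp wants -P, not -p
# (for scp, lower-case -p means "preserve times and modes").
run scp -P "$PORT" /tmp/ecce_tunnel_test.txt localhost:/tmp/
```

Running it with DRY_RUN=0 executes both commands; if the copy only works with -P given explicitly, that would suggest a plain scp call ends up on port 22 of the real localhost.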
The site:
I set up the port forwarding using
"ssh -X me@msgln4.myuniversity.edu -L 5454:gn54.myuniversity.edu:22"
In the Machine browser I set up remote access, pick my ssh_p5454 protocol (see above) then test it by clicking on
Machine Status -> Can't execute query: qstat -F
Disk Usage-> Can't execute query: df -k
Queue-> Could not find command qstat -f
The responses come back reasonably quickly -- so it's not timing out.
If I go to the organizer and try to launch a job I end up in a situation very similar to the case above with the node -- the directory structure gets created, but no files, i.e. I get an empty folder called
~/nwchem/jobs/water
without any files in it.
Since the nfs disk is shared between the regular submit node and the portforwarded node, I can get the files up by submitting to the regular node. Restarting the job (since now the files are already present) gives:
Verifying remote login...
Validating local directory...
Validating job...
Generating job submission script...
Generating job monitoring configuration files...
Verifying remote directory...
Verifying scratch directory...
Verifying remote perl...
Transferring files...
Running submit script...
Starting eccejobstore...
ERROR: Could not find command qsub submit__water-1
Job submission output: qsub: Command not found.
CMDSTAT=1
+go+
WARNING: Launch aborted...
I've confirmed that qsub is available on the submit node.
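One possible explanation, which I have not confirmed, is that a command run as "ssh host command" gets a non-interactive shell whose PATH differs from the one in a normal login session. A small sketch for checking this (the function name missing_cmds is made up) lists which of the given commands are absent from PATH in whatever shell runs it:

```shell
#!/bin/sh
# Sketch only: missing_cmds is an illustrative name, not part of ECCE.
# Print every command from the argument list that is NOT found on PATH.
missing_cmds() {
    for c in "$@"; do
        command -v "$c" >/dev/null 2>&1 || printf '%s\n' "$c"
    done
}

# The commands that failed in the machine browser and at submission.
missing_cmds qsub qstat df
```

The function is plain POSIX sh, so it only helps where an sh-compatible shell runs it. A cruder check that does not depend on the remote shell is to compare the output of "ssh -p 5454 localhost 'which qsub'" with "which qsub" typed in an interactive session on the same node.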
Clicking 'open' to open a terminal gives:
"ERROR: Opening remote shell failed: Could not find command xset q"
which is fair enough -- xset isn't present on that node.
I'm using ecce v6.3.