Solved: ECCE using non-standard ssh ports -- port redirection.


Gets Around
I apologise for starting two threads in a single day -- but I felt my two questions, while precipitated by the same process, were different enough to warrant two separate posts. I also apologise for a long and somewhat general post -- but at this point I'm not entirely sure what to do in terms of troubleshooting.

The situation is this: because of a very low ulimit on the submit node at my university, coupled with the large number of processes launched by ECCE (five per job), I've been offered the use of a separate submit node. It is only accessible via the standard submit node, however, so I have to use port redirection.

I've set up plenty of local nodes via ssh, without any problems, but this is the first time I can't use port 22. I've also got an SGE site set up, again via port 22. Getting stuff to work with port redirect is tougher though. I'm running all of this from a debian workstation.

Now, I've managed to make it part of the way: I can set up access by editing /apps/siteconfig/remote_shells.site, e.g.
ssh_p5454: ssh -p 5454
ssh_p9999: ssh -p 9999

and I've added one node and one site to my machine list. I'm having problems actually getting a job going though. The node was added more as a test case. The site is obviously the real goal here, but sorting out the node might be a sufficiently low-hanging fruit that it's worth reaping it first.

I guess I have two questions:
Is there anything obvious which I'm forgetting or doing wrong?
and
What is the corresponding default for ssh i.e. the 'built-in' commands? Are scp and xterm etc tagged on or not?

For simplicity, these are the situations:




The node:
I have access to a router, behind which sits a node with the ip address 192.168.1.106 (running debian)
I create a tunnel with
ssh remoterouter -L 9999:192.168.1.106:22

In the machine browser I've got everything set up: I've added a machine with localhost as the IP address, selected ssh_p9999 as the protocol, and tested the setup by clicking on e.g. Machine Status and Disk Usage. These all confirm that 1) it's working and 2) I'm looking at the remote machine and not my actual localhost.
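For reference, the two tunnels used in this post can be sketched with one small helper. This is only an illustration: tunnel_cmd is a hypothetical name, it just prints the command rather than running it, and the -N flag (forward ports without running a remote command) is an addition that is not part of the original setup.

```shell
# Hypothetical helper: print the ssh command that forwards a local port
# to port 22 on a host that is only reachable through a gateway.
# It does not open any connection itself.
tunnel_cmd() {
    local_port=$1
    target=$2
    gateway=$3
    printf 'ssh -N -L %s:%s:22 %s\n' "$local_port" "$target" "$gateway"
}

# The debian node behind the router:
tunnel_cmd 9999 192.168.1.106 remoterouter
# The separate submit node behind the standard submit node:
tunnel_cmd 5454 gn54.myuniversity.edu me@msgln4.myuniversity.edu
```

Once a tunnel is up, something like "ssh -p 9999 localhost hostname" should print the name of the remote machine rather than that of the workstation.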

Next I go to organiser and try to submit a job.
Verifying remote login...
Validating local directory...
Validating job...
Generating job submission script...
Generating job monitoring configuration files...
Verifying remote directory...
Verifying scratch directory...
Verifying remote perl...
Transferring files...
Running submit script...
Starting eccejobstore...
ERROR: Unable to 'chmod u+x' submit__water-1
WARNING: Launch aborted...

Looking at the node, the directory structure (runtime directory) gets created, but it remains empty. The files don't get copied over.

My user account on the node is a standard single-user debian type account and so have few limitations.

If I click on "open" in the ECCE launcher to open a remote terminal window I get
ERROR: Opening remote shell failed:  Unable to find xterm command in path


If I change to
"ssh_p9999: ssh -p 9999|xterm" that error disappears and is replaced by
Opening remote shell...
Use the command "df -k" to show disk usage.

but no shell opens. The shell ecce was launched from gives 'Failed to open remote shell on localhost (incorrect password?)' in spite of the password being correct and log in via rsa key being set up.

I've also tried tagging on scp xor scp -r but neither has any effect on whether the files get copied over.

scp and xterm work fine on the remote node.



The site
I set up the port forwarding using
"ssh -X me@msgln4.myuniversity.edu -L 5454:gn54.myuniversity.edu:22"

In the Machine browser I set up remote access, pick my ssh_p5454 protocol (see above) then test it by clicking on
Machine Status -> Can't execute query: qstat -F
Disk Usage-> Can't execute query: df -k
Queue-> Could not find command qstat -f

The responses come reasonably quick -- so it's not timing out.

If I go to the organizer and try to launch a job I get into a situation very similar to the case above with the node: the directory structure gets created, but no files, i.e. I get an empty folder called
~/nwchem/jobs/water
without any files in it.

Since the nfs disk is shared between the regular submit node and the portforwarded node, I can get the files up by submitting to the regular node. Restarting the job (since now the files are already present) gives:

Verifying remote login...
Validating local directory...
Validating job...
Generating job submission script...
Generating job monitoring configuration files...
Verifying remote directory...
Verifying scratch directory...
Verifying remote perl...
Transferring files...
Running submit script...
Starting eccejobstore...
ERROR: Could not find command qsub submit__water-1
Job submission output: qsub: Command not found.
CMDSTAT=1
+go+
WARNING: Launch aborted...


I've confirmed that qsub is available on the submit node.

Clicking 'open' to open a terminal gives:
"ERROR: Opening remote shell failed: Could not find command xset q"
which is fair enough -- xset isn't present on that node.
I'm using ecce v6.3.

Gets Around
Since you are experiencing what potentially looks like several remote communications issues, I think you would benefit greatly from enabling the logging of all the underlying commands that ECCE issues to launch a job. This is pretty much the first step whenever something isn't working right related to remote communications in ECCE. If you look at the $ECCE_HOME/siteconfig/site_runtime file you'll see all the variables that ECCE allows for customizing behavior including some for debugging like logging remote communication. There is also documentation for how to use these variables. So editing the site_runtime file is one way to enable this logging.

In this case though it's probably easier just to manually set the variable needed and then run ECCE rather than changing the site_runtime file. The environment variable you want to set is $ECCE_RCOM_LOGMODE and you'll want to set the value to "true". In csh it would be "setenv ECCE_RCOM_LOGMODE true" and for sh/bash it would be "export ECCE_RCOM_LOGMODE=true". Then in the same shell you'll want to start ECCE so that it sees this new variable definition. Then whenever you do something in ECCE that requires remote communication, you'll see what ECCE does behind the scenes sent to the terminal window where ECCE was started, which can be a lot of data. If you can't scroll back to the start of the output you will want to start ecce inside a "script" session so that a file is created with all the remote communications output. Hopefully this will help in figuring out what is going wrong. I'd work on "the node" issue first since it seems like you are having more success there.
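As a minimal sketch of those steps for sh/bash: set the variable, then start ECCE inside a "script" session so the output ends up in a file. The log file name below is an arbitrary choice for illustration, and "script -c" is the util-linux form of the command; other systems may need a different invocation.

```shell
# Turn on logging of ECCE's remote communication for this shell session.
export ECCE_RCOM_LOGMODE=true

# Hypothetical log location; any writable path will do.
LOGFILE="${LOGFILE:-$HOME/ecce_rcom.log}"

# Start ECCE inside a "script" session so the verbose output is saved.
if command -v ecce >/dev/null 2>&1; then
    script -c ecce "$LOGFILE"
else
    echo "ecce not found in PATH; start it by hand from this same shell" >&2
fi
```

The important part is that ECCE is started from the same shell in which the variable was exported.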

There is also a way to tell ECCE to use the existing ssh connection for file transfer instead of a separate scp command, which may prove useful. If you look at the $ECCE_HOME/siteconfig/CONFIG.chinook file you'll see this "singleConnect" variable being set to the value of "check". In your case you'd want to edit your CONFIG.<machine> file and set the value of this variable to "true". This will remove the separate scp file transfer step. But you may be able to fix this issue without resorting to that, because the $ECCE_RCOM_LOGMODE setting should give you more information on what is happening now.

Gary

Gets Around
When you set $ECCE_RCOM_LOGMODE then you should also see the exact ssh/scp commands ECCE is issuing. Between that and seeing what is being sent over those ssh/scp connections you can also try to reproduce what is happening manually if it is not clear why ECCE is having a problem. That's another common debugging technique we've employed over the years. This is one of the "big 3" for problems that come up using ECCE. Those are:

1. OpenGL graphics configuration issues (mostly alleviated in recent years by using Mesa software OpenGL as the default configuration)
2. Compute resource registration for machines running batch queue schedulers
3. Remote communication issues

When I look at the first problem you describe under "the node" section of your original post, I'm guessing that something isn't working related to the scp file copy because of your customized remote_shells.site file. But I think we made ECCE smart enough to substitute in the command line arguments you specified for ssh when it runs scp. I wouldn't be surprised though if something related to that broke over the years. I can't remember the last time we tried using the remote_shells.site capability--definitely many years ago.

Gary

Gets Around
SCP and submit solution
Hi Gary.
I turned on ECCE_RCOM_LOGMODE and managed to sort out the scp problem and with that got things running.
Thank you so much for pointing me in the right direction! It turns out that you made ECCE a bit too smart...

The original problems were
1. I can't copy the files to remote site/node and
2. No remote xterm opens

SCP problem: It turns out scp wants upper-case -P for the port; lower-case -p means something else to it (preserve modification times). ssh, of course, wants lower-case -p and doesn't accept upper-case -P.

This is from a successful run on a local node using port 22:
arg 0: scp
arg 1: -r
arg 2: /tmp/ecce_andy/jobs/water__8PsJsN/nwch.nw
arg 3: /tmp/ecce_andy/jobs/water__8PsJsN/submit__water
arg 4: /tmp/ecce_andy/jobs/water__8PsJsN/eccejobmonitor.conf
arg 5: /tmp/ecce_andy/jobs/water__8PsJsN/nwchem.desc
arg 6: /home/andy/.ecce/ecce-6.3/apps/scripts/eccejobmonitor
arg 7: andy@borax:/home/andy/mine/testing/old/water
end remote copy command


This is an unsuccessful run using a node via localhost -p 9999
with this in the ecce-6.3/apps/siteconfig/remote_shells.site file
ssh_p9999: ssh -p 9999

and this is what happens:
arg 0: scp
arg 1: -p
arg 2: 9999
arg 3: /tmp/ecce_andy/jobs/water__bLmRNx/nwch.nw
arg 4: /tmp/ecce_andy/jobs/water__bLmRNx/submit__water
arg 5: /tmp/ecce_andy/jobs/water__bLmRNx/eccejobmonitor.conf
arg 6: /tmp/ecce_andy/jobs/water__bLmRNx/nwchem.desc
arg 7: /home/andy/.ecce/ecce-6.3/apps/scripts/eccejobmonitor
arg 8: andy@localhost:/home/andy/jobs/testing/old/water


Forgetting about the -r switch for a second here, the latter command fails when I try:
scp -p 9999 /tmp/ecce_andy/jobs/water__bLmRNx/nwch.nw andy@localhost:/home/andy/jobs/testing/old/water
scp: /home/andy/jobs/testing/old/water: No such file or directory


which is true: there is no such directory on localhost itself (but there should be on the node reached via port 9999). That directory structure is on the remote node (runtime: /home/andy/jobs + testing/old/water), but scp does not treat -p as a port option, so it connects to the default port 22 on localhost instead. (The files on the ECCE workstation are in ~/.ecce/ecce-6.3/server/data/Ecce/users/andy/testing/old/water/.)

However,
scp -P 9999 /tmp/ecce_andy/jobs/water__bLmRNx/nwch.nw andy@localhost:/home/andy/jobs/testing/old/water
nwch.nw                                                                                                                                                                                                    100%  476     0.5KB/s   00:00 

works.

The solution is thus to change to:
ssh_p9999: ssh -p 9999|scp -P 9999

I can now submit jobs to the remote debian node (p 9999)!
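To avoid mixing up the two spellings when adding further forwarded ports, the entry can be generated. This is just a sketch: site_entry is a hypothetical helper, and it only reproduces the "ssh ...|scp ..." form shown above.

```shell
# Hypothetical helper: print a remote_shells.site entry for a forwarded port.
# ssh takes the port as lower-case -p, scp as upper-case -P.
site_entry() {
    port=$1
    printf 'ssh_p%s: ssh -p %s|scp -P %s\n' "$port" "$port" "$port"
}

site_entry 9999   # prints: ssh_p9999: ssh -p 9999|scp -P 9999
site_entry 5454   # prints: ssh_p5454: ssh -p 5454|scp -P 5454
```

The printed lines go into the remote_shells.site file under apps/siteconfig in the ECCE installation.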




Getting the remote site to work for submission took a little longer and boiled down to a silly little thing (in addition to the -P issue):
I had qsub in the path name like this:
 NWChem: /opt/sw/nwchem-6.1/bin/nwchem
 Gaussian-03: /usr/local/bin/G09
 perlPath: /usr/bin/perl
 qmgrPath: /opt/n1ge62/bin/lx24-amd64/qsub

while it should really be
 NWChem: /opt/sw/nwchem-6.1/bin/nwchem
 Gaussian-03: /usr/local/bin/G09
 perlPath: /usr/bin
 qmgrPath: /opt/n1ge62/bin/lx24-amd64
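In other words, perlPath and qmgrPath take the directory that holds the binary, not the binary itself. A small sketch of deriving those values with dirname, using the paths from this post (the variable names are made up for the example):

```shell
# Full paths to the binaries on the submit node, as in the post.
perl_bin=/usr/bin/perl
qsub_bin=/opt/n1ge62/bin/lx24-amd64/qsub

# perlPath and qmgrPath expect the containing directory.
printf 'perlPath: %s\n' "$(dirname "$perl_bin")"   # perlPath: /usr/bin
printf 'qmgrPath: %s\n' "$(dirname "$qsub_bin")"   # qmgrPath: /opt/n1ge62/bin/lx24-amd64
```

On the submit node itself, "which qsub" gives the full path to start from.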



On a related note, I'm not sure what corresponding lines I'm to look for when parsing the logging output from using "singleConnect: check", but it's not quite as relevant anymore.

See here for the full setup using port redirection:
[blog http://verahill.blogspot.com.au/2012/05/port-redirection-with-eccenwchem.html]



Remaining problem:
This really isn't that important anymore since I'm more concerned with launching jobs.

However, here it is for completeness:
I still can't open any remote terminals, and I've also tried with e.g.
ssh_p9999: ssh -XC -p 9999|scp -P 9999


I'm not sure why, as running
xterm -title "oxygen" -bg "#b7b8ba" -fg "#000000" -sb -e csh -c "cd ~ && $SHELL"

opens a window as it is supposed to.

Creating remote shell:
machine (localhost)
remote shell (ssh_p9999)
local shell (csh)
user name (andy)
password is 0 characters
Remote shell command:
arg 0: ssh
arg 1: -XC
arg 2: -p
arg 3: 9999
arg 4: -v
arg 5: -o
arg 6: ForwardX11=yes
arg 7: -l
arg 8: andy
arg 9: localhost
arg 10: echo
arg 11: +hi+
arg 12: &&
arg 13: csh
arg 14: -i
End remote shell command
OpenSSH_5.9p1 Debian-5, OpenSSL 1.0.1c 10 May 2012
Authenticated to localhost ([127.0.0.1]:9999).
unalias precmd; set prompt=+go+; unset echo
+hi+
% Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
+go+unalias *
+go+date; echo CMDSTAT=$status
Wed May 30 20:06:23 PDT 2012
CMDSTAT=0
+go+exit; echo GOODBYE
GOODBYE
+go+exit
Transferred: sent 2952, received 2576 bytes, in 1.1 seconds
Bytes per second: sent 2622.8, received 2288.7
Creating remote shell:
machine (localhost)
remote shell (ssh_p9999)
local shell (csh)
user name (andy)
password is 0 characters
Remote shell command:
arg 0: ssh
arg 1: -XC
arg 2: -p
arg 3: 9999
arg 4: -v
arg 5: -o
arg 6: ForwardX11=yes
arg 7: -l
arg 8: andy
arg 9: localhost
arg 10: echo
arg 11: +hi+
arg 12: &&
arg 13: csh
arg 14: -i
End remote shell command
OpenSSH_5.9p1 Debian-5, OpenSSL 1.0.1c 10 May 2012
Authenticated to localhost ([127.0.0.1]:9999).
unalias precmd; set prompt=+go+; unset echo
+hi+
% Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
+go+unalias *
+go+if ($?PATH) setenv PATH /usr/bin/perl:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:${PATH}
+go+if ($?PATH == 0) setenv PATH /usr/bin/perl:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11
+go+date; echo CMDSTAT=$status
Wed May 30 20:06:32 PDT 2012
CMDSTAT=0
+go+echo $DISPLAY; echo CMDSTAT=$status
localhost:11.0
CMDSTAT=0
+go+xset q; echo CMDSTAT=$status
Keyboard Control:
  auto repeat:  on    key click percent:  0    LED mask:  00000002
  XKB indicators:
    00: Caps Lock:   off    01: Num Lock:    on     02: Scroll Lock: off
    03: Compose:     off    04: Kana:        off    05: Sleep:       off
    06: Suspend:     off    07: Mute:        off    08: Misc:        off
    09: Mail:        off    10: Charging:    off    11: Shift Lock:  off
    12: Group 2:     off    13: Mouse Keys:  off
  auto repeat delay:  500    repeat rate:  33
  auto repeating keys:  00ffffffdffffbbf
                        fadfffeffffdffff
                        9fffffffffffffff
                        fff7ffffffffffff
  bell percent:  50    bell pitch:  400    bell duration:  100
Pointer Control:
  acceleration:  1/1    threshold:  5
Screen Saver:
  prefer blanking:  yes    allow exposures:  yes
  timeout:  0    cycle:  0
Colors:
  default colormap:  0x20    BlackPixel:  0    WhitePixel:  16777215
Font Path:
  /usr/share/fonts/X11/misc,/usr/share/fonts/X11/100dpi/:unscaled,/usr/share/fonts/X11/75dpi/:unscaled,/usr/share/fonts/X11/Type1,/usr/share/fonts/X11/100dpi,/usr/share/fonts/X11/75dpi,/var/lib/defoma/x-ttcidfont-conf.d/dirs/TrueType,built-ins,/usr/share/fonts/X11/misc,/usr/share/fonts/X11/100dpi/:unscaled,/usr/share/fonts/X11/75dpi/:unscaled,/usr/share/fonts/X11/Type1,/usr/share/fonts/X11/100dpi,/usr/share/fonts/X11/75dpi,/var/lib/defoma/x-ttcidfont-conf.d/dirs/TrueType,built-ins
DPMS (Energy Star):
  Standby: 0    Suspend: 0    Off: 0
  DPMS is Enabled
  Monitor is On
CMDSTAT=0
+go+echo $PATH; echo CMDSTAT=$status
/usr/bin/perl:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
CMDSTAT=0
+go+if (-x  channel 1/xterm) echo TRUE
+go+if: Expression Syntax.
Creating remote shell:
machine (localhost)
remote shell (ssh_p9999)
local shell (csh)
user name (andy)
password is 0 characters
Remote shell command:
arg 0: ssh
arg 1: -XC
arg 2: -p
arg 3: 9999
arg 4: -v
arg 5: -o
arg 6: ForwardX11=yes
arg 7: -l
arg 8: andy
arg 9: localhost
arg 10: echo
arg 11: +hi+
arg 12: &&
arg 13: csh
arg 14: -i
End remote shell command
OpenSSH_5.9p1 Debian-5, OpenSSL 1.0.1c 10 May 2012
Authenticated to localhost ([127.0.0.1]:9999).
unalias precmd; set prompt=+go+; unset echo
+hi+
% Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
+go+unalias *
+go+date; echo CMDSTAT=$status
Wed May 30 20:07:17 PDT 2012
CMDSTAT=0
+go+exit; echo GOODBYE
GOODBYE
+go+exit
Transferred: sent 2952, received 2592 bytes, in 1.1 seconds
Bytes per second: sent 2626.2, received 2305.9
Creating remote shell:
machine (localhost)
remote shell (ssh_p9999)
local shell (csh)
user name (andy)
password is 0 characters
Remote shell command:
arg 0: ssh
arg 1: -XC
arg 2: -p
arg 3: 9999
arg 4: -v
arg 5: -o
arg 6: ForwardX11=yes
arg 7: -l
arg 8: andy
arg 9: localhost
arg 10: echo
arg 11: +hi+
arg 12: &&
arg 13: csh
arg 14: -i
End remote shell command

Gets Around
Wow, you made a lot of good progress in a short time. From checking man pages for ssh and scp it looks like the rationale for not being consistent with -p/-P is that they preferred to be compliant with "cp" command usage of "-p" to preserve modification times of copied files. Then I would have (wrongly) guessed that they would have gone with uppercase -P for the ssh command and got back consistency on both fronts. Or they could support either -p or -P for the port with ssh and not lose anything. Anyway, we have never had to specify the port with ssh/scp for a custom remote shell or we would have had the same problem you did. I think your approach is definitely the right one (separate the ssh and scp commands).

In regards to the "singleConnect" directive usage, you'd actually want to do "singleConnect: true" rather than "singleConnect: check". The latter is a special case where it sometimes does file transfer via ssh and sometimes does a separate scp command. To decide which, it "checks" (hence the cryptic value for the variable) whether the machine you are going to from your ECCE client host is on the same domain or not. If it is on a different domain then it uses a shared/single ssh connection for file transfer. If it is the same domain then it does a separate command. So I'm guessing in your case it is on the same domain and therefore you would never see anything different than if you had never used the singleConnect directive. Setting the value to "true" will force it to share a single ssh connection and you should see this in the $ECCE_RCOM_LOGMODE output (no scp command is issued; instead a "dd" command is used that echoes out the files to perform the transfer). The "check" value is useful in the case where users from outside the local domain are treated differently from those inside, and this is the case for the EMSL chinook cluster. Outside requires a "one-time login" credential in addition to a password where inside only the password is needed. So, we really wanted to avoid prompting the user more than a single time for this one-time credential to do a job launch (even though technically it would work just fine--just seems a little strange and annoying which is something we strive to avoid with ECCE) and therefore came up with this strategy. We've also found for bigger files that doing the ssh/dd based file transfer isn't as reliable as an scp command or else we would just switch ECCE over to never do scp commands since then users would never have to be concerned with scp command syntax.
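To give a rough idea of what a dd-based transfer looks like, here is a sketch. It is not ECCE's actual implementation, and unlike ECCE it opens its own connection rather than reusing one; copy_via is a hypothetical helper that simply pipes a local file into dd on the far side of whatever command transport is given.

```shell
# Hypothetical helper: send a local file through a command transport
# by piping it into dd on the far side.
# Usage: copy_via SRC DST TRANSPORT...
#   e.g. copy_via nwch.nw /home/andy/jobs/nwch.nw ssh -p 9999 andy@localhost
copy_via() {
    src=$1
    dst=$2
    shift 2
    # The transport runs the quoted string as a shell command and
    # receives the file contents on its standard input.
    "$@" "dd of='$dst'" < "$src"
}
```

Because only an ssh command line is involved, there is no separate scp syntax (such as -P versus -p) to get right.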

That's a lot of output you included for your remaining xterm issue (I know of course that's due to how verbose ECCE is when doing remote operations). However, I see one potential part of that output that I think could be related to your problem. Do you see the line like:

if (-x channel 1/xterm) echo TRUE

and then the next line says it is an "if expression" syntax error? That to me indicates it is bailing just before trying to invoke the xterm. The reason is that it doesn't think the xterm command exists and the reason for that is this failed "if" command. The question then becomes where this "channel 1" part of the path is coming from. Do you have anything in your CONFIG.<machine> file that looks like that? Clearly the space between "channel" and "1" is not what would be expected for a valid expression and if we can figure that out, I bet we can get remote xterms and tail commands working for you.

Thanks for linking in your blog on making NWChem/ECCE work for you. I'm very impressed with your resourcefulness and the lengths you've gone to--you have persevered through the adversity where others would have bailed. I just scanned through most of it rather than picking up detail. One thing that I did notice is your issues with OpenGL where you suggested moving the shared libraries to another directory. While that's perfectly workable, this would be another instance where consulting the $ECCE_HOME/siteconfig/site_runtime file would be useful. There you would learn about the $ECCE_MESA_OPENGL and $ECCE_MESA_EXCEPT variables that control whether to use the ECCE-supplied GL libraries or native ones (e.g. hardware OpenGL card drivers) on your machine.

Gary

Gets Around
By the way, in looking back at your older ECCE posts on your blog, on May 7 you had issues upgrading from ECCE 6.2 to 6.3 in regards to ECCE not being able to find scripts it needed to generate input files such as creating basis sets. Your solution was the manual way for something that is a basic part of ECCE setup for users (maybe you've since figured this out). There is a $ECCE_HOME/scripts/runtime_setup.sh sh/bash environment setup script that you can invoke to set up the paths as needed. This is documented in the list of steps needed when you install ECCE after it is done extracting the distribution right before the install finishes. When you actually invoke ecce then the rest of the environment (such as putting the scripts/parsers directory in the path) is done by the ecce_env script.

Another feature that may or may not be useful to you with this special node that is set up with a higher ulimit for submitting your ECCE jobs is that ECCE has a "hop" feature that lets it go from a main login node on a machine to other nodes before actually running commands (e.g. submitting jobs). If you look at the $ECCE_HOME/siteconfig/CONFIG-Examples/CONFIG.mpp2 file, you'll see the "frontendMachine" directive that is used to do this. I'm thinking this might allow you to skip the port redirect options with ssh and just "hop" to your special node from the regular login node on the compute host. But I don't think I'd worry about it if what you have now is working fine.

Gary

Gets Around
Gary,
thanks for all that information and help!

I have to admit that I've been somewhat lax in reading the documentation. To some extent I blame the presence of somewhat outdated information on the EMSL site, but mostly it's a real oversight on my part and I will amend this in the future.

In particular the main node-sub node hop part is very interesting since it'd be much easier to set up for non-technical users. Luckily, I'm about to help another research group set up ECCE management of their cluster so will have the opportunity to explore that in more detail -- I'll post my experience on the blog and will link here once it's up.

EDIT: Here's how to use node hopping [blog http://verahill.blogspot.com.au/2012/06/ecce-and-inaccessible-cluster-nodes.html]. The example shows how to access a node on a cluster directly without SGE, but modifying it for the remote site example that precipitated my first set of questions is just as easy.

Which brings me to a somewhat related item:
I understand that ECCE is going full open source this summer, which sounds like a great thing, in particular if the community picks it up and becomes active contributors. I also understand that there hasn't been adequate funding (if any) for maintaining the documentation of ECCE for many years, and I'd suspect that this has been a bit of a hurdle in terms of adoption. Certainly this won't change. However, has any thought been given to the possibility of creating a wiki-type documentation for ECCE? The general documentation of NWChem on this site is halfway there, but editing isn't open to forum members. I'm sure there are a lot of people like myself who run into problems, find a solution and would like to share it with future/other users. Creating a post on a forum isn't quite the same thing, since you'd primarily do that if you have a question rather than a solution. I also acknowledge that policing posts can potentially become a real problem very quickly if the restrictions are too lax.

Finally, what's the preferred way of communicating minor bugs at the moment? Or is everything on hold until the open source transition? Two things that come to mind are:
  • In the ECCE NWChem editor, if you look at the Theory overview (just what's listed in the overview i.e. not clicking on Details) choosing a DFT method doesn't change "SCF Max Iterations" to "DFT Max iterations", even though this is what you actually set when you click on Details/SCF Max Iterations -- i.e. there's a bit of mislabelling going on. Secondly, there should be a separate field for SCF maxiter when doing DFT since 30 cycles often isn't enough for reasonably large, inorganic species. It's easily compensated for by editing the input file, and I suspect (wildly guessing) that editing the codereg/*.py files may solve this.

  • If you're setting up an MD simulation: if (in the viewer/builder) you draw the backbone of the molecule first and then add the protons by hand AMBER will sort out the atom types automatically. If you hit "Add H" the H's are inserted between the backbone molecule in the atoms list i.e. instead of
1 C 0 0 0
2 C 0 1 0
3 H 0 -1 0
4 H 0 2 0
you get
1 C 0 0 0
2 H 0 -1 0
3 C 0 1 0
4 H 0 2 0
and AMBER doesn't know what to do half the time in terms of assigning atom types (I've had the problem with oxygen atoms in particular). I'm not saying that automatically assigning atomtypes is a safe thing to do, but it's an observation.

I apologise for all the unrelated questions above -- but I didn't feel they warranted separate threads.

Anyway, again thank you for all the help!

Gets Around
Both the http://ecce.pnl.gov website and the context-based help available from ECCE applications definitely contain some dated information. But both should still largely be relevant. The best place by far to get updated information on ECCE is the release notes, which are updated with each new version of ECCE. Extra time is spent on those precisely because the rest of the documentation primarily dates back several years. The release notes are available from the website and I always include a link to them when I send out a release announcement to the ECCE user list.

For the open source release of ECCE it is a matter of getting the final approvals needed as the ECCE 6.3 release was the bulk of the technical work including a source code build environment and script to automate the build. ECCE is written primarily in C++ for the core applications and then there is a good chunk of both perl and python scripting for the parts like code registration and compute host and batch queue scheduler registration. So the C++ part of it might limit contributions compared to something like NWChem developed in Fortran primarily. As far as wiki documentation, that gets back to (lack of) funding if the documentation is populated here. Of course a template for community generated documentation could be done quite easily, but that working out and thriving would depend on the size and activity of the ECCE user community.

The wiki forum here is the spot to report errors and all other support requests. Your first item looks like it might be a pretty simple fix and I'll try to take a look at that. You are right that those should just be some python script changes and not C++ core application code. The second item I'm not as certain if there would be a simple fix since I'm not as familiar with the underlying code (as well as the remote communications code and job launching/monitoring, I wrote the electronic structure calculation editor application including the code registration framework).

Gary

Gets Around
xterm channel issue
Hi Gary,
I've grepped my way through all the files and no sign of any [Cc]hannel anywhere. I think that might be hard-coded since it goes away if you specify xterm in the remote_shells.site file.

At the beginning of these examples I'm using
ssh_p9999: ssh -XC -p 9999|scp -P 9999


In terms of diagnostics, here's a successful xterm opening (node on my LAN, port 22):
DPMS (Energy Star):
  Standby: 0    Suspend: 0    Off: 0
  DPMS is Enabled
  Monitor is On
CMDSTAT=0
echo $PATH; echo CMDSTAT=$status
/usr/bin/perl:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
CMDSTAT=0
+go+if (-x /usr/bin/perl/xterm) echo TRUE
+go+if (-x /bin/xterm) echo TRUE
+go+if (-x /usr/sbin/xterm) echo TRUE
+go+if (-x /sbin/xterm) echo TRUE
+go+if (-x /usr/X11R6/bin/xterm) echo TRUE
+go+if (-x /usr/bin/X11/xterm) echo TRUE
TRUE
+go+exit; echo GOODBYE
GOODBYE
+go+exit


and here's a (new) unsuccessful one of the same section (there's definitely a time-out issue going on from the freeze/delay when hitting 'Open'):
DPMS (Energy Star):
  Standby: 0    Suspend: 0    Off: 0
  DPMS is Enabled
  Monitor is On
CMDSTAT=0
+go+echo $PATH; echo CMDSTAT=$status
/usr/bin/perl:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
CMDSTAT=0
+go+if (-x  channel 1/xterm) echo TRUE
if: Expression Syntax.
+go+if (-x  free/xterm) echo TRUE
+go+if (-x  x11, nchannels 2
/usr/bin/perl/xterm) echo TRUE
+go+Too many ('s.
+go+Too many )'s.


I know: it doesn't help that it's a bit different from what it was before. The Too many ('s looks like a smoking gun. But we can get rid of that too by chucking xterm at the end of the shell definition line (whether I use /usr/bin/X11/xterm or just xterm doesn't matter)
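One untested guess about where "channel 1" comes from: ECCE adds -v to the ssh command (arg 4 in the logs above), and the odd fragments look like a verbose ssh message of the form "debug1: channel 1: free: x11, nchannels 2" that got mixed into the PATH output and was then split on ":". The sketch below only illustrates that idea; the debug line and the splitting logic are assumptions, not something taken from the ECCE source, and path_checks is a hypothetical helper.

```shell
# Hypothetical reconstruction: split a string on ":" and build one
# csh-style existence check per piece, the way a PATH would be walked.
path_checks() {
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        printf 'if (-x %s/xterm) echo TRUE\n' "$dir"
    done
    IFS=$old_ifs
}

# Assumed input: an ssh debug line followed by the start of the real PATH.
polluted='debug1: channel 1: free: x11, nchannels 2
/usr/bin/perl:/bin'

path_checks "$polluted"
```

If this guess is right, the generated checks include "if (-x  channel 1/xterm)" and "if (-x  free/xterm)", with the same double space as in the log, plus one check that is broken across two lines.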

Anyway, both systems are debian, both systems have /usr/bin/X11/xterm

So, I changed to
ssh_p9999: ssh -XC -p 9999| scp -P 9999| /usr/bin/X11/xterm

which got me
/usr/bin/X11/xterm: No absolute path found for shell: andy
Failed to open remote shell on localhost (incorrect password?)


More info from the unsuccessful run:
Remote shell command:
arg 0: /usr/bin/X11/xterm
arg 1: -l
arg 2: andy
arg 3: localhost
arg 4: echo
arg 5: +hi+
arg 6: &&
arg 7: csh
arg 8: -i
End remote shell command
/usr/bin/X11/xterm: No absolute path found for shell: andy


And therein lies a problem as
/usr/bin/X11/xterm -l andy localhost

gives
/usr/bin/X11/xterm: No absolute path found for shell: andy


Going back to a successful example there's no similar Remote shell command block. Instead
Running remote background command with args:
command (/usr/bin/X11/xterm -title "tantalum" -bg "#b7b8ba" -fg "#000000" -sb -e)
args (csh -c "cd ~ && $SHELL")
machine (tantalum)
remote shell (ssh)
user name (andy)
password is 0 characters
Running remote background command:
command (/usr/bin/X11/xterm -title "tantalum" -bg "#b7b8ba" -fg "#000000" -sb -e csh -c "cd ~ && $SHELL")


Basically, it seems like the way xterm is called in the unsuccessful case is malformed.

I think I'm a bit closer after actually reading the remote_shells.site file
#kerberos: rsh|rcp -r|rxterm -l ##user## -x ##-e command## ##machine##|kauth -h ##machine##

Looking at that, and rearranging (can't have -l user first)
ssh_p9999: ssh -p 9999| scp -P 9999|/usr/bin/X11/xterm -e csh -i -l andy oxygen 

gave this on trying to do 'tail -f'
Running remote command:
command (tail -f /home/andy/jobs/testing/coacac_II_lanl2dz/nwch.nwout)
machine (localhost)
remote shell (/usr/bin/X11/xterm -e csh -i -l andy oxygen)
local shell (csh)
user name (andy)
password is 0 characters
Creating remote shell:
machine (localhost)
remote shell (/usr/bin/X11/xterm -e csh -i -l andy oxygen)
local shell (csh)
user name (andy)
password is 0 characters
Remote shell command:
arg 0: /usr/bin/X11/xterm -e csh -i -l andy oxygen
arg 1: -l
arg 2: andy
arg 3: localhost
arg 4: echo
arg 5: +hi+
arg 6: &&
arg 7: csh
arg 8: -i
End remote shell command
Failed to open remote shell on localhost (incorrect password?)
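
One thing that stands out in this log is that arg 0 is the whole configured string, spaces included, whereas in the ssh_p9999 log further down the ssh command is split into separate arguments. The Python sketch below shows, under assumptions, how a remote_shells.site entry could be broken into its pipe-separated fields, and how a command string differs when passed as a single token versus split on whitespace. The field labels (shell, copy, xterm, auth) and the function parse_site_entry are hypothetical, guessed from the kerberos template; this is not ECCE's parser.

```python
import shlex

FIELD_NAMES = ("shell", "copy", "xterm", "auth")  # hypothetical labels


def parse_site_entry(line):
    """Split 'name: a|b|c' into (name, {field label: command string})."""
    name, _, rest = line.partition(":")
    parts = [p.strip() for p in rest.split("|")]
    return name.strip(), dict(zip(FIELD_NAMES, parts))


if __name__ == "__main__":
    name, fields = parse_site_entry(
        "ssh_p9999: ssh -p 9999| scp -P 9999|/usr/bin/X11/xterm -e csh -i -l andy oxygen"
    )
    print(name, fields)
    # One token: the program name would be the whole string, spaces included.
    print([fields["xterm"]])
    # Split on whitespace: a program name plus separate arguments.
    print(shlex.split(fields["xterm"]))
```

If the command really is passed on as one token, the system would look for a program literally named "/usr/bin/X11/xterm -e csh -i -l andy oxygen", which might explain the failure. That is only one reading of the log.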


(I think in theory the following should be enough:
/usr/bin/X11/xterm -e csh -i 

Finally, I've omitted the -x seen in the kerberos example since it's not recognised.)

While looking ugly, this actually works if executed in a shell on the remote node:
/usr/bin/X11/xterm -e csh -i-l andy oxygen -l andy localhost 

as does this
/usr/bin/X11/xterm -e csh -i-l andy oxygen -l andy localhost -c "echo +hi+ && csh -i"

It actually also works when executed on the ecce workstation, so it's not a matter of executing it in the wrong place.

I'm stumped.

The only other thing I could imagine is the 'remote shell (ssh_p9999)' looking a bit funny in
Creating remote shell:
machine (localhost)
remote shell (ssh_p9999)
local shell (csh)
user name (andy)
password is 0 characters
Remote shell command:
arg 0: ssh
arg 1: -XC
arg 2: -p
arg 3: 9999
arg 4: -v
arg 5: -o
arg 6: ForwardX11=yes
arg 7: -l
arg 8: andy
arg 9: localhost
arg 10: echo
arg 11: +hi+
arg 12: &&
arg 13: csh
arg 14: -i
End remote shell command
OpenSSH_5.9p1 Debian-5, OpenSSL 1.0.1c 10 May 2012
Authenticated to localhost ([127.0.0.1]:9999).


But I really don't know. I'll update if I come up with a solution.

Gets Around
Andy (I assume that's your real name and not just a user name),

I decided to remove the code that checks the directory where xterm is found and just attempt to run the command. Hopefully that helps with part of your problem at least since it will no longer be doing the checks that led to a syntax error in your case. I haven't pushed out this new ECCE 6.3 version yet.

I was looking into the issue with the SCF and DFT max. iterations. I see that right now ECCE never includes an SCF block if the user specifies a DFT level of theory. But, I take it that we actually should have both an SCF and DFT block in this case? If that's true, do we need "task" statements for both or are the task statements just for the DFT part? If there is a task statement for the SCF block, what should it look like? Right now, for example, if you are doing a DFT geometry optimization, it is "task dft optimize". Would there also be a preceding "task scf optimize" in this case? If you are also doing a Vibration/Frequency analysis in addition to an optimization (i.e. the GeoVib runtype in ECCE), does this mean you also need a "task scf freq" giving a total of 4 task statements with 2 each for SCF and DFT? There are some other subtleties as well--with SCF we use the nopen statement, but then with DFT it switches to mult (when it's not a singlet). If we have both SCF and DFT blocks, would we have both of these? I was actually planning to fix this issue by adding a new "DFT max. iterations" field in the DFT section of the theory details GUI. Then no name changes would be necessary. But, looking at how other fields like the convergence algorithm are put in the DFT block if that's the theory although it's under SCF in the GUI, I'm wondering if that's the right approach.

Gary

Gets Around
Morning Gary,
Ideally both the number of scf and dft iterations should be modifiable -- in particular as you almost always have to at least double the default number of scf iterations for reasonably interesting inorganic molecules.

SCF doesn't need its own task statement -- I guess dft implies scf. As far as I have experienced, task scf is only ever used when (r/u/ro)HF is explicitly used.

You would have both a scf and a dft block since they do different things. So, the only GUI changes needed are to
1. edit the names of the fields when you do dft, since changing what is labelled as 'scf maxiter' really changes the dft 'iterations' statement.
2. add some of the scf fields for dft calcs.

The logical order would be scf first, then dft.
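
As an illustration of the layout described above, here is a hedged Python sketch that writes an scf block followed by a dft block, with task statements for dft only. The function name nwchem_blocks and its parameters are hypothetical, the iteration counts are placeholder values, and the output is a fragment, not a complete NWChem input and not what ECCE currently generates.

```python
def nwchem_blocks(scf_maxiter, dft_iterations, mult=1, tasks=("optimize",)):
    """Return an input fragment: scf block, dft block, then dft-only tasks."""
    # scf block first, carrying its own iteration limit.
    lines = ["scf", f"  maxiter {scf_maxiter}", "end", ""]
    # dft block second; mult is written only when it is not a singlet.
    lines += ["dft", f"  iterations {dft_iterations}"]
    if mult != 1:
        lines.append(f"  mult {mult}")
    lines += ["end", ""]
    # Task statements refer to dft only; no separate 'task scf' line.
    lines += [f"task dft {t}" for t in tasks]
    return "\n".join(lines)


if __name__ == "__main__":
    print(nwchem_blocks(100, 100, mult=2, tasks=("optimize", "freq")))
```

For a GeoVib-style run this gives two task lines (optimize, then freq), both for dft.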

The one question which is harder to answer (and which you ask) is what scf fields to include -- as you point out you set mult in the dft block, while if you're doing only scf you can use nopen or a string ('doublet') instead. I don't have a good answer. I don't think we need to bother about it though.

Looking at what's available when you do 'pure' SCF, I think it would be sufficient to allow (when doing dft) adjustment of the following SCF block parameters:
1. SCF convergence algorithm
2. max iterations
3. gradient
4. Computation direct/semi-direct

Everything else is already being set by dft. Also, to me it looks like the symmetry settings end up under geometry and not scf, so it's not relevant. Same goes for cosmo, which has its own block.

As for xterm, I'm still working on my coffee so I'm not thinking clearly -- but my impression was that xterm was actually being found ok. It was the order of the switches that made things a bit screwy e.g.
xterm -e csh -i -l andy
works, but
xterm -l andy -e csh -i
does not. The shell must come before anything else.
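
A toy Python model of this ordering rule, assuming (as I understand the xterm manual page) that -e must be the last option and takes everything after it as the command to run, while a bare word before -e is read as the shell. classify_xterm_args is a hypothetical helper that knows only the handful of options seen in this thread; it is not xterm's real parser.

```python
# Options from this thread that take a value (assumed; the list is not exhaustive).
TAKES_VALUE = {"-title", "-bg", "-fg"}


def classify_xterm_args(args):
    """Split xterm arguments into options, a bare 'shell' word, and the -e command."""
    options, shell, command = [], None, []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "-e":
            command = args[i + 1:]  # everything after -e is the command
            break
        if arg.startswith("-"):
            options.append(arg)
            if arg in TAKES_VALUE and i + 1 < len(args):
                i += 1
                options.append(args[i])
        elif shell is None:
            shell = arg  # a bare word is read as the shell to run
        i += 1
    return options, shell, command


if __name__ == "__main__":
    print(classify_xterm_args("-e csh -i -l andy".split()))
    print(classify_xterm_args("-l andy -e csh -i".split()))
```

In the first call everything lands in the -e command. In the second, andy is classified as the shell, which matches the "No absolute path found for shell: andy" error.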

/Andy

