Gary,
Cheers for the PDF links -- I've spent two hours today looking at the perl scripts, focussing on the gaussian-03.vib script. I have to admit defeat -- the programming looks clever, but obfuscated. I'm happy to provide g09 output files from ECCE runs if need be, but I am no friend of perl.
I've tried the new ECCE binaries, and tail -f works perfectly when node hopping. From what I can see it doesn't quite work with port forwarding, but that's not critical.
I've got a new problem though:
With the new binaries I'm seeing odd behaviour: if I submit a job via node hopping to a system with SGE, everything goes fine in the sense that the job gets queued. However, ECCE doesn't show that the job is queued, i.e. I still see a Blue Triangle rather than a Pale Green Dot. I've posted an example screenshot at http://verahill.blogspot.com.au/2012/06/troubleshooting-ecce.html
This seems like a new regression. Once the job starts the icon changes to a solid green dot as it should. Same if I log in remotely and qdel the job -- the job is recognised as being deleted by ecce and updated as such.
Case 1. If I submit to SGE on the same computer as ECCE, it works as it should.
Case 2. If I submit to SGE on a remote site via port forwarding, it works as it should.
Case 3. If I submit to SGE via node hopping, the queued state isn't shown in ECCE.
Also, in Case 3 I see it echo a whole lot of things, including perl scripts -- I don't see this happen in Cases 1 and 2 above.
...
end
task dft optimize
2063+0 records in
4+1 records out
2063 bytes (2.1 kB) copied, 0.18434 seconds, 11.2 kB/s
CMDSTAT=0
+go+#!/bin/csh
# ECCE Submit Script
# Generated Mon Jun 11 13:23:02 EST 2012 with ECCE Version v6.3.
#
267 bytes (267 B) copied, 0.180516 seconds, 1.5 kB/s
CMDSTAT=0
+go+# parse Descriptor for NWCHEM output file
#
# Due to the way nwchem outputs U* theory mos, and the fact that we
# want to only parse the last one, the mo-related parsing is a little
# messy. A separate entry is required for alpha and beta properties.
# This applies to MO MOBETA ORBOCC ORBOCCBETA...
# Symmetry has been included.
#
[EGRADVEC]
Script=nwchem.egradvec
Begin=task_gradient%begin%total gradient
Frequency=all
End=task
[END]
..
8359+0 records in
16+1 records out
8359 bytes (8.4 kB) copied, 0.363564 seconds, 23.0 kB/s
CMDSTAT=0
+go+###############################################################################
#
# Filename:
#
# eccejobmonitor
#
# Abstract:
#
# This program implements a server that extracts data
Finally, I'm guessing that the "eccejobmonitor_went_bye_bye" below might be part of the puzzle. The output below is from submitting a job, which I then deleted with qdel on the remote server.
CMDSTAT=0
+go+exit; echo GOODBYE
date; echo CMDSTAT=$status
Sun Jun 10 20:00:36 PDT 2012
CMDSTAT=0
+go+uname -a; echo CMDSTAT=$status
Linux rupert.university.edu 2.6.18-238.19.1.el5xen #1 SMP Fri Jul 15 08:16:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
CMDSTAT=0
+go+if (-d /home/jdoe/.andy/jobs/testing/someone/performance-1) echo TRUE
TRUE
+go+cd /home/jdoe/.andy/jobs/testing/someone/performance-1
+go+perl eccejobmonitor -configFile eccejobmonitor.conf -jobId 382 -bookmark 0; echo eccejobmonitor_went_bye_bye
Creating remote shell:
machine (system)
remote shell ()
local shell (csh)
user name ()
password is 0 characters
Remote shell command:
arg 0: csh
arg 1: -fc
arg 2: echo +hi+ && csh -f
End remote shell command
+hi+
unalias precmd; set prompt=+go+; unset echo
% +go+unalias *
+go+date; echo CMDSTAT=$status
Mon Jun 11 13:00:37 EST 2012
CMDSTAT=0
+go+if (-d /tmp/ecce_andy/jobs/performance-1__aBh6Wr) echo TRUE
TRUE
+go+cd /tmp/ecce_andy/jobs/performance-1__aBh6Wr
+go+eccejobmonitor_went_bye_bye
+go+date; echo CMDSTAT=$status
Sun Jun 10 20:01:18 PDT 2012
CMDSTAT=0
+go+date; echo CMDSTAT=$status
Sun Jun 10 20:01:18 PDT 2012
CMDSTAT=0
+go+/bin/rm -f eccejobmonitor eccejobmonitor.conf eccejobmonitor.propbuf *.desc; echo CMDSTAT=$status
CMDSTAT=0
+go+exit; echo GOODBYE
GOODBYE
exit
[jdoe@rupert ~]$ exit
logout
Connection to rupert.local closed.
+go+exit; echo GOODBYE
GOODBYE
exit
[jdoe@rupert ~]$ exit
logout
Connection to rupert.university.edu closed.
Transferred: sent 164576, received 197496 bytes, in 66.5 seconds
Bytes per second: sent 2474.2, received 2969.1
exit; echo GOODBYE
[when I posted what follows below originally I hadn't yet solved it -- someone reading this later might find this helpful]
In addition, I suddenly had problems importing ecce output files -- it said "ERROR: Setup parse script NWChem.expt does not exist or is not executable.", even though everything that should be in PATH is there and the file is executable.
Part of the problem was that the ecce_env script (which is unchanged from the previous version) wasn't evaluated correctly (debian/csh):
if ( `echo $PATH | grep -c "${ECCE_HOME}/scripts/parsers"` == 0 ) then
Word too long.
[..]
if ( `echo $PATH | grep -c "/usr/sbin"` == 0 ) then
Word too long.
[..]
if ( `echo $PATH | grep -c ":.:"` == 0 ) then
Word too long.
[..]
if ( -x /home/andy/.ecce/ecce-6.3e/apps/rhel5-gcc4.1.2-m64/3rdparty/system/bin/python && `echo $PATH | grep -c "${ECCE_HOME}/${ECCE_SYSDIR}3rdparty/system/bin"` == 0 ) then
Word too long.
The problem was bsd-csh, which can only handle 1024 chars per line -- the "Word too long." message was referring to the length of $PATH. tcsh isn't supposed to have this limitation.
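One way to confirm this diagnosis is to compare the length of $PATH with the limit. The snippet below is only a rough sketch in POSIX sh: the helper name word_fits is made up for illustration, and the 1024 default is simply the figure quoted above, not something checked against the bsd-csh source.

```sh
#!/bin/sh
# word_fits is a hypothetical helper, not part of ECCE or csh.
# Prints "ok" if the string in $1 is at most $2 characters long
# (default 1024, the limit quoted above), otherwise "too long".
word_fits() {
    limit=${2:-1024}
    if [ "${#1}" -le "$limit" ]; then
        echo "ok"
    else
        echo "too long"
    fi
}

# Report the current length of PATH and whether it fits.
echo "PATH is ${#PATH} characters: $(word_fits "$PATH")"
```

If this reports "too long" for the PATH in use after ecce_env has been sourced, the "Word too long." errors above are the expected result under bsd-csh.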
The fix was simple (on debian):
sudo apt-get install tcsh
sudo update-alternatives --config csh
Then select tcsh from the list of alternatives.
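To check that the switch took effect, it can help to see which binary the csh command now resolves to. This is a minimal sketch assuming GNU readlink (as shipped on Debian); resolve_cmd is a hypothetical helper name used only for illustration.

```sh
#!/bin/sh
# resolve_cmd is a hypothetical helper: it prints the real file behind a
# command name, following any symlinks (for example those managed by
# update-alternatives). Returns non-zero if the command is not found.
resolve_cmd() {
    cmd_path=$(command -v "$1") || return 1
    readlink -f "$cmd_path"
}

# After the change this should print the path of the tcsh binary.
resolve_cmd csh || echo "csh not found in PATH"
```

Running update-alternatives --display csh should show the same information, along with the other candidates that are installed.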