Floating Point Exception using shared file IO with TCE


Gets Around
I have an Intel i7 system with 4 physical cores and 16 GB of memory. I am running under 64-bit Ubuntu 14.04, and I built NWChem against OpenBLAS 0.2.8 with the 64-bit integer interface (build settings sketched below).
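
For reference, the build environment looked roughly like this; the paths and the OpenBLAS install location are illustrative rather than my exact setup, and the variable set may differ slightly between snapshots:

# sketch of the build setup (paths are illustrative)
export NWCHEM_TOP=$HOME/nwchem-dev.revision25716
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=all
export USE_MPI=y
export BLASOPT="-L/opt/openblas-ilp64/lib -lopenblas"   # OpenBLAS compiled with 64-bit integers
cd $NWCHEM_TOP/src
make nwchem_config
make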

I am using the June 2014 NWChem snapshot: http://nwchemgit.github.io/download.php?f=Nwchem-dev.revision25716-src.2014-06-09.tar.gz

Specifically, that corresponds to NWChem revision 25716 and ga revision 10496.

I am trying to run a calculation like this:

start dccsinglet

echo

print high

memory stack 3000 mb heap 200 mb global 3600 mb

charge 0

geometry units angstroms
   C       -0.13183        0.72345       -0.07866
  Cl       -1.15973       -0.55669       -0.69209
  Cl        1.24554        0.01838        0.74329
  symmetry c1
end

basis spherical
  * library aug-cc-pvtz
end

scf
  singlet
  uhf
end

tce
  io sf
  ccsd
end

task tce energy


I started out using "io ga" and slightly more generous memory settings, but even running with only 2 cores my machine was swapping once I got to the CCSD part. (If I understand the memory keyword correctly, those figures are per process, so two processes at settings like the ones above already claim most of the 16 GB.) I interrupted the job, lowered the memory settings, and switched the I/O scheme to shared file as shown above. I understand that disk based schemes will be slow, but they should still work if I am patient, and I can upgrade disk speed more easily than I can install more RAM. I am using a disk with about 2 TB of free space for my calculations, and none of my attempts ever produced more than about 6 GB of files.

The problem is that the job crashes with a floating point exception as soon as it reaches the CCSD portion if I use shared file IO and two processors, run like this:

mpirun -np 2 nwchem dcc-singlet.nw | tee dcc-singlet.nwo


Output from two processor attempt: http://pastebin.com/fSs4X0Et

If I run with only one processor, the job lives longer but ultimately crashes with INVALID ARRAY HANDLE:

mpirun -np 1 nwchem dcc-singlet.nw | tee dcc-singlet.nwo


Output from quasi-serial attempt: http://pastebin.com/xp3ZwVak

I also tried decreasing the tile size with no luck (roughly as in the sketch below); I can provide that output too if you want to see it. I also tried the replicated, fortran, and eaf IO options, both with one and two processors, and they all failed in a similar manner, though I didn't save all of their outputs. Are the disk based IO schemes currently unsupported? I did a find/xargs/grep search through the QA directory and didn't find a single TCE test using disk based IO.
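
For the record, the other disk based attempts used tce blocks along these lines; the tile size shown is illustrative rather than the exact value I tried:

tce
  io eaf        # also tried: replicated, fortran (and sf as above)
  tilesize 15   # illustrative reduced tile size
  ccsd
end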

Forum Vet
Mernst
Please apply the following patch to address the floating-point exception observed when running in parallel:
http://nwchemgit.github.io/images/Tcemutexes.patch.gz
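
A sketch of how a gzipped patch like this is typically applied and the code rebuilt, assuming the NWChem source tree lives in $NWCHEM_TOP (check the -p level with --dry-run first, since it depends on how the patch was generated):

cd $NWCHEM_TOP
gunzip -c Tcemutexes.patch.gz | patch -p0 --dry-run   # verify the patch paths line up
gunzip -c Tcemutexes.patch.gz | patch -p0
cd $NWCHEM_TOP/src/tce && make                        # rebuild the patched TCE module
cd $NWCHEM_TOP/src && make link                       # relink the nwchem binary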

Forum Vet
Quote: Mernst, Jul 23rd 3:08 am
Are the disk based IO schemes currently unsupported?


Yes, it's fair to say that the disk based IO schemes are not really maintained.
Please use the default GA based algorithm.
If you download the latest dev tarball and use three processes with the following input file,
the calculation should be affordable.
Cheers, Edo

start dccsinglet

echo

memory stack 1000 mb heap 200 mb global 1600 mb  

charge 0

geometry units angstroms
   C       -0.13183        0.72345       -0.07866
  Cl       -1.15973       -0.55669       -0.69209
  Cl        1.24554        0.01838        0.74329
  symmetry c1
end

basis spherical
  * library aug-cc-pvtz
end

scf
  direct                            # recompute integrals on the fly rather than storing them
  singlet
# uhf
  vectors input dccsinglet.movecs   # reuse the MO vectors saved by a previous run
end

tce
# io sf
  2eorb           # compact MO-basis storage of the 2-e integrals (needs an RHF/ROHF reference)
  2emet 15        # selects the algorithm used to generate and store the 2-e integrals with 2eorb
  attilesize 16   # AO tile size used in the 2eorb transformation
  tilesize 16     # MO tile size
  ccsd
end

task tce energy

Gets Around
I can economize memory on this calculation by using RHF instead of UHF, but I can't use RHF for the companion calculation on the triplet state. ROHF has more options for economizing memory with the TCE, but ideally I'd like to try both ROHF and UHF (roughly as sketched below). Of course, if that's not possible, I will just have to adjust my approach and expectations.
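
For the triplet companion calculation, the reference choice I have in mind is along these lines; this is only a sketch of the scf block, not a complete input:

scf
  triplet
  rohf     # restricted open-shell reference, which keeps the TCE 2eorb memory savings available
# uhf      # the unrestricted alternative I would also like to try
end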

I grabbed the latest July tarball, patched and rebuilt, and tried my original input again. The patch fixed the floating point exception but led to the same invalid array handle crash that I experienced with my previous quasi-serial attempt: http://pastebin.com/gqtLnXt7

It's understandable that HPC facility users overwhelmingly choose the in-core GA approach and that the other schemes have fallen into disuse. Updating the docs to warn that non-GA IO doesn't currently work would be helpful: http://nwchemgit.github.io/index.php/TCE

Gets Around
bug in mutex patch?
I was running the QA tests in parallel (roughly as in the command sketch below) after building the patched July snapshot and noticed that the li2h2_tce_ccsd.nw test was stuck. It had been running for 70 minutes with the CPUs pinned at 100% but making no progress. The unpatched version completes the job in a few minutes with either serial or parallel execution; the patched version completes it only in serial execution. Having written deadlocking mutex code myself in the past, this feels like a deadlock to me. Here's the output from a 2-processor attempt up to the point where it stalls, though I didn't spot any great clues in it: http://www.sciencemadness.org/cc/li2h2_tce_ccsd.nwo.gz
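
For reference, the hanging QA run was launched roughly like this, using the standard runtests.mpi.unix driver in the QA directory (the executable path is illustrative):

cd $NWCHEM_TOP/QA
export NWCHEM_EXECUTABLE=$NWCHEM_TOP/bin/LINUX64/nwchem
./runtests.mpi.unix procs 2 li2h2_tce_ccsd   # hangs with the mutex patch; completes with procs 1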

Gets Around
Bump - li2h2_tce_ccsd hangs
I noticed in commit 27051 that the li2h2_tce_ccsd.nw test had been commented out because it was hanging on some machines. It has been hanging for me since the TCE mutexes patch of July 23 2014 (old thread above). It hangs when run with more than 1 processor under Linux with OpenMPI or under OS X with mpich2.

Forum Vet
I can confirm that under Mac OS X 10.10.3 with mpich2-3.1.3_1 and no patches installed, the li2h2_tce_ccsd test passes in parallel in a reasonably short time, although the projected frequencies 1 through 6 and the constants A, B, and C differ slightly from those in the official output. The test for which Mernst prepared the input, however, gets stuck and produces the following:

...

================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 68269 RUNNING AT
= EXIT CODE: 79
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
================================================================

