Floating Point Exception using shared file IO with TCE

4:08:13 AM PDT - Wed, Jul 23rd 2014

I have an Intel i7 system with 4 physical cores and 16 GB of memory. I am running 64-bit Ubuntu 14.04, and I built NWChem against OpenBLAS 0.2.8 with the 64-bit integer interface.

I am using the June 2014 NWChem snapshot: http://nwchemgit.github.io/download.php?f=Nwchem-dev.revision25716-src.2014-06-09.tar.gz

Specifically, that corresponds to NWChem revision 25716 and GA revision 10496.

I am trying to run a calculation like this:

start dccsinglet

echo

print high

memory stack 3000 mb heap 200 mb global 3600 mb

charge 0

geometry units angstroms
   C       -0.13183        0.72345       -0.07866
  Cl       -1.15973       -0.55669       -0.69209
  Cl        1.24554        0.01838        0.74329
  symmetry c1
end

basis spherical
  * library aug-cc-pvtz
end

scf
  singlet
  uhf
end

tce
  io sf
  ccsd
end

task tce energy

I started out using "io ga" and slightly more generous memory settings, but even with only 2 cores my machine was swapping once it reached the CCSD part. I interrupted the job, lowered the memory settings, and switched the I/O scheme to shared file, as shown above. I understand that disk-based schemes will be slow, but they should still work if I am patient, and I can upgrade disk speed more easily than I can install more RAM. The disk I use for my calculations has about 2 TB of free space, and none of my attempts ever stored more than about 6 GB of files.
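For reference, here is a rough sketch of the memory arithmetic behind those settings. It assumes that the values on the memory line are per process, which is my understanding of the directive; the function name is just something I made up for illustration.

```python
# Rough memory budget for the input deck above.
# Assumption: the NWChem "memory" directive applies to each MPI process.
STACK_MB, HEAP_MB, GLOBAL_MB = 3000, 200, 3600  # values from the memory line
PHYSICAL_RAM_MB = 16 * 1024                     # 16 GB machine


def requested_mb(nprocs):
    """Total memory NWChem may allocate across nprocs processes."""
    return nprocs * (STACK_MB + HEAP_MB + GLOBAL_MB)


for nprocs in (1, 2, 4):
    total = requested_mb(nprocs)
    headroom = PHYSICAL_RAM_MB - total
    print(f"{nprocs} process(es): {total} MB requested, "
          f"{headroom} MB left for the OS and file cache")
```

Under that assumption, two processes already ask for 13600 MB of the 16384 MB available, and four processes would not fit at all.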

The problem is that, with shared file IO and two processors, the job crashes with a floating point exception as soon as it reaches the CCSD portion. I run it like this:

mpirun -np 2 nwchem dcc-singlet.nw | tee dcc-singlet.nwo

Output from two processor attempt: http://pastebin.com/fSs4X0Et

If I run with only one processor, the job lives longer but ultimately crashes with INVALID ARRAY HANDLE:

mpirun -np 1 nwchem dcc-singlet.nw | tee dcc-singlet.nwo

Output from quasi-serial attempt: http://pastebin.com/xp3ZwVak

I also tried decreasing the tile size, with no luck; I can provide that output too if you want to see it. Additionally, I tried the replicated, fortran, and eaf IO options, each with one and two processors, and they all failed in a similar manner, though I didn't save all of their outputs. Are the disk-based IO schemes currently unsupported? I did a find/xargs/grep search through the QA directory and didn't find a single TCE test using disk-based IO.
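In case anyone wants to repeat that search, here is a minimal Python sketch of the same idea. It is not the exact find/xargs/grep pipeline I ran. The function names are made up, the keyword set only covers the non-GA io options mentioned in this post, and it assumes test inputs are *.nw files with the io keyword inside a tce ... end block.

```python
from pathlib import Path

# The non-GA io keywords tried in this post (not a complete list).
DISK_IO_KEYWORDS = {"sf", "replicated", "fortran", "eaf"}


def disk_io_keywords(deck_text):
    """Return the io keywords from DISK_IO_KEYWORDS found inside tce blocks."""
    found = []
    in_tce = False
    for line in deck_text.splitlines():
        # Drop trailing comments, then compare case-insensitively.
        tokens = line.split("#", 1)[0].lower().split()
        if not tokens:
            continue
        if tokens[0] == "tce":
            in_tce = True
        elif tokens[0] == "end":
            in_tce = False
        elif in_tce and tokens[0] == "io" and len(tokens) > 1:
            if tokens[1] in DISK_IO_KEYWORDS:
                found.append(tokens[1])
    return found


def scan(qa_root):
    """Yield (path, keywords) for each *.nw file under qa_root that matches."""
    for path in sorted(Path(qa_root).rglob("*.nw")):
        keywords = disk_io_keywords(path.read_text(errors="replace"))
        if keywords:
            yield path, keywords
```

Calling list(scan("QA")) from the top of the source tree (the directory name is an assumption about where you run it) should list any test input that selects one of these io schemes.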