TCE: SF READ WAIT ERROR CODE = -1990


Hi,

I have been running some large test TCE CCSD jobs with

2eorb
2emet 6
idiskx 1


set (a minimal sketch of where these sit in the tce block follows the error output below). Smaller jobs ran OK, but after increasing the number of electrons and basis functions (by adding another molecule while keeping the number of non-frozen orbitals constant), I ran into the error:

[code]
size_4a_m(i) 25 900429304
-----------------------
after sf_create
SF_READ_WAIT ERROR CODE = -1990
FU 42 264378 8654107050 3577815
zones put: sf problem2x1        1
------------------------------------------------------------------------
[/code]
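For context, here is roughly how those three directives sit in the tce block. This is only a sketch, not my actual deck; the geometry, basis, memory, and freeze settings are omitted:

[code]
tce
  ccsd
  2eorb
  2emet 6
  idiskx 1
end
task tce energy
[/code]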

I can trace the error to one of several TCE files. Taking tce_mo2e_disk_2eorb.F as an example, it looks like an sf_write call is failing. tools/pario/sf/shared.files.c shows that this routine calls elio_awrite, defined in pario/elio/elio.c, which contains the following:

[code]
...
#elif defined(CRAY)
      rc = WRITEA(fd->fd, (char*)buf, bytes, &cb_fout[aio_i].stat, DEFARG);
      stat = (rc < 0) ? -1 : 0;
...
#else
      stat = aio_write(cb_fout+aio_i);
[/code]

However, for the build, it seems like only CRAYXT is defined, not CRAY.

1. Does the error I'm seeing look like it could be caused by calling aio_write rather than WRITEA? What would a return code of -1990 imply?
2. Should I modify the source so that elio.c tests for "CRAYXT" rather than "CRAY", enabling the Cray WRITEA call instead of aio_write?


Thanks,
Chris

Dear Chris,
I would not recommend the use of shared-file algorithms on the Cray system.

Instead, please use the GA-supported algorithms

2eorb
2emet 13

or

2eorb
2emet 14
split 2


They provide better parallel performance (and do not depend on the performance of the file system installed on your machine).
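For example, with the second choice the tce block would contain roughly the following (just a sketch; the rest of your input stays as it is):

[code]
tce
  ccsd
  2eorb
  2emet 14
  split 2
end
[/code]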

By the way, how big is the system (in terms of the number of basis functions and correlated electrons)?

Best,
Karol

Hi Karol,

Right now, the test system has 1095 basis functions, 90 electrons, and 15 atoms ((H2S)5 with aug-cc-pVQZ). However, there are 35 linearly dependent vectors, so effectively 1060 MOs. I am keeping the 42 lowest and 1009 highest orbitals "frozen" (inactive would be the more precise term, I guess), so effectively a [6,9] active space.
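(Spelling out the arithmetic: 1060 - 42 - 1009 = 9 active orbitals, and 90 - 2*42 = 6 active electrons, hence [6,9].)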

The purpose is just to get memory requirements for a large calculation on my actual system, where I can systematically expand the active region up to the resource limits. I tried calculations without the 2eorb or 2emet options, but they just segfaulted. Using the disk allowed an (H2S)4 trial to finish, and my assumption was that I could estimate the global memory requirement for an in-core GA algorithm 13 or 14 job from the disk space used (1.2 TB for that smaller (H2S)4 job), without having to set aside a massive number of nodes only to segfault once the job finally made it through the queue.
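(As a back-of-envelope illustration of that logic, and assuming, purely as a guess, that the in-core algorithm needs about the same aggregate GA space as the SF file did and that roughly 64 GB of GA memory is usable per node: 1.2 TB / 64 GB comes to about 19 nodes for the (H2S)4 case.)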

So, I intend to switch to a better algorithm once I know the approximate memory requirement, but if I move to 13 or 14 now (in-core, according to the documentation) without knowing the memory requirement, I'll just end up segfaulting, won't I?

