error ival=5


Clicked A Few Times
Anyone know how to parse this information?

...
      Symmetry analysis of basis
      --------------------------

        a1         79
        a2         22
        b1         45
        b2         48

8: error ival=5
(rank:8 hostname:n1430 pid:5979):ARMCI DASSERT fail. ../../ga-5-3/armci/src/devices/openib/openib.c:armci_send_complete():459 cond:(pdscr->status==IBV_WC_SUCCESS)


This happened at the start of a Hartree-Fock calculation:

  functions       =   194
  atoms           =    12
  alpha electrons =    46
  beta  electrons =    46
  charge          =   0.00
  wavefunction    = UHF
  input vectors   = atomic
  output vectors  = /scratch/cchang/WsZqXn/perm/uhf_singlet.movecs
  use symmetry    = T
  symmetry adapt  = T


16 ranks spread over 2 nodes. Curiously, this doesn't happen on every set of nodes, so I'm prepared to believe it's related to hardware, but I don't know what to tell our admins to look for.

Thanks
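
On what to tell the admins: the assertion line already names the rank, hostname, and PID. The failed condition checks pdscr->status against IBV_WC_SUCCESS inside armci_send_complete() in openib.c, which suggests an InfiniBand send did not complete successfully on that host. Below is a minimal Python sketch that tallies failures per hostname across a job's output. It assumes every failure line follows the format of the one quoted above; other GA/ARMCI builds may print something different. The function names are hypothetical helpers, not part of NWChem or ARMCI.

```python
import re
from collections import Counter

# Pattern modelled on the single line quoted in this thread, e.g.
# (rank:8 hostname:n1430 pid:5979):ARMCI DASSERT fail. <file>:<func>():<line> cond:(...)
ARMCI_RE = re.compile(
    r"\(rank:(\d+) hostname:(\S+) pid:(\d+)\):ARMCI DASSERT fail\. "
    r"(\S+?):(\w+)\(\):(\d+) cond:(.*)"
)

def parse_armci_failures(text):
    """Return one dict per ARMCI assertion line found in the text."""
    failures = []
    for line in text.splitlines():
        m = ARMCI_RE.search(line)
        if m:
            rank, host, pid, src, func, lineno, cond = m.groups()
            failures.append({
                "rank": int(rank),
                "hostname": host,
                "pid": int(pid),
                "source": src,
                "function": func,
                "line": int(lineno),
                "cond": cond.strip(),
            })
    return failures

def failures_by_host(text):
    """Count assertion failures per hostname."""
    return Counter(f["hostname"] for f in parse_armci_failures(text))
```

Running failures_by_host over the combined output of several failed jobs would show whether the same few hosts keep turning up, which is a more concrete thing to hand to the admins than the raw assertion.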

Forum Vet
Chris
Does this calculation use much memory?

Clicked A Few Times
Hi Edo,

Not at this stage. The failure is happening before the Hartree-Fock calculation even starts.


I hope 12 atoms and 194 basis functions don't test memory limits. I specified a total of 3600 MB per process, which should be much more than needed here (though perhaps not for the follow-up CCSD(T)).

Thanks; Chris

Clicked A Few Times
Update
- I am now seeing ival=12, rather than ival=5
- The failures are all occurring on a node other than the one hosting ranks 0 through n (for testing, I have n=7).
- stderr shows the last system-level error as "Bad address" (EFAULT). strace shows some errors like this coming from sched_setaffinity.

Clicked A Few Times
OK, I suspect the ival=12 error may have arisen from some problems with threaded MKL.
After re-linking with serial MKL, I am now back to the ival=5 error, with the last system error reported as "No such file or directory".

When tracing the processes, the last such system errors I see look like a missing scratch file:
...
7441 stat("/scratch/cchang/WsZqXn/perm/Fe_CO4_H2O_CCSDpTp.dir_check_p.15", 0x7fffd49a1d10) = -1 ENOENT (No such file or directory)
7441 stat("/scratch/cchang/WsZqXn/perm/Fe_CO4_H2O_CCSDpTp.dir_check_p.15", 0x7fffd49a1dd0) = -1 ENOENT (No such file or directory)
7441 access("/scratch/cchang/WsZqXn/perm/Fe_CO4_H2O_CCSDpTp.dir_check_p.15", F_OK) = -1 ENOENT (No such file or directory)
7441 stat("/scratch/cchang/WsZqXn/scr/Fe_CO4_H2O_CCSDpTp.dir_check_s.15", 0x7fffd49a1d10) = -1 ENOENT (No such file or directory)
7441 stat("/scratch/cchang/WsZqXn/scr/Fe_CO4_H2O_CCSDpTp.dir_check_s.15", 0x7fffd49a1dd0) = -1 ENOENT (No such file or directory)
7441 access("/scratch/cchang/WsZqXn/scr/Fe_CO4_H2O_CCSDpTp.dir_check_s.15", F_OK) = -1 ENOENT (No such file or directory)

Is this expected behavior?
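
To see which paths a traced process looked for and did not find, the failed calls can be summarised by path. Below is a minimal Python sketch, assuming strace output in the "PID syscall(args) = -1 ERRNO (message)" layout shown above, with the path as the first quoted argument. The helper names are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# 7441 stat("/some/path", 0x7fff...) = -1 ENOENT (No such file or directory)
STRACE_RE = re.compile(
    r'^(\d+)\s+(\w+)\("([^"]*)".*\)\s+=\s+-1\s+(E[A-Z0-9]+)\s+\((.*)\)\s*$'
)

def failed_path_calls(trace_text):
    """Return (pid, syscall, path, errno_name) for each failed call on a path."""
    results = []
    for line in trace_text.splitlines():
        m = STRACE_RE.match(line.strip())
        if m:
            pid, syscall, path, err, _msg = m.groups()
            results.append((int(pid), syscall, path, err))
    return results

def missing_paths(trace_text):
    """Count ENOENT failures per path."""
    return Counter(
        path
        for _pid, _syscall, path, err in failed_path_calls(trace_text)
        if err == "ENOENT"
    )
```

A path that fails with ENOENT only on some nodes would point at a scratch directory that is not visible or not created there. A path that fails identically on every rank is more likely a routine existence check.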

Clicked A Few Times
OK, this seems to be resolved now. I had forgotten to apply the patches for the ival=5 error. The ival=12 error is still mysterious, but it may be linked either to the math library handling or to something on our cluster (the failures were occurring only on Xeon Phi nodes, and only on some of those nodes).

