I installed NWChem 6.1 on our Infiniband cluster. Everything seems to work well, except for the TCE module when using an I/O scheme other than GA. For example, the following input (obtained from a test in the QA directory) works fine:
start tce_ccsd_t_h2o
echo
geometry units bohr
O 0.00000000 0.00000000 0.22138519
H 0.00000000 -1.43013023 -0.88554075
H 0.00000000 1.43013023 -0.88554075
end
basis spherical
H library cc-pVDZ
O library cc-pVDZ
end
scf
thresh 1.0e-10
tol2e 1.0e-10
singlet
rhf
end
tce
ccsd(t)
io ga
end
task tce energy
However, if I use, for instance, "io sf" instead of "io ga", NWChem crashes:
Global files accessible by all nodes assumed
1:Floating Point Exception error, status=: 8
(rank:1 hostname:login01 pid:10626):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigFpeHandler():249 cond:0
Last System Error Message from Task 1:: No such file or directory
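For reference, the failing run differs from the working input above only in the tce block; everything else is unchanged:

```
tce
ccsd(t)
io sf
end
```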
The debug output shows that the crash occurs during/after "node 1 put_block request to file: -3000 size: 2 offset: 0".
I added some more debug statements in the code, which revealed that the actual crash occurs when ga_lock is being called within put_block.F.
Without MPI, this does not happen, since the lock is not needed in the case of serial execution. However, NWChem still crashes at a later point:
node 0 add_block request to file: -2993 size: 1 offset: 10
0:0:nga_get_common:nga_get_common: INVALID ARRAY HANDLE:: -2993
(rank:0 hostname:login01 pid:13361):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
Last System Error Message from Task 0:: No such file or directory
According to the debug output, several add_block requests to the same file handle were successful just before the crash...
Does anyone have an idea what could be causing these errors? I have already tried recompiling with GCC instead of the Intel compiler, with and without MPI support, and with and without Infiniband support, among other things, but none of it made any difference. Changing the location of the scratch and permanent directories did not help either.