Hi all,
I'm trying to run a CCSD(T) calculation, but I keep running into the following problem:
(rank:0 hostname:an-24 pid:71069):ARMCI DASSERT fail. ../../ga-5.6.3/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
I'm using NWChem-6.8 (https://github.com/nwchemgit/nwchem/archive/v6.8-release.tar.gz).
Here is my job script:
#!/bin/bash
#SBATCH -N 8
#SBATCH --mem 400000
#SBATCH --ntasks-per-node=40
#SBATCH -t 0-8:00 # time (D-HH:MM)
module load OpenMPI
export NWCHEM_TOP=${HOME}/nwchem-6.8-release
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX_UBOUND=368640
export ARMCI_DEFAULT_SHMMAX=368640
export USE_MPI=y
export NWCHEM_MODULES=all
export USE_MPIF=y
export USE_MPIF4=y
export USE_INTERNALBLAS=y
export SCALAPACK_SIZE=8
export BLAS_SIZE=8
export LAPACK_SIZE=8
export PYTHONHOME=/usr
export PYTHONVERSION=2.7
export USE_PYTHONCONFIG=Y
export USE_CPPRESERVE=y
export USE_NOFSCHECK=y
export TCE_CUDA=y
export CUDA_INCLUDE="-I${CUDA_ROOT}/include"
export CUDA_LIBS="${CUDA_ROOT}/lib64/libcublas.so ${CUDA_ROOT}/lib64/libcudart.so"
export GPU_ARCH="sm_70"
export IPCCSD=y
export EACCSD=y
export MRCC_METHODS=y
export USE_OPENMP=y
unset MPI_INCLUDE
unset MPI_LIB
unset LIBMPI
unset GA_DEV
export OMP_NUM_THREADS=1
export KMP_AFFINITY=scatter
mpirun -n 320 -npernode 40 ${NWCHEM_TOP}/bin/LINUX64/nwchem ${NWCHEM_TOP}/run/input.nw
And here is the (partial) output:
argument 1 = /nwchem-6.8-release/run/input.nw
NWChem w/ OpenMP: maximum threads = 1
============================== echo of input deck ==============================
title "uracil-6-31-Gs"
echo
start uracil-6-31-Gs
memory stack 2500 mb heap 300 mb global 5000 mb noverify
basis cartesian
* library 6-31G*
end
scf
thresh 1.0e-10
tol2e 1.0e-10
singlet
rhf
end
tce
freeze atomic
ccsd(t)
tilesize 24
2eorb
2emet 13
attilesize 40
thresh 1.0d-1
cuda 6
end
task tce energy
================================================================================
Job information
---------------
hostname = an-24
program = /nwchem-6.8-release/bin/LINUX64/nwchem
date = Tue Jun 26 10:48:33 2018
compiled = Mon_Jun_25_21:44:59_2018
source = /nwchem-6.8-release
nwchem branch = 6.8
nwchem revision = N/A
ga revision = ga-5.6.3
use scalapack = F
input = /nwchem-6.8-release/run/input.nw
prefix = uracil-6-31-Gs.
data base = ./uracil-6-31-Gs.db
status = startup
nproc = 320
time left = -1s
Memory information
------------------
heap = 39321596 doubles = 300.0 Mbytes
stack = 327680001 doubles = 2500.0 Mbytes
global = 655360000 doubles = 5000.0 Mbytes (distinct from heap & stack)
total = 1022361597 doubles = 7800.0 Mbytes
verify = no
hardfail = no
Directory information
---------------------
0 permanent = .
0 scratch = .
NWChem Input Module
-------------------
uracil-6-31-Gs
--------------
Scaling coordinates for geometry "geometry" by 1.889725989
(inverse scale = 0.529177249)
Turning off AUTOSYM since
SYMMETRY directive was detected!
------
auto-z
------
autoz: The atoms group into disjoint clusters
cluster 1: 1 2 3 4 5 6 7 8 9 10 11
12
cluster 2: 13 14 15 16 17 18 19 20 21 22 23
24
cluster 3: 25 26 27 28 29 30 31 32 33 34 35
36
1 autoz failed with cvr_scaling = 1.2 changing to 1.3
autoz: The atoms group into disjoint clusters
cluster 1: 1 2 3 4 5 6 7 8 9 10 11
12
cluster 2: 13 14 15 16 17 18 19 20 21 22 23
24
cluster 3: 25 26 27 28 29 30 31 32 33 34 35
36
2 autoz failed with cvr_scaling = 1.3 changing to 1.4
autoz: The atoms group into disjoint clusters
cluster 1: 1 2 3 4 5 6 7 8 9 10 11
12
cluster 2: 13 14 15 16 17 18 19 20 21 22 23
24
cluster 3: 25 26 27 28 29 30 31 32 33 34 35
36
3 autoz failed with cvr_scaling = 1.4 changing to 1.5
autoz: The atoms group into disjoint clusters
cluster 1: 1 2 3 4 5 6 7 8 9 10 11
12
cluster 2: 13 14 15 16 17 18 19 20 21 22 23
24
cluster 3: 25 26 27 28 29 30 31 32 33 34 35
36
4 autoz failed with cvr_scaling = 1.5 changing to 1.6
autoz: The atoms group into disjoint clusters
cluster 1: 1 2 3 4 5 6 7 8 9 10 11
12
cluster 2: 13 14 15 16 17 18 19 20 21 22 23
24
cluster 3: 25 26 27 28 29 30 31 32 33 34 35
36
5 autoz failed with cvr_scaling = 1.6 changing to 1.7
warning. autoz generated 7 bonds for atom 1
warning. autoz generated 7 bonds for atom 13
warning. autoz generated 7 bonds for atom 25
autoz: The atoms group into disjoint clusters
cluster 1: 1 2 3 4 5 6 7 8 9 10 11
12
cluster 2: 13 14 15 16 17 18 19 20 21 22 23
24
cluster 3: 25 26 27 28 29 30 31 32 33 34 35
36
AUTOZ failed to generate good internal coordinates.
Cartesian coordinates will be used in optimizations.
....
General Information
-------------------
Number of processors : 320
Wavefunction type : Restricted Hartree-Fock
No. of electrons : 174
Alpha electrons : 87
Beta electrons : 87
No. of orbitals : 768
Alpha orbitals : 384
Beta orbitals : 384
Alpha frozen cores : 24
Beta frozen cores : 24
Alpha frozen virtuals : 0
Beta frozen virtuals : 0
Spin multiplicity : singlet
Number of AO functions : 384
Number of AO shells : 168
Use of symmetry is : off
Symmetry adaption is : off
Schwarz screening : 0.10D-09
Correlation Information
-----------------------
Calculation type : Coupled-cluster singles & doubles w/ perturbation
Perturbative correction : (T)
Max iterations : 100
Residual threshold : 0.10D+00
T(0) DIIS level shift : 0.00D+00
L(0) DIIS level shift : 0.00D+00
T(1) DIIS level shift : 0.00D+00
L(1) DIIS level shift : 0.00D+00
T(R) DIIS level shift : 0.00D+00
T(I) DIIS level shift : 0.00D+00
CC-T/L Amplitude update : 5-th order DIIS
I/O scheme : Global Array Library
L-threshold : 0.10D+00
EOM-threshold : 0.10D+00
no EOMCCSD initial starts read in
TCE RESTART OPTIONS
READ_INT: F
WRITE_INT: F
READ_TA: F
WRITE_TA: F
READ_XA: F
WRITE_XA: F
READ_IN3: F
WRITE_IN3: F
SLICE: F
D4D5: F
Memory Information
------------------
Available GA space size is ********** doubles
Available MA space size is 366945284 doubles
Maximum block size supplied by input
Maximum block size 24 doubles
tile_dim = 23
Block Spin Irrep Size Offset Alpha
-------------------------------------------------
1 alpha a 21 doubles 0 1
2 alpha a 21 doubles 21 2
3 alpha a 21 doubles 42 3
4 beta a 21 doubles 63 1
5 beta a 21 doubles 84 2
6 beta a 21 doubles 105 3
7 alpha a 22 doubles 126 7
8 alpha a 23 doubles 148 8
9 alpha a 23 doubles 171 9
10 alpha a 23 doubles 194 10
11 alpha a 23 doubles 217 11
12 alpha a 23 doubles 240 12
13 alpha a 22 doubles 263 13
14 alpha a 23 doubles 285 14
15 alpha a 23 doubles 308 15
16 alpha a 23 doubles 331 16
17 alpha a 23 doubles 354 17
18 alpha a 23 doubles 377 18
19 alpha a 23 doubles 400 19
20 beta a 22 doubles 423 7
21 beta a 23 doubles 445 8
22 beta a 23 doubles 468 9
23 beta a 23 doubles 491 10
24 beta a 23 doubles 514 11
25 beta a 23 doubles 537 12
26 beta a 22 doubles 560 13
27 beta a 23 doubles 582 14
28 beta a 23 doubles 605 15
29 beta a 23 doubles 628 16
30 beta a 23 doubles 651 17
31 beta a 23 doubles 674 18
32 beta a 23 doubles 697 19
Global array virtual files algorithm will be used
Parallel file system coherency ......... OK
Integral file = ./uracil-6-31-Gs.aoints.000
Record size in doubles = 65536 No. of integs per rec = 32766
Max. records in memory = 263 Max. records in file = 7474008
No. of bits per label = 16 No. of bits per value = 64
#quartets = 9.792D+06 #integrals = 1.193D+08 #direct = 0.0% #cached =100.0%
File balance: exchanges= 443 moved= 598 time= 0.1
Fock matrix recomputed
1-e file size = 129600
1-e file name = ./uracil-6-31-Gs.f1
Cpu & wall time / sec 0.8 1.0
4-electron integrals stored in orbital form
v2 file size = 2387981089
4-index algorithm nr. 13 is used
imaxsize = 40
imaxsize ichop = 0
0: error ival=4
(rank:0 hostname:an-24 pid:71069):ARMCI DASSERT fail. ../../ga-5.6.3/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
80: error ival=10
(rank:80 hostname:an-26 pid:11116):ARMCI DASSERT fail. ../../ga-5.6.3/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
40: error ival=4
(rank:40 hostname:an-25 pid:67763):ARMCI DASSERT fail. ../../ga-5.6.3/armci/src/devices/openib/openib.c:armci_call_data_server():2209 cond:(pdscr->status==IBV_WC_SUCCESS)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 80 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
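For reference, here is a quick sanity check of the memory settings above, as a rough sketch only. It assumes that the NWChem memory directive applies per process (consistent with the 7800.0 Mbytes total in the output) and that Slurm reads a bare --mem value as megabytes per node. The variable names are just for illustration. It only checks the arithmetic and says nothing about the cause of the ARMCI failure.

```shell
#!/bin/bash
# Per-node memory implied by the NWChem "memory" directive versus the
# Slurm --mem request. All values are copied from the job script and input deck.
stack_mb=2500          # memory stack 2500 mb
heap_mb=300            # memory heap 300 mb
global_mb=5000         # memory global 5000 mb
ranks_per_node=40      # --ntasks-per-node=40
slurm_mem_mb=400000    # --mem 400000 (assumed to mean MB per node)

# Memory requested by one NWChem process, then by all processes on a node.
per_rank_mb=$(( stack_mb + heap_mb + global_mb ))
per_node_mb=$(( per_rank_mb * ranks_per_node ))

# What is left of the Slurm allocation after NWChem's own request.
headroom_mb=$(( slurm_mem_mb - per_node_mb ))

echo "per rank: ${per_rank_mb} MB"
echo "per node: ${per_node_mb} MB"
echo "headroom: ${headroom_mb} MB"
```

So the NWChem request by itself fits within the Slurm allocation on each node, with the remainder available for everything else (MPI buffers, the OS, and so on).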
Has anyone encountered something similar, or does anyone know how to fix this?
Any help is greatly appreciated.
Pav