Segmentation fault in NWChem frequency calculation.


Clicked A Few Times
Dear All,

I am getting the following errors with a DFT frequency calculation:

 texas integral default override: limxmem =              25304282


Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.




mpirun noticed that process rank 15 with PID 68544 on node sh-28-03 exited on signal 11 (Segmentation fault).


"nwchem.out.30231958" 34442L, 2260833C

Here is the input file:


start U-2a-2h2O-quintet

echo

# scratch_dir /scratch
ecce_print ecce.out
# permanent_dir /home/zzhang/nwchem-runs/actinides/uo2-ua1

# memory stack 1300 mb heap 100 mb global 600 mb noverify
# memory 2000 mb
charge 0
set int:txs:limxmem 25304282
geometry noautoz
U  -0.03322831    -0.00250651     0.02347074
O 0.36491105 0.46274428 2.46041270
O -1.75536397 -1.24367265 -0.76829477
N -1.83614918 -2.60114656 -1.07974077
O 1.85241176 -1.02412680 1.18663402
N 2.16069598 -2.36741277 1.30490953
N 0.19752645 -2.55115378 0.05165579
C -0.76182355 -3.26711518 -0.61411501
C -0.65292775 -4.75886930 -0.78681618
H -0.12186737 -4.98631746 -1.72929450
H -1.66021638 -5.18722545 -0.87893096
C 0.12547663 -5.36746346 0.40398724
H 0.29620510 -6.44040251 0.23596450
H -0.48116837 -5.27596649 1.32037541
C 1.47386517 -4.63909346 0.60332396
H 2.16041693 -4.88970332 -0.22602739
H 1.97718325 -4.95497061 1.52802260
C 1.26649155 -3.14677901 0.64084231
O 0.40539882 -0.46908562 -2.40657442
O 1.86045762 1.03243618 -1.11644779
N 2.16573408 2.37722177 -1.22321977
O -1.77280939 1.22798442 0.79390466
N -1.86663784 2.58504863 1.10367529
N 0.18409993 2.54758384 0.00281776
C 1.25921303 3.14990926 -0.56840317
C 1.45947137 4.64281936 -0.52186055
H 2.12754941 4.89399485 0.32235391
H 1.98026358 4.96351326 -1.43513744
C 0.10367696 5.36387381 -0.34774641
H 0.26499417 6.43770884 -0.17607430
H -0.48480168 5.26938731 -1.27561341
C -0.69369509 4.75027362 0.82775952
H -0.18342467 4.98207794 1.78058794
H -1.70561863 5.17157449 0.89925992
C -0.78942988 3.25758367 0.65417663
H 1.25074923 0.02080972 -2.58862922
H -0.00949654 -0.92939338 -3.16091198
H 1.21159388 -0.01836030 2.65742333
H -0.06834734 0.91799285 3.20754272
end


# set geometry:actlist 7:12 20:31


BASIS spherical
U library "Stuttgart_RLC_ECP"
N library "Stuttgart_RLC_ECP"
C library "Stuttgart_RLC_ECP"
O library "Stuttgart_RLC_ECP"
H library "DZVP_(DFT_Orbital)"
END

ECP
U library "Stuttgart_RLC_ECP"
N library "Stuttgart_RLC_ECP"
C library "Stuttgart_RLC_ECP"
O library "Stuttgart_RLC_ECP"
END

dft
xc xperdew91 perdew91
iterations 1600
CONVERGENCE ncydp 1600
CONVERGENCE ncyds 1600
CONVERGENCE ncysh 1600
CONVERGENCE damp 80
odft
mult 5
# vectors input uo2-2ah2-2h2-quintet.movecs
end

driver
MAXITER 600
end

task dft optimize

set fock:mirrmat f
task dft freq


Thank you!

Zhiyong

Forum Vet
Zhiyong
I have just run this job and it did not crash for me.
What version of NWChem have you been using? I strongly recommend using 6.8 or later.
Cheers, Edo

Clicked A Few Times
Thank you Edo. Here is the version number:

NWChem v6.8-47-gdf6c956 and GA ga-5.6.3

Could it be an out-of-memory issue? How much memory did you use?

Zhiyong





Clicked A Few Times
Here is the memory allocation I had

heap   =  983039996 doubles =   7500.0 Mbytes
stack  =  983040001 doubles =   7500.0 Mbytes
global = 1966080000 doubles =  15000.0 Mbytes (distinct from heap & stack)
total  = 3932159997 doubles =  30000.0 Mbytes
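
As a quick sanity check on the figures above, here is a minimal Python sketch that converts the reported counts of doubles into Mbytes. It assumes 8-byte doubles and 1 Mbyte = 1024 * 1024 bytes, which is what reproduces the numbers printed in this output; the function name is just illustrative.

```python
BYTES_PER_DOUBLE = 8          # assumption: 64-bit double precision words
BYTES_PER_MBYTE = 1024 * 1024  # assumption: binary megabytes, as the output implies


def doubles_to_mbytes(n_doubles: int) -> float:
    """Convert a count of doubles to Mbytes, rounded to one decimal place."""
    return round(n_doubles * BYTES_PER_DOUBLE / BYTES_PER_MBYTE, 1)


# Values copied from the memory summary quoted in this post.
reported = {
    "heap": 983039996,
    "stack": 983040001,
    "global": 1966080000,
    "total": 3932159997,
}

for name, n_doubles in reported.items():
    print(f"{name:6s} = {n_doubles:>10d} doubles = {doubles_to_mbytes(n_doubles):8.1f} Mbytes")

# heap + stack + global should add up to the reported total.
parts = reported["heap"] + reported["stack"] + reported["global"]
print("parts add up to total:", parts == reported["total"])
```

So the allocation amounts to roughly 30000 Mbytes for each process, not for the whole node.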

I also set the following:

export MA_USE_ARMCI_MEM="T"
# export ARMCI_DEFAULT_SHMMAX=40960
export ARMCI_DEFAULT_SHMMAX=4096
unset MA_USE_ARMCI_MEM

Could these settings make any difference?



Forum Vet
MA_USE_ARMCI_MEM
Please never use MA_USE_ARMCI_MEM="T"

Your memory line is way too big. This one is enough:

memory stack 1300 mb heap 100 mb global 2600 mb noverify

Clicked A Few Times
Do I still need "export ARMCI_DEFAULT_SHMMAX=4096" or should I get rid of it as well?

So that memory line adds up to 4 GB of memory? I requested 60 GB of memory per node and run 2 processes on each node, so each process has a 30 GB share; hence the memory line I used. Do I not need that much memory for a calculation of this size?
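
For reference, the arithmetic behind this question can be sketched in a few lines of Python. The sketch sums the stack, heap and global entries of the suggested memory line and compares the result with the per-process share of the node. It assumes the memory directive applies to each process, and it takes 60 GB as 60000 MB to match the 30000 Mbytes figure quoted earlier in the thread; the helper name is hypothetical.

```python
import re


def memory_line_total_mb(line: str) -> int:
    """Sum the stack, heap and global sizes (given in mb) of an NWChem memory line."""
    sizes = re.findall(r"(stack|heap|global)\s+(\d+)\s+mb", line, flags=re.IGNORECASE)
    return sum(int(value) for _, value in sizes)


# Memory line suggested in the reply above.
suggested = "memory stack 1300 mb heap 100 mb global 2600 mb noverify"

per_process_mb = memory_line_total_mb(suggested)

# Job layout described in this post (assumption: 1 GB counted as 1000 MB).
node_memory_mb = 60 * 1000
processes_per_node = 2
share_per_process_mb = node_memory_mb // processes_per_node

print("requested per process (MB):", per_process_mb)
print("requested per node (MB):   ", per_process_mb * processes_per_node)
print("available per process (MB):", share_per_process_mb)
```

Under these assumptions the suggested line asks for 4000 MB per process, well below the 30000 MB share that each process could claim.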

Forum Vet
No need to set ARMCI_DEFAULT_SHMMAX

Clicked A Few Times
Edo,

I am getting the following errors. Is the armci_set_mem_offset warning okay? Is the segmentation fault more likely a bug or a memory allocation issue? Perhaps I need to compile NWChem differently?

Thanks!

Zhiyong

Parallel integral file used   23377 records with       0 large values

 texas integral default override: limxmem =              25304282
9: WARNING:armci_set_mem_offset: offset changed -665873895424 to -665962270720
11: WARNING:armci_set_mem_offset: offset changed 81091211264 to 81002958848
12: WARNING:armci_set_mem_offset: offset changed -245720936448 to -245809270784
13: WARNING:armci_set_mem_offset: offset changed 110652801024 to 110564409344
15: WARNING:armci_set_mem_offset: offset changed -145046306816 to -145134944256
7: WARNING:armci_set_mem_offset: offset changed 523750916096 to 523662651392
1: WARNING:armci_set_mem_offset: offset changed 44323131392 to 44234817536
3: WARNING:armci_set_mem_offset: offset changed 331651493888 to 331563155456
5: WARNING:armci_set_mem_offset: offset changed -156872515584 to -156960903168


Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.




mpirun noticed that process rank 12 with PID 51046 on node sh-17-25 exited on signal 11 (Segmentation fault).
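
To see at a glance which processes emitted the armci_set_mem_offset warning and by how much each offset moved, a log like the one above can be scanned with a short Python sketch. The regular expression is written against the lines as they appear in this post, and it assumes the leading integer is the rank of the reporting process; the function name is hypothetical.

```python
import re

# Pattern matching the warning lines exactly as they appear in the log above.
WARNING_RE = re.compile(
    r"^\s*(\d+): WARNING:armci_set_mem_offset: offset changed (-?\d+) to (-?\d+)"
)


def parse_offset_warnings(log_text: str) -> list[tuple[int, int]]:
    """Return (rank, new_offset - old_offset) for every warning line found."""
    results = []
    for line in log_text.splitlines():
        match = WARNING_RE.match(line)
        if match:
            rank, old, new = (int(group) for group in match.groups())
            results.append((rank, new - old))
    return results


# Two lines copied from the log in this post.
sample = """\
9: WARNING:armci_set_mem_offset: offset changed -665873895424 to -665962270720
11: WARNING:armci_set_mem_offset: offset changed 81091211264 to 81002958848
"""

for rank, delta in parse_offset_warnings(sample):
    print(f"rank {rank:3d}: offset moved by {delta} bytes")
```

This only summarizes the log; it does not by itself say whether the warnings are related to the segmentation fault.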

Forum Vet
need ARMCI_DEFAULT_SHMMAX
I am sorry about my last tip.
Your error log clearly shows that you need to set ARMCI_DEFAULT_SHMMAX
Please try ARMCI_DEFAULT_SHMMAX=8192

Clicked A Few Times
Thanks Edo. I just saw your reply and will try that when I have a chance.

Actually I got my calculations working by doing two things:

(1) ulimit -s unlimited (our default is not unlimited)
(2) using more processes and increasing the global memory allocation for each process.

I guess it is a combination of (1) the stack limit of the compute node, (2) the size of the global memory allocation and (3) ARMCI_DEFAULT_SHMMAX.

I never had a clear understanding of how to allocate these optimally. It probably matters more when the available memory is limited, and less when running on a system with adequate memory.
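
Since the stack limit turned out to matter here, a small Python sketch like the following can report the limit that a job actually sees. It uses the standard `resource` module (Unix only) and would need to be run inside the same batch environment as the NWChem job, because login nodes and compute nodes may have different defaults; the function name is hypothetical.

```python
import resource


def describe_stack_limit() -> str:
    """Report the soft and hard stack limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

    def fmt(value: int) -> str:
        # RLIM_INFINITY is what "ulimit -s unlimited" corresponds to.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} kB"

    return f"soft={fmt(soft)} hard={fmt(hard)}"


print(describe_stack_limit())
```

If the soft limit is not reported as unlimited, adding `ulimit -s unlimited` to the job script before the mpirun line is the change described above.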



