Error Running CPHF Module


Clicked A Few Times
Dear All,
I am running NWChem 6.5 on a Cray system whose release information reads:
Mazama Tue Feb 21 03:35:44 CST 2012 on hssbld4 by bwdev lsb-cray-mazama-6.0.0
SUSE Linux Enterprise Server 11 (x86_64) VERSION = 11
While running the CPHF module for an analytic frequency calculation,
the run terminates without any error message during the 2e second-derivative step of the Hessian calculation.
However, I can successfully compute the frequencies with the same input
on an x86_64 workstation running Ubuntu 12.04.4 LTS (Precise Pangolin)
after allocating 5000 mb total memory. I would be grateful if someone could suggest where things are going wrong.
Thanks in advance.

Forum Vet
Could you be more specific on the failures observed on the Cray system?
Are you sure that nothing was reported in the error log? Any error code reported by PBS?
What about the job being killed by the OOM (Out of memory) kernel component?

Clicked A Few Times
Thanks for the reply. When the run on the Cray system was allocated a total memory of 2000 mb,
the last few lines of the NWChem.6.5.e29089 error file were the following:
...................................................................................................................................................................................
libhugetlbfs [nid00051:8126]: WARNING: New heap segment map at 0x10067800000 failed: Cannot allocate memory
[NID 00051] 2014-11-12 03:17:16 Apid 109986: initiated application termination
[NID 00051] 2014-11-12 14:39:25 Apid 109986: OOM killer terminated this process.


However, when the run on the Cray system was allocated a total memory of 5000 mb,
it terminated immediately, and the nwc.out file contained this message:
gethugepagesize()=8388608
hugetlb_default_page_size=8388608
_SC_PAGESIZE=4096
comex_page_size=8388608
comex_is_using_huge_pages=1
malloc_is_using_huge_pages=1
argument  1 = nwc.res.freq.nw
nwchem.F: ma_init failed (ga_uses_ma=F) 911
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------


and the last few lines of the error file are:

Rank 11 [Wed Nov 12 03:23:17 2014] [c0-0c1s6n1] application called MPI_Abort(comm=0x84000002, 911) - process 11
libhugetlbfs [nid00051:8249]: WARNING: New heap segment map at 0x10007000000 failed: Cannot allocate memory
...........................................................................................................................................................................................
libhugetlbfs [nid00051:8235]: WARNING: New heap segment map at 0x10007000000 failed: Cannot allocate memory
_pmiu_daemon(SIGCHLD): [NID 00051] [c0-0c1s6n1] [Wed Nov 12 03:23:17 2014] PE RANK 11 exit signal Aborted
[NID 00051] 2014-11-12 03:31:10 Apid 110084: initiated application termination

Would you be so kind as to help me find out where things are going wrong? Thanks.

Forum Vet
Since the 2 GB run was killed by the OOM killer ([NID 00051] 2014-11-12 14:39:25 Apid 109986: OOM killer terminated this process.),
increasing the memory to 5 GB will only make things worse.
Could you please post your input file?
Thanks, Edo
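
For anyone checking their own job's error file for the same symptoms, here is a minimal sketch of a log scan in Python. The patterns are copied from the messages quoted in this thread; other systems may word them differently, and the function name `scan_log` is made up for illustration.

```python
import re

# Patterns taken from the error output quoted in this thread.
# Assumption: other Cray/ALPS or Linux setups may phrase these differently.
PATTERNS = {
    "oom_killer": re.compile(r"OOM killer terminated this process"),
    "hugepage_alloc_failure": re.compile(r"libhugetlbfs .*Cannot allocate memory"),
    "mpi_abort": re.compile(r"application called MPI_Abort"),
}


def scan_log(lines):
    """Count how many lines match each memory-related pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample lines copied from the 2000 mb run reported above.
sample = [
    "libhugetlbfs [nid00051:8126]: WARNING: New heap segment map at "
    "0x10067800000 failed: Cannot allocate memory",
    "[NID 00051] 2014-11-12 03:17:16 Apid 109986: initiated application termination",
    "[NID 00051] 2014-11-12 14:39:25 Apid 109986: OOM killer terminated this process.",
]

for name, count in scan_log(sample).items():
    print(f"{name}: {count}")
```

To scan a real file, pass an open file object instead of the sample list, for example `scan_log(open("NWChem.6.5.e29089"))`.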

Clicked A Few Times
Thanks for the reply. Please find the input files below.
The first file is the parent NWChem input; the second is the restart file used to compute the frequencies analytically.



title "Au20 SDD ECP Mult 1"
memory total 2000 mb
echo
start au20
charge 0
geometry units bohr
symmetry Td print
Au -2.11667583358231 -2.11667583358231 -2.11667583358231
Au 2.11667583358231 2.11667583358231 -2.11667583358231
Au -1.78967191122259 1.78967191122259 -5.76900626570761
Au -2.11667583358231 2.11667583358231 2.11667583358231
Au -1.78967191122259 5.76900626570761 -1.78967191122259
Au -5.76900626570761 1.78967191122259 -1.78967191122259
Au 1.78967191122259 -1.78967191122259 -5.76900626570761
Au -5.41833083216091 5.41833083216091 -5.41833083216091
Au -5.76900626570761 -1.78967191122259 1.78967191122259
Au 5.41833083216091 -5.41833083216091 -5.41833083216091
Au 1.78967191122259 1.78967191122259 5.76900626570761
Au 5.76900626570761 1.78967191122259 1.78967191122259
Au 5.41833083216091 5.41833083216091 5.41833083216091
Au -1.78967191122259 -1.78967191122259 5.76900626570761
Au 2.11667583358231 -2.11667583358231 2.11667583358231
Au -5.41833083216091 -5.41833083216091 5.41833083216091
Au -1.78967191122259 -5.76900626570761 1.78967191122259
Au 1.78967191122259 -5.76900626570761 -1.78967191122259
Au 5.76900626570761 -1.78967191122259 -1.78967191122259
Au 1.78967191122259 5.76900626570761 1.78967191122259
end
basis "ao basis" spherical PRINT REL
Au library stuttgart_rsc_1997_ecp
end
ECP
Au library stuttgart_rsc_1997_ecp
END
dft
convergence damp 80
direct
incore
xc becke88 lyp
grid xfine
tolerances tight
iterations 400
mulliken
mult 1
end
task dft
ecce_print ecce.out






title "Au20 Td Mult 1"
echo
memory total 2000 mb
restart au20
task dft hessian freq
ecce_print ecce.out

Forum Vet
number of processors
How many processors have you been using on both computers?

Clicked A Few Times
I am using 32 cores on the Cray system with
aprun -n 32 -N 32 ~/Nwchem-6.5/bin/LINUX64/nwchem nwc.res.nw >nwc.freq.out
whereas on the Ubuntu workstation I am using 6 processors with
~/openmpi/bin/mpirun -np 6 ~/Nwchem-6.5/bin/LINUX64/nwchem nwc.res.nw >nwc.freq.out
Thanks.

Forum Vet
Could you please try to replace the memory line with the following one:

memory stack 500 mb heap 200 mb global 550 mb
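
For context, the NWChem memory directive sets a per-process limit, so the demand on a node grows with the number of processes placed on it. Below is a rough sketch of that arithmetic in Python, using the process counts reported in this thread (32 processes on one Cray node via `aprun -n 32 -N 32`, 6 on the workstation). The helper name is made up for illustration, and since the physical memory of either machine is not stated in the thread, the sketch only reports the requested totals.

```python
def per_node_demand_mb(per_process_mb, processes_per_node):
    """Memory requested on one node when every process asks for per_process_mb."""
    return per_process_mb * processes_per_node


# The suggested line splits the per-process budget explicitly.
suggested_per_process_mb = 500 + 200 + 550  # stack + heap + global

# (label, per-process mb, processes on the node), all taken from this thread.
cases = [
    ("Cray, memory total 2000 mb", 2000, 32),
    ("Cray, memory total 5000 mb", 5000, 32),
    ("Cray, suggested stack/heap/global", suggested_per_process_mb, 32),
    ("Workstation, memory total 5000 mb", 5000, 6),
]

for label, per_process_mb, nproc in cases:
    total_mb = per_node_demand_mb(per_process_mb, nproc)
    print(f"{label}: {nproc} x {per_process_mb} mb = {total_mb} mb per node")
```

The comparison suggests why the same input can behave differently on the two machines: 32 processes at 5000 mb each request far more per node than 6 processes at the same setting.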

Clicked A Few Times
Thanks so much for the reply. With the suggested memory distribution, the run fails soon after CPHF is initiated. The output ends with:

iter nsub residual time
---- ------ -------- ---------
Application 110648 exit codes: 134
Application 110648 resources: utime ~148987s, stime ~22s

and the last few lines of the NWChem error file are:




PE 3 [Fri Nov 21 00:35:48 2014] [c0-0c0s6n0] [nid00012] LIBDMAPP ERROR: BAD CQE status 0x3a80000200000000, SOURCE_SSID_DREQ:MDD_INV
nwchem: ../../ga-5-3/comex/src-dmapp/comex.c:426: dmapp_network_lock: Assertion `dmapp_status == DMAPP_RC_SUCCESS' failed.
_pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s6n0] [Fri Nov 21 00:35:48 2014] PE RANK 1 exit signal Aborted
[NID 00012] 2014-11-21 00:36:03 Apid 110648: initiated application termination




Clicked A Few Times
Any suggestion would be of great help. Thanks in advance.

Forum Vet
Cray will release a new version of the Global Arrays soon.
I have tested a pre-release and your input seems to run OK with it.
Please stay tuned

Clicked A Few Times
Thanks so much for the reply. I appreciate your efforts.

Clicked A Few Times
Would you be so kind as to provide a link or some information so that I can follow the release
of the new version of Global Arrays? Thanks.

Forum Vet
Ryan Olson has brought back the Global Arrays repository that is optimized for Cray DMAPP:
https://github.com/ryanolson/ga

Here are the instructions for compiling with ARMCI_NETWORK=DMAPP:

cd $NWCHEM_TOP/src/tools
wget https://github.com/ryanolson/ga/archive/cray.zip -O cray.zip
unzip cray.zip

Then set GA_DIR. In sh/bash:

export GA_DIR=ga-cray

In csh/tcsh:

setenv GA_DIR ga-cray

Finally, build:

make FC=ftn

