Error Running CPHF Module

Clicked A Few Times

12:33:47 AM PST - Mon, Nov 10th 2014
Dear All, I am running NWChem 6.5 on a Cray system (release string: Mazama Tue Feb 21 03:35:44 CST 2012 on hssbld4 by bwdev, lsb-cray-mazama-6.0.0, SUSE Linux Enterprise Server 11 (x86_64), VERSION = 11). While running the CPHF module for an analytic frequency calculation, the run terminates without any error message during the Hessian 2e second-derivative step. However, I can successfully compute the frequencies with the same input on an x86_64 workstation running Ubuntu 12.04.4 LTS (Precise Pangolin) after allocating 5000 mb of total memory. I would be grateful if someone could suggest what is going wrong. Thanks in advance.

Forum Vet

12:05:21 PM PST - Mon, Nov 10th 2014
Could you be more specific about the failures observed on the Cray system? Are you sure that nothing was reported in the error log? Did PBS report any error code? Could the job have been killed by the kernel's OOM (out-of-memory) killer?

Clicked A Few Times

2:35:14 AM PST - Wed, Nov 12th 2014
Thanks for the reply. When the run on the Cray system was allocated a total memory of 2000 mb, the last few lines of the NWChem.6.5.e29089 error file were the following:

...
libhugetlbfs [nid00051:8126]: WARNING: New heap segment map at 0x10067800000 failed: Cannot allocate memory
[NID 00051] 2014-11-12 03:17:16 Apid 109986: initiated application termination
[NID 00051] 2014-11-12 14:39:25 Apid 109986: OOM killer terminated this process.

However, when the run on the Cray system was allocated a total memory of 5000 mb, it terminated immediately, and the nwc.out file contained the message:

gethugepagesize()=8388608
hugetlb_default_page_size=8388608
_SC_PAGESIZE=4096
comex_page_size=8388608
comex_is_using_huge_pages=1
malloc_is_using_huge_pages=1
argument 1 = nwc.res.freq.nw
nwchem.F: ma_init failed (ga_uses_ma=F) 911
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------

and the last few lines of the error file were:

Rank 11 [Wed Nov 12 03:23:17 2014] [c0-0c1s6n1] application called MPI_Abort(comm=0x84000002, 911) - process 11
libhugetlbfs [nid00051:8249]: WARNING: New heap segment map at 0x10007000000 failed: Cannot allocate memory
...
libhugetlbfs [nid00051:8235]: WARNING: New heap segment map at 0x10007000000 failed: Cannot allocate memory
_pmiu_daemon(SIGCHLD): [NID 00051] [c0-0c1s6n1] [Wed Nov 12 03:23:17 2014] PE RANK 11 exit signal Aborted
[NID 00051] 2014-11-12 03:31:10 Apid 110084: initiated application termination
!!!!!!!!!!!!!C

Would you be so kind as to help me work out where things are going wrong? Thanks.
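
As a rough aid for reading logs like the ones quoted in this post, the following Python sketch filters an error file for memory-related messages. The function name `find_memory_failures` and the marker list are illustrative assumptions, not part of NWChem or the Cray tools; the marker strings themselves are copied from the log excerpts in this post.

```python
# Hypothetical helper (not part of NWChem): flag memory-related lines in a job log.
# The marker strings are taken verbatim from the log excerpts quoted in this thread.
MEMORY_MARKERS = (
    "OOM killer terminated this process",
    "Cannot allocate memory",
    "ma_init failed",
)


def find_memory_failures(lines):
    """Return (line_number, text) pairs for lines containing any marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MEMORY_MARKERS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = [
        "libhugetlbfs [nid00051:8126]: WARNING: New heap segment map at 0x10067800000 failed: Cannot allocate memory",
        "[NID 00051] 2014-11-12 03:17:16 Apid 109986: initiated application termination",
        "[NID 00051] 2014-11-12 14:39:25 Apid 109986: OOM killer terminated this process.",
    ]
    for number, text in find_memory_failures(sample):
        print(number, text)
```

To scan a real log, an open file object can be passed in place of the sample list, since iterating over a file yields its lines.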

Forum Vet

10:45:40 AM PST - Wed, Nov 12th 2014
Since the 2 GB run was killed by the OOM killer ([NID 00051] 2014-11-12 14:39:25 Apid 109986: OOM killer terminated this process.), increasing the memory to 5 GB is only going to make things worse. Could you please post your input file? Thanks, Edo

Clicked A Few Times

11:10:14 PM PST - Wed, Nov 12th 2014
Thanks for the reply. Please find the input files below. The first file is the parent nw input, followed by the restart file used to compute the frequencies analytically.

Parent input:

title "Au20 SDD ECP Mult 1"
memory total 2000 mb
echo
start au20
charge 0
geometry units bohr
  symmetry Td print
  Au -2.11667583358231 -2.11667583358231 -2.11667583358231
  Au 2.11667583358231 2.11667583358231 -2.11667583358231
  Au -1.78967191122259 1.78967191122259 -5.76900626570761
  Au -2.11667583358231 2.11667583358231 2.11667583358231
  Au -1.78967191122259 5.76900626570761 -1.78967191122259
  Au -5.76900626570761 1.78967191122259 -1.78967191122259
  Au 1.78967191122259 -1.78967191122259 -5.76900626570761
  Au -5.41833083216091 5.41833083216091 -5.41833083216091
  Au -5.76900626570761 -1.78967191122259 1.78967191122259
  Au 5.41833083216091 -5.41833083216091 -5.41833083216091
  Au 1.78967191122259 1.78967191122259 5.76900626570761
  Au 5.76900626570761 1.78967191122259 1.78967191122259
  Au 5.41833083216091 5.41833083216091 5.41833083216091
  Au -1.78967191122259 -1.78967191122259 5.76900626570761
  Au 2.11667583358231 -2.11667583358231 2.11667583358231
  Au -5.41833083216091 -5.41833083216091 5.41833083216091
  Au -1.78967191122259 -5.76900626570761 1.78967191122259
  Au 1.78967191122259 -5.76900626570761 -1.78967191122259
  Au 5.76900626570761 -1.78967191122259 -1.78967191122259
  Au 1.78967191122259 5.76900626570761 1.78967191122259
end
basis "ao basis" spherical PRINT REL
  Au library stuttgart_rsc_1997_ecp
end
ECP
  Au library stuttgart_rsc_1997_ecp
END
dft
  convergence damp 80
  direct
  incore
  xc becke88 lyp
  grid xfine
  tolerances tight
  iterations 400
  mulliken
  mult 1
end
task dft
ecce_print ecce.out

Restart input:

title "Au20 Td Mult 1"
echo
memory total 2000 mb
restart au20
task dft hessian freq
ecce_print ecce.out

Forum Vet

11:23:09 AM PST - Thu, Nov 13th 2014
number of processors
How many processors have you been using on both computers?

Clicked A Few Times

10:44:55 PM PST - Thu, Nov 13th 2014
I am using 32 cores on the Cray system with `aprun -n 32 -N 32 ~/Nwchem-6.5/bin/LINUX64/nwchem nwc.res.nw >nwc.freq.out`, whereas on the Ubuntu workstation I am using 6 processors with `~/openmpi/bin/mpirun -np 6 ~/Nwchem-6.5/bin/LINUX64/nwchem nwc.res.nw >nwc.freq.out`. Thanks.

Forum Vet

4:15:30 PM PST - Fri, Nov 14th 2014
Hde, could you please try replacing the memory line with the following one: `memory stack 500 mb heap 200 mb global 550 mb`
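
To compare the memory settings tried in this thread, here is a small Python sketch of the arithmetic. It assumes that the NWChem `memory` directive sets a per-process limit and that `aprun -n 32 -N 32` places all 32 processes on a single node; the function name `node_demand_mb` is illustrative only. The physical memory of the Cray node is not stated in the thread, so no limit is checked here.

```python
# Illustrative arithmetic only; assumes `memory` is a per-process limit in NWChem
# and that all processes of the job share a single node (aprun -n 32 -N 32).

def node_demand_mb(per_process_mb, processes_per_node):
    """Aggregate memory the job may request on one node, in MB."""
    return per_process_mb * processes_per_node


PROCESSES_PER_NODE = 32  # from the aprun command quoted in the thread

settings = {
    "memory total 2000 mb": 2000,
    "memory total 5000 mb": 5000,
    # stack + heap + global gives the per-process total for the split form
    "memory stack 500 mb heap 200 mb global 550 mb": 500 + 200 + 550,
}

for directive, per_process in settings.items():
    total = node_demand_mb(per_process, PROCESSES_PER_NODE)
    print(f"{directive}: {per_process} MB x {PROCESSES_PER_NODE} = {total} MB per node")
```

Under these assumptions the three settings correspond to 64000 MB, 160000 MB, and 40000 MB per node, which is consistent with the remark that raising the limit to 5 GB could only make an out-of-memory failure worse.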

Clicked A Few Times

11:42:42 PM PST - Thu, Nov 20th 2014

Thanks so much for the reply. With the suggested memory distribution, the run fails soon after CPHF is initiated:

  iter   nsub   residual    time
  ----  ------  --------  ---------
Application 110648 exit codes: 134
Application 110648 resources: utime ~148987s, stime ~22s

and the last few lines of the NWChem error file are:

PE 3 [Fri Nov 21 00:35:48 2014] [c0-0c0s6n0] [nid00012] LIBDMAPP ERROR: BAD CQE status 0x3a80000200000000, SOURCE_SSID_DREQ:MDD_INV

nwchem: ../../ga-5-3/comex/src-dmapp/comex.c:426: dmapp_network_lock: Assertion `dmapp_status == DMAPP_RC_SUCCESS' failed.
_pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s6n0] [Fri Nov 21 00:35:48 2014] PE RANK 1 exit signal Aborted
[NID 00012] 2014-11-21 00:36:03 Apid 110648: initiated application termination

Clicked A Few Times

12:44:50 AM PST - Tue, Nov 25th 2014
Any suggestion would be of great help. Thanks in advance.

Forum Vet

10:02:30 AM PST - Tue, Nov 25th 2014
Hde, Cray will release a new version of Global Arrays soon. I have tested a pre-release, and your input seems to run OK with it. Please stay tuned.

Clicked A Few Times

11:11:55 PM PST - Tue, Nov 25th 2014
Thanks so much for the reply. I appreciate your efforts regarding the same.

Clicked A Few Times

12:40:49 AM PST - Wed, Dec 3rd 2014
Would you be so kind as to provide a link or some information so that I can follow the release of the new version of Global Arrays? Thanks.

Forum Vet

4:36:25 PM PST - Wed, Dec 3rd 2014
Ryan Olson has brought back the Global Arrays repository that is optimized for Cray DMAPP: https://github.com/ryanolson/ga

Here are the instructions for compiling with ARMCI_NETWORK=DMAPP:

cd $NWCHEM_TOP/src/tools
wget https://github.com/ryanolson/ga/archive/cray.zip -O cray.zip
unzip cray.zip
export GA_DIR=ga-cray (sh/bash)
setenv GA_DIR ga-cray (csh/tcsh)
make FC=ftn
