Hi,
Thank you, Bert! I believe I have figured out the problem:
The main cause was that SHMMAX had originally been set to 32GB, which is half of the total memory of another Linux cluster at our site. Someone apparently copied that value over when the vendor set up this cluster.
We first simply set SHMMAX=12GB, which is half of the total memory of this 8-core node, but that resulted in an out-of-memory error in ARMCI whenever more than 2 cores were used.
Setting SHMMAX to half of the physical memory PER CORE (i.e. 1.5GB) has fixed the memory problem for the stable release of NWChem, and it now runs well with 16 cores across 2 nodes.
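For anyone who wants to reproduce the arithmetic, here is a minimal sketch. It assumes the node has 24GB in total (inferred from 12GB being half of it), treats 1GB as 1024^3 bytes, and assumes the value is entered in bytes, as the Linux kernel.shmmax setting expects. The function name shmmax_bytes is only an illustration, not part of NWChem or ARMCI.

```python
def shmmax_bytes(total_mem_gib, cores_per_node):
    """Half of the physical memory per core, in bytes.

    Assumes total_mem_gib is a whole number of GiB (1 GiB = 1024**3 bytes).
    """
    total_bytes = total_mem_gib * 1024**3
    per_core_bytes = total_bytes // cores_per_node
    return per_core_bytes // 2


# 24 GiB node with 8 cores: 3 GiB per core, so 1.5 GiB for SHMMAX
print(shmmax_bytes(24, 8))  # 1610612736
```

The resulting number is what I would then pass to the kernel setting (for example via sysctl kernel.shmmax); whether half of the per-core memory is the right value for other clusters is something I have not checked.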
Now I have run into another problem with the most recent developer release (Oct-25).
Regardless of which math library I link this code against (MKL ver. 10, 11 and 2011 (12), or GotoBLAS), I get the following runtime error:
** On entry to DGEMM parameter number 13 had an illegal value
e.g. h2o_opt.nw in QA
Job information
---------------
hostname = fuji373
program = ../nwchem-src-2011-Oct-25/bin/LINUX64/nwchem
date = Thu Nov 10 17:49:12 2011
compiled = Thu_Nov_10_17:30:05_2011
source = /home/GENERAL/chiensh/NWCHEM/nwchem-src-2011-Oct-25
nwchem branch = Development
input = h2o_opt.nw
prefix = h2o_opt_dat.
data base = ./h2o_opt_dat.db
status = startup
nproc = 16
time left = -1s
Memory information
------------------
heap = 13107201 doubles = 100.0 Mbytes
stack = 13107201 doubles = 100.0 Mbytes
global = 26214400 doubles = 200.0 Mbytes (distinct from heap & stack)
total = 52428802 doubles = 400.0 Mbytes
verify = yes
hardfail = no
Directory information
---------------------
0 permanent = .
0 scratch = .
NWChem Input Module
-------------------
ncenter= 3
Scaling coordinates for geometry "h2o_c1" by 1.889725989
(inverse scale = 0.529177249)
Turning off AUTOSYM since
SYMMETRY directive was detected!
Geometry "h2o_c1" -> ""
-----------------------
Output coordinates in angstroms (scale by 1.889725989 to convert to a.u.)
No. Tag Charge X Y Z
---- ---------------- ---------- -------------- -------------- --------------
1 O 8.0000 -0.11412678 0.00000000 -0.08291796
2 H 1.0000 -0.11412678 0.00000000 1.11708204
3 H 1.0000 1.02714104 0.00000000 -0.45373835
Atomic Mass
-----------
O 15.994910
H 1.007825
... ...
... ...
NWChem SCF Module
-----------------
ao basis = "ao basis"
functions = 19
atoms = 3
closed shells = 5
open shells = 0
charge = 0.00
wavefunction = RHF
input vectors = atomic
output vectors = ./h2o_opt_dat.movecs
use symmetry = F
symmetry adapt = F
Summary of "ao basis" -> "ao basis" (cartesian)
------------------------------------------------------------------------------
Tag Description Shells Functions and Types
---------------- ------------------------------ ------ ---------------------
O 6-31G* 6 15 3s2p1d
H 6-31G* 2 2 2s
Forming initial guess at 6.4s
Superposition of Atomic Density Guess
-------------------------------------
Sum of atomic energies: -75.75081731
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
Non-variational initial energy
------------------------------
Total energy = -75.502009
1-e energy = -118.566474
2-e energy = 35.736228
HOMO = 0.000000
LUMO = 0.000000
Starting SCF solution at 6.5s
----------------------------------------------
Quadratically convergent ROHF
Convergence threshold : 1.000E-04
Maximum no. of iterations : 30
Final Fock-matrix accuracy: 1.000E-07
----------------------------------------------
** On entry to DGEMM parameter number 13 had an illegal value
#quartets = 1.540D+03 #integrals = 7.659D+03 #direct = 0.0% #cached =100.0%
Integral file = ./h2o_opt_dat.aoints.00
Record size in doubles = 65536 No. of integs per rec = 43688
Max. records in memory = 2 Max. records in file = 144921
No. of bits per label = 8 No. of bits per value = 64
File balance: exchanges= 0 moved= 0 time= 0.0
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
iter energy gnorm gmax time
----- ------------------- --------- --------- --------
1 7.3282379250 0.00D+00 0.00D+00 2.1
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
4:Segmentation Violation error, status=: 11
(rank:4 hostname:fuji373 pid:480):ARMCI DASSERT fail. ../../ga-5-0/armci/src/signaltrap.c:SigSegvHandler():312 cond:0
I am not sure if this is a known issue or not.
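For reference, this is how I read the message, written as a small sketch. In the reference BLAS interface, DGEMM takes its arguments in the order listed below, so parameter number 13 is LDC, the leading dimension of the output matrix C, and the reference implementation reports it as illegal when LDC < max(1, M). The helper names dgemm_arg_name and ldc_is_valid are only illustrations; I am assuming MKL and GotoBLAS follow the same numbering, and this does not tell me why NWChem passes a bad value.

```python
# Argument order of the reference BLAS routine:
# DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
DGEMM_ARGS = [
    "TRANSA", "TRANSB", "M", "N", "K", "ALPHA",
    "A", "LDA", "B", "LDB", "BETA", "C", "LDC",
]


def dgemm_arg_name(position):
    """Map the 1-based parameter number from the error message to its name."""
    return DGEMM_ARGS[position - 1]


def ldc_is_valid(ldc, m):
    """Check the reference BLAS asserts for LDC: it must be at least max(1, M)."""
    return ldc >= max(1, m)


print(dgemm_arg_name(13))  # LDC
```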
Regards
Dominic Chien