MA error: MA init: could not allocate xxxxxxxxxx bytes


Clicked A Few Times
Hi,
I successfully compiled NWChem-6.0 (stable release) on a Fuji cluster with 8 Intel cores and 24 GB of memory per node. However, I run into problems when I run the QA tests on more than 1 CPU.

For example, when I try to run autosym in QA/tests with 2 CPUs within the same node, I get this memory error message:

id=-1 size=33554432
******************* ARMCI INFO ************************
The application attempted to allocate a shared memory segment of 33554432 bytes in size. This might be in addition to segments that were allocated succesfully previously. The current system configuration does not allow enough shared memory to be allocated to the application.
This is most often caused by:
1) system parameter SHMMAX (largest shared memory segment) being too small or
2) insufficient swap space.
Please ask your system administrator to verify if SHMMAX matches the amount of memory needed by your application and the system has sufficient amount of swap space. Most UNIX systems can be easily reconfigured to allow larger shared memory segments,
see http://www.emsl.pnl.gov/docs/global/support.html
In some cases, the problem might be caused by insufficient swap space.
*******************************************************
0:allocate: failed to create shared region : -1
(rank:0 hostname:fuji373 pid:27903):ARMCI DASSERT fail. shmem.c:armci_allocate():1082 cond:0
Last System Error Message from Task 0:: Cannot allocate memory
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

This system has SHMMAX= 68719476736 with NO SWAP SPACE!

On the other hand, if the job is run across nodes, I get a different set of error messages:

ARMCI configured for 2 cluster nodes. Network protocol is 'OpenIB Verbs API'.
argument 1 = autosym.nw
MA error: MA_init: could not allocate 1555256528 bytes
______________________________________________
nwchem.F: ma_init failed (ga_uses_ma=F) 911
______________________________________________
______________________________________________
current input line :
0: end
______________________________________________
______________________________________________
______________________________________________
For more information see the NWChem manual at
http://nwchemgit.github.io/index.php/NWChem_Documentation


For further details see manual section:
...

Please let me know what I can do to run the QA test. Thanks

Forum Vet
On problem 1: Please check with your administrator what the SHMMAX settings are for the system. Looks like they are set very small.
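
One way to look at these limits on a Linux node is to read them from /proc. The following is a minimal sketch, not part of NWChem or ARMCI: the helper names (`read_int`, `swap_total_kb`, `fits_in_one_segment`) are made up for illustration, and it assumes the standard Linux locations /proc/sys/kernel/shmmax, /proc/sys/kernel/shmall and /proc/meminfo.

```python
import os

def read_int(path):
    """Return the first integer stored in a /proc file, or None if unavailable."""
    try:
        with open(path) as fh:
            return int(fh.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None

def swap_total_kb(meminfo_text):
    """Extract the SwapTotal value (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return None

def fits_in_one_segment(request_bytes, shmmax_bytes):
    """True if a single shared-memory request does not exceed SHMMAX."""
    return request_bytes <= shmmax_bytes

if __name__ == "__main__":
    shmmax = read_int("/proc/sys/kernel/shmmax")   # largest single segment, bytes
    shmall = read_int("/proc/sys/kernel/shmall")   # system-wide limit, in pages
    print("SHMMAX (bytes):", shmmax)
    print("SHMALL (pages):", shmall)
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as fh:
            print("SwapTotal (kB):", swap_total_kb(fh.read()))
    if shmmax is not None:
        # 33554432 bytes is the segment size quoted in the ARMCI message.
        print("32 MiB segment fits:", fits_in_one_segment(33554432, shmmax))
```

With the numbers reported in this thread (request of 33554432 bytes, SHMMAX of 68719476736), the single request alone does not exceed the limit; as the ARMCI message itself notes, it may come on top of segments allocated earlier.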

On problem 2: What did you set as the default memory size when you built NWChem (-DDFLT_TOT_MEM)? It looks like you set it very high, as it tries to allocate 1.5 GByte of local memory.
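
To put the size in the MA error in perspective, the sketch below converts it to 8-byte doubles and to megabytes. It assumes that MA sizes are counted in 8-byte doubles and that "Mbytes" means 2^20 bytes, which matches NWChem's own memory report format (for example "13107201 doubles = 100.0 Mbytes"). The function names are illustrative only.

```python
BYTES_PER_DOUBLE = 8      # assumption: MA counts memory in 8-byte doubles
MIB = 1024 ** 2           # assumption: "Mbytes" in NWChem output means 2**20 bytes

def bytes_to_doubles(n_bytes):
    """Number of whole 8-byte doubles in n_bytes."""
    return n_bytes // BYTES_PER_DOUBLE

def bytes_to_mib(n_bytes):
    """Size in MiB as a float."""
    return n_bytes / MIB

# Size taken from "MA_init: could not allocate 1555256528 bytes".
failed_request = 1555256528
print(bytes_to_doubles(failed_request), "doubles")
print(round(bytes_to_mib(failed_request), 1), "MiB")
```

Comparing this figure with the memory actually available per process on the node is a quick way to judge whether the compiled-in default is plausible.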

Hence, please provide some details on the build process, environment variables, and settings.

Bert


Quote:Chiensh Nov 3rd 3:08 am

Clicked A Few Times
Hi,

Thank you, Bert! I believe that I have figured out the problem:

The main cause was that SHMMAX had originally been set to 32GB, which is half of the total memory of another Linux cluster at our site. So someone copied it over here when the vendor set up this cluster.

We simply set SHMMAX=12GB, which is half of the total memory of this 8-core node, but that resulted in an out-of-memory error in ARMCI when more than 2 cores were used.

Setting SHMMAX to half of the physical memory PER CORE (i.e. 1.5GB) fixed the memory problem for the stable release of NWChem, and it now runs well with 16 cores across 2 nodes.
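
The sizing rule described here (half of the physical memory, divided per core) works out as follows. This is a sketch of the arithmetic only: it assumes that "24G" and "1.5GB" mean binary gigabytes (2^30 bytes), the function name is hypothetical, and the rule is simply what worked on this cluster, not a general NWChem recommendation.

```python
GIB = 1024 ** 3  # assumption: sizes in the post are binary gigabytes

def shmmax_per_core(node_mem_bytes, cores_per_node, fraction=0.5):
    """A fraction (default: half) of the node's memory, divided by its core count."""
    return int(node_mem_bytes * fraction) // cores_per_node

node_mem = 24 * GIB          # 24 GB per node, as stated in the first post
value = shmmax_per_core(node_mem, 8)
print(value, "bytes =", value / GIB, "GiB")

# On Linux an administrator would typically apply such a value with
#   sysctl -w kernel.shmmax=<value>
```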

Now I have another problem with the most recent developer release (Oct-25).
Regardless of which math library I link to this code (MKL ver. 10, 11 and 2011 (12), or GotoBLAS), I get a runtime error:

** On entry to DGEMM  parameter number 13 had an illegal value

For example, with h2o_opt.nw in QA:
        Job information
---------------
hostname = fuji373
program = ../nwchem-src-2011-Oct-25/bin/LINUX64/nwchem
date = Thu Nov 10 17:49:12 2011
compiled = Thu_Nov_10_17:30:05_2011
source = /home/GENERAL/chiensh/NWCHEM/nwchem-src-2011-Oct-25
nwchem branch = Development
input = h2o_opt.nw
prefix = h2o_opt_dat.
data base = ./h2o_opt_dat.db
status = startup
nproc = 16
time left = -1s

Memory information
------------------
heap = 13107201 doubles = 100.0 Mbytes
stack = 13107201 doubles = 100.0 Mbytes
global = 26214400 doubles = 200.0 Mbytes (distinct from heap & stack)
total = 52428802 doubles = 400.0 Mbytes
verify = yes
hardfail = no

Directory information
---------------------

0 permanent = .
0 scratch = .

NWChem Input Module
-------------------


ncenter= 3

Scaling coordinates for geometry "h2o_c1" by 1.889725989
(inverse scale = 0.529177249)

Turning off AUTOSYM since
SYMMETRY directive was detected!

Geometry "h2o_c1" -> ""
-----------------------

Output coordinates in angstroms (scale by 1.889725989 to convert to a.u.)

No. Tag Charge X Y Z
---- ---------------- ---------- -------------- -------------- --------------
1 O 8.0000 -0.11412678 0.00000000 -0.08291796
2 H 1.0000 -0.11412678 0.00000000 1.11708204
3 H 1.0000 1.02714104 0.00000000 -0.45373835

Atomic Mass
-----------

O 15.994910
H 1.007825
... ...
... ...
NWChem SCF Module
-----------------

ao basis = "ao basis"
functions = 19
atoms = 3
closed shells = 5
open shells = 0
charge = 0.00
wavefunction = RHF
input vectors = atomic
output vectors = ./h2o_opt_dat.movecs
use symmetry = F
symmetry adapt = F


Summary of "ao basis" -> "ao basis" (cartesian)
------------------------------------------------------------------------------
Tag Description Shells Functions and Types
---------------- ------------------------------ ------ ---------------------
O 6-31G* 6 15 3s2p1d
H 6-31G* 2 2 2s

Forming initial guess at 6.4s

Superposition of Atomic Density Guess
-------------------------------------

Sum of atomic energies: -75.75081731
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value

Non-variational initial energy
------------------------------
Total energy = -75.502009
1-e energy = -118.566474
2-e energy = 35.736228
HOMO = 0.000000
LUMO = 0.000000

Starting SCF solution at 6.5s

----------------------------------------------
Quadratically convergent ROHF

Convergence threshold  : 1.000E-04
Maximum no. of iterations : 30
Final Fock-matrix accuracy: 1.000E-07
----------------------------------------------

** On entry to DGEMM parameter number 13 had an illegal value

#quartets = 1.540D+03 #integrals = 7.659D+03 #direct = 0.0% #cached =100.0%

Integral file = ./h2o_opt_dat.aoints.00
Record size in doubles = 65536 No. of integs per rec = 43688
Max. records in memory = 2 Max. records in file = 144921
No. of bits per label = 8 No. of bits per value = 64


File balance: exchanges= 0 moved= 0 time= 0.0

** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value

iter energy gnorm gmax time
----- ------------------- --------- --------- --------
1 7.3282379250 0.00D+00 0.00D+00 2.1
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
** On entry to DGEMM parameter number 13 had an illegal value
4:Segmentation Violation error, status=: 11
(rank:4 hostname:fuji373 pid:480):ARMCI DASSERT fail. ../../ga-5-0/armci/src/signaltrap.c:SigSegvHandler():312 cond:0

I am not sure if this is a known issue or not.

Regards
Dominic Chien

Forum Vet
This is an integer*8 vs integer*4 issue. Your math libraries are probably 32-bit whereas you compiled with -i8?!

Bert
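
As a rough illustration of why an integer-width mismatch can produce "illegal value" messages, the sketch below reinterprets the same bytes as 4-byte and as 8-byte integers. It is not NWChem or BLAS code. It assumes little-endian byte order, uses 19 (the number of basis functions in the log) purely as an example value, and says nothing about which side of this particular build uses the narrower integers. The argument list is that of the reference BLAS DGEMM, where position 13 is LDC.

```python
import struct

# Argument order of the reference BLAS routine DGEMM.
DGEMM_ARGS = ["TRANSA", "TRANSB", "M", "N", "K", "ALPHA",
              "A", "LDA", "B", "LDB", "BETA", "C", "LDC"]

def dgemm_arg_name(position):
    """Name of the DGEMM argument at a 1-based position."""
    return DGEMM_ARGS[position - 1]

def read_as_int64(two_int32s):
    """Reinterpret two adjacent little-endian 4-byte integers as one 8-byte integer."""
    raw = struct.pack("<ii", *two_int32s)
    return struct.unpack("<q", raw)[0]

def read_as_int32_pair(one_int64):
    """Reinterpret one little-endian 8-byte integer as two 4-byte integers."""
    raw = struct.pack("<q", one_int64)
    return struct.unpack("<ii", raw)

print(dgemm_arg_name(13))          # which argument "parameter number 13" refers to
print(read_as_int64((19, 19)))     # two 4-byte values seen as a single 8-byte value
print(read_as_int32_pair(19))      # one 8-byte value seen as two 4-byte values
```

A dimension that arrives as a huge or zero value because of such a reinterpretation would fail the argument checks at the start of DGEMM, which is consistent with the message in the log.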


Quote:Chiensh Nov 10th 10:31 am

Clicked A Few Times
Quote:Bert Nov 11th 9:23 pm

Thanks Bert.

I will check the math libraries, but the stable release of NWChem (which was compiled with exactly the same settings) runs well with the same GotoBLAS.

Regards,
Dominic
