DFT Frequency Analysis failing on Infiniband Cluster


Clicked A Few Times
When I run a DFT frequency analysis of a molecule (around 50 atoms and 696 basis functions), the job systematically fails on 40, 60, 80 or 120 cores, but it finishes normally on 20 or 160 cores. I am using 2 GB of total memory per core. Can anyone advise me on how to solve this problem?

This is the error message:

20: error ival=4
(rank:20 hostname:11 pid:62701):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_call_data_server():2189 cond:(pdscr->status==IBV_WC_SUCCESS)
40: error ival=4
(rank:40 hostname:12 pid:67120):ARMCI DASSERT fail. ../../ga-5-2/armci/src/devices/openib/openib.c:armci_call_data_server():2189 cond:(pdscr->status==IBV_WC_SUCCESS)

Thanks

Alfredo

Description of the cluster:
Infiniband Cluster
20 cores per node
64 GB per node

NWChem information
nwchem branch   = 6.3
nwchem revision = 24652
ga revision = 10379
Compiled using ifort (IFORT) 14.0.2 20140120 and OpenMPI 1.6.5

Forum Vet
Aguevara
Your NWChem failure is likely to be memory related.
I have a few related questions to ask you:
Did you set the environment variable ARMCI_DEFAULT_SHMMAX?
What is the memory line in your input file?

Thanks, Edo

Clicked A Few Times
Edo
The memory line in my input file is:

memory total 2048 Mb

I tried both with ARMCI_DEFAULT_SHMMAX=65536 and without setting it.
The SHMMAX on my system is:
[ag@nc9 ~]$ cat /proc/sys/kernel/shmmax
68719476736
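
As a quick sanity check (a sketch, not part of the original report), the kernel limit above can be converted from bytes to MB, the unit that the ARMCI messages in this thread use for ARMCI_DEFAULT_SHMMAX. The function name bytes_to_mb is hypothetical.

```shell
# Convert a byte count (as printed by /proc/sys/kernel/shmmax) to MB.
bytes_to_mb() {
    echo $(( $1 / 1024 / 1024 ))
}

bytes_to_mb 68719476736   # prints 65536, i.e. 64 GB, the memory of one node
```

So the kernel allows a shared memory segment as large as the whole node memory, which is why 65536 looked like a natural value to try.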

Thanks

Alfredo

Forum Vet
Alfredo
I think I have experienced similar problems myself on InfiniBand networks.
To better understand what is going on, we need to take a closer look at the problem.
1) Did you see any other error or warning message (in either the error or the output file), for example related to memory?
2) Your kernel setting is correct; however, something on the openib side might be preventing the value from being set correctly.
Therefore, we need to check what value of SHMMAX is actually used during your NWChem runs.
To do this, I suggest that you recompile the tools after applying a patch.
Here is what you should do:
1) cd $NWCHEM_TOP/src/
2) wget http://nwchemgit.github.io/images/Reportshmmax.patch.gz
3) gzip -d Reportshmmax.patch.gz
4) patch -p0 < Reportshmmax.patch
5) cd tools/build
6) make install
7) cd ../..
8) make link

If you now run NWChem, the initial part of the output file should contain a line that reports the value of SHMMAX.
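
To pull the reported values out of a long output file, something like the following could be used. This is only a sketch: it assumes the reporting lines have the form "<rank> using x=<MB> SHMMAX=<KB>KB", as in the results posted further down this thread. The function name reported_shmmax_kb and the file name in the usage comment are hypothetical.

```shell
# Print the distinct SHMMAX values (in KB) reported in an NWChem output file.
# Assumes reporting lines of the form "<rank> using x=<MB> SHMMAX=<KB>KB".
reported_shmmax_kb() {
    grep -o 'SHMMAX=[0-9]*KB' "$1" | sed 's/SHMMAX=\([0-9]*\)KB/\1/' | sort -un
}

# Usage (hypothetical output file name):
#   reported_shmmax_kb fen_07.out
```

If more than one value is printed, the ranks are not all using the same SHMMAX.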

Once we are sure which value of SHMMAX is being used, we might have to look at memory registration problems in openib.

Clicked A Few Times
Hi Edo,
These are the results:

Using ARMCI_DEFAULT_SHMMAX=65536

argument  1 = fen_07.in
incorrect ARMCI_DEFAULT_SHMMAX should be <1,ARMCI_DEFAULT_SHMMAX>mb and 2^N Found=65536
0 using x=256 SHMMAX=262144KB
incorrect ARMCI_DEFAULT_SHMMAX should be <1,ARMCI_DEFAULT_SHMMAX>mb and 2^N Found=65536
incorrect ARMCI_DEFAULT_SHMMAX should be <1,ARMCI_DEFAULT_SHMMAX>mb and 2^N Found=65536
60 using x=256 SHMMAX=262144KB
40 using x=256 SHMMAX=262144KB
incorrect ARMCI_DEFAULT_SHMMAX should be <1,ARMCI_DEFAULT_SHMMAX>mb and 2^N Found=65536
20 using x=256 SHMMAX=262144KB


Using ARMCI_DEFAULT_SHMMAX=20480 (1024*20 cores)

argument 1 = fen_07.in
incorrect ARMCI_DEFAULT_SHMMAX should be <1,ARMCI_DEFAULT_SHMMAX>mb and 2^N Found=20480
0 using x=256 SHMMAX=262144KB
incorrect ARMCI_DEFAULT_SHMMAX should be <1,ARMCI_DEFAULT_SHMMAX>mb and 2^N Found=20480
incorrect ARMCI_DEFAULT_SHMMAX should be <1,ARMCI_DEFAULT_SHMMAX>mb and 2^N Found=20480
incorrect ARMCI_DEFAULT_SHMMAX should be <1,ARMCI_DEFAULT_SHMMAX>mb and 2^N Found=20480
20 using x=256 SHMMAX=262144KB
60 using x=256 SHMMAX=262144KB
40 using x=256 SHMMAX=262144KB

Using ARMCI_DEFAULT_SHMMAX=8096
argument  1 = fen_07.in
0 using x=8096 SHMMAX=8290304KB
20 using x=8096 SHMMAX=8290304KB
60 using x=8096 SHMMAX=8290304KB
40 using x=8096 SHMMAX=8290304KB

Is there a maximum value of ARMCI_DEFAULT_SHMMAX that I can use?

Thanks for your help

Alfredo

Forum Vet
Quote:Aguevara Jul 18th 11:58 am
Is there a maximum value of ARMCI_DEFAULT_SHMMAX that I can use?

With the vanilla 6.3 code, the maximum value turns out to be 8192.
If you want to increase it to 65536, you need to recompile the tools with
the option MAYBE_FFLAGS="ARMCI_DEFAULT_SHMMAX_UBOUND=65536".
In other words, here is what you need to do:

1) cd $NWCHEM_TOP/src/tools
2) rm -rf build
3) make MAYBE_FFLAGS="ARMCI_DEFAULT_SHMMAX_UBOUND=65536"
...
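
The behaviour seen in the output earlier in this thread can be summarized in a small sketch: a request within 1 to 8192 MB is used as given, and anything outside that range falls back to 256 MB. This only mimics the logged output and is not ARMCI's source code; the function name effective_shmmax_mb and its optional second argument are hypothetical. Note that 8096 is not a power of two yet was accepted in the log, so the sketch checks only the range.

```shell
# Mimic the behaviour logged in this thread (not ARMCI's actual code):
# a request within 1..UBOUND MB is used as is, anything else falls back to 256 MB.
effective_shmmax_mb() {
    requested=$1
    ubound=${2:-8192}      # 8192 is the limit reported for vanilla 6.3
    if [ "$requested" -ge 1 ] && [ "$requested" -le "$ubound" ]; then
        echo "$requested"
    else
        echo 256
    fi
}

effective_shmmax_mb 65536   # prints 256, as in the first run above
effective_shmmax_mb 8096    # prints 8096, as in the third run above
```

This is why both 65536 and 20480 ended up as "x=256" in the output, while 8096 was taken as requested.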

Clicked A Few Times
Edo

I changed the value to 8192 and it works fine.
Thank you

Alfredo

Forum Vet
Quote:Aguevara Jul 18th 1:40 pm
I changed the value to 8192 and it works fine.

Does this mean that the calculation can run to completion?

Clicked A Few Times
That's correct.
Do you think I would see better performance if I recompiled the tools to accept values larger than 8192?
Is there a general way to determine the best value of ARMCI_DEFAULT_SHMMAX?

Alfredo

Forum Vet
Alfredo
In principle, ARMCI_DEFAULT_SHMMAX should be equal to the number of cores multiplied by the maximum amount of global memory you are going to use.
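
As a worked example of this rule (a sketch under assumptions, not a calculation from the original posts): if "number of cores" is read as cores per node, and the "memory total 2048 Mb" line is split by NWChem's default of 25% heap, 25% stack and 50% global, then each process uses 1024 MB of global memory. The function name suggested_shmmax_mb is hypothetical.

```shell
# Estimate ARMCI_DEFAULT_SHMMAX (MB) as cores per node * global memory per core.
# Assumption: global memory is half of "memory total" (NWChem's default split).
suggested_shmmax_mb() {
    cores_per_node=$1
    total_mb_per_core=$2
    global_mb_per_core=$(( total_mb_per_core / 2 ))
    echo $(( cores_per_node * global_mb_per_core ))
}

suggested_shmmax_mb 20 2048   # prints 20480
```

For this cluster the estimate is 20480, the value Alfredo tried earlier, which is above the 8192 limit of the vanilla 6.3 build.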

