QA tests and verification for nwchem 6.5 and hess/vib modules


Clicked A Few Times
I've been compiling nwchem 6.5 (Nwchem-6.5.revision26243-src.2014-09-10) on a number of different platforms:

- CentOS 5.x with OpenMPI 1.6.4 and IB and gcc 4.4
- CentOS 5.x with OpenMPI 1.6.4 and IB and Intel compilers 2013 and 2013_SP1
- CentOS 6.x with MVAPICH2 1.9 and IB (Stampede at TACC) and gcc default
- CentOS 6.x with MVAPICH2 1.9 and IB (Stampede at TACC) and Intel compilers 2013

and I've noticed some failures with the following patterns:

On the CentOS 5.x platform, compiling with the Intel 2013 compilers results in errors in the properties module. Here is a diff from the QA tests:

http://pastebin.com/ivr85r5f

Most other tests completed successfully. The errors appear to be limited to the property module tests. GCC44 and Intel 2013_SP1 work fine on this platform.

On the CentOS 6.x platform, running the QA tests on a single node with 16 processors seems to work for all QA tests with both compilers. Running the QA tests on 3 nodes with 16 processors each results in the following frequency variations for both compilers:

http://pastebin.com/XFegynU3

It looks like there could be some bugs in the compiling/linking step for the hessian/vibration portions of the code.
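A frequency comparison of this kind can be scripted instead of read off a diff. The following is a minimal sketch, assuming the frequencies from a reference run and a test run have already been extracted to plain text files with one value per line. The function name `compare_freqs`, the file layout, and the numbers in the usage lines are all made up for illustration and are not NWChem output.

```shell
#!/bin/sh
# Hypothetical helper: compare two lists of frequencies (one value per line)
# and report entries that differ by more than a tolerance.
# Arguments: $1 = reference file, $2 = test file, $3 = tolerance (same units).
# Returns non-zero if any pair differs by more than the tolerance.
compare_freqs() {
    paste "$1" "$2" | awk -v tol="$3" '
        {
            diff = $1 - $2
            if (diff < 0) diff = -diff
            if (diff > tol) {
                printf "line %d: %s vs %s\n", NR, $1, $2
                bad++
            }
        }
        END { exit (bad > 0) }
    '
}

# Usage with made-up numbers (not taken from any NWChem run).
ref=$(mktemp)
tst=$(mktemp)
printf '1625.10\n3800.25\n' > "$ref"
printf '1625.10\n3650.00\n' > "$tst"
compare_freqs "$ref" "$tst" 0.5 || echo "frequencies differ"
rm -f "$ref" "$tst"
```

A fixed absolute tolerance is the simplest choice here; a relative tolerance may suit a wider range of frequencies better.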

Is there a way to selectively control the compiling flags for specific modules? What additional information is needed to address these issues?

Thanks.

Forum Vet
Statistics
Thank you very much for the feedback.
Could you please provide more details about your installation?
For example:
- Value of the ARMCI_NETWORK variable
- Detailed compiler version (output of ifort -V or gfortran -v)
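The details requested here can be gathered in one pass. This is a minimal sketch, assuming a POSIX shell; the function name `collect_build_info` is hypothetical, and each compiler is queried only if it is found on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: print the build details requested in this thread.
collect_build_info() {
    # ARMCI_NETWORK as seen by the current shell; "(unset)" if not defined.
    echo "ARMCI_NETWORK=${ARMCI_NETWORK:-(unset)}"
    # Query each compiler only if it is installed.
    if command -v ifort >/dev/null 2>&1; then
        # ifort -V prints its version banner on stderr.
        ifort -V 2>&1 | head -n 2
    else
        echo "ifort: not found"
    fi
    if command -v gfortran >/dev/null 2>&1; then
        # gfortran -v prints its configuration on stderr.
        gfortran -v 2>&1 | tail -n 1
    else
        echo "gfortran: not found"
    fi
}

collect_build_info
```

Redirecting the output to a file makes it easy to paste into a reply.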

Forum Vet
Quote:Statics Sep 19th 10:02 am

Most other tests completed successfully. The errors look to be limited to the property module tests. GCC44 and Intel 2013_SP1 works file on this platform.


Do you mean: " ... GCC44 and Intel 2013_SP1 works fine on this platform.." ?
Are you stating that all the failures occur with Intel compilers 2013, while 2013_SP1 works fine?

I am not quite sure what you are seeing with CentOS 6.x on TACC Stampede ... do you have any compiler version that seems to work?

Once again, please provide the detailed compiler version for each case.

Cheers, Edo

Clicked A Few Times
Quote:Edoapra Sep 19th 6:48 pm
Quote:Statics Sep 19th 10:02 am

Most other tests completed successfully. The errors look to be limited to the property module tests. GCC44 and Intel 2013_SP1 works file on this platform.


Do you mean: " ... GCC44 and Intel 2013_SP1 works fine on this platform.." ?
Are you stating that all the failures occur with Intel compilers 2013, while 2013_SP1 works fine?

I am not quite sure what you are seeing with CentOS 6.x on TACC Stampede ... do you have any compiler version that seems to work?

Once again, please provide the detailed compiler version for each case.

Cheers, Edo


Yes, sorry for the typo, I meant fine. Here are the compiler versions:

CentOS 5.x:
Working:
gcc 4.4: gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
ifort 2013 SP1: ifort (IFORT) 14.0.2 20140120

Not working:
ifort 2013: ifort (IFORT) 13.0.1 20121010

CentOS 6.x: works on a single node, not working on multiple nodes. No compiler/MPI library combination has been able to successfully pass the multiple node frequency QA tests.
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
ifort (IFORT) 13.1.0 20130121

Clicked A Few Times
Quote:Edoapra Sep 19th 6:39 pm
Statistics
Thank you very much for the feedback.
Could you please provide more details about your installation?
For example
Value of ARMCI_NETWORK variable
Detailed compiler version (output of ifort -V or gfortran -v)


ARMCI_NETWORK is set to OPENIB for all builds. Would it help to have the full build environment on pastebin?

Forum Vet
ARMCI_OPENIB_DEVICE=mlx4_0
By the way, when you run NWChem on Stampede, do you set the environment variable ARMCI_OPENIB_DEVICE equal to mlx4_0?

Forum Vet
Please try the following

cd $NWCHEM_TOP/src/NWints/hondo
touch hnd_giaxyz.F
make FC=ifort FOPTIMIZE="-O0 -g" FDEBUG="-O0 -g"
cd ../..
make FC=ifort link

This should fix the prop_ch3f problem
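The same recipe answers the earlier question about controlling flags for one part of the code: touch the file, rebuild its directory with overridden FOPTIMIZE and FDEBUG, then relink. The sketch below only prints the commands for a given directory and file, so they can be reviewed before running. The function name `print_rebuild_steps` is hypothetical, and whether the recipe is appropriate for files other than hnd_giaxyz.F is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the commands that rebuild one source
# file without optimization and then relink, following the recipe above.
# Arguments: $1 = directory under $NWCHEM_TOP/src, $2 = source file name.
print_rebuild_steps() {
    subdir=$1
    file=$2
    cat <<EOF
cd \$NWCHEM_TOP/src/$subdir
touch $file
make FC=ifort FOPTIMIZE="-O0 -g" FDEBUG="-O0 -g"
cd \$NWCHEM_TOP/src
make FC=ifort link
EOF
}

print_rebuild_steps NWints/hondo hnd_giaxyz.F
```

Printing rather than executing keeps the helper safe to try on a machine without an NWChem source tree.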

Clicked A Few Times
Quote:Edoapra Sep 19th 11:21 pm
By the way, when you run nwchem on stampede do you set the environmental variable
ARMCI_OPENIB_DEVICE equal to mlx4_0?


Yes, I do set that variable.

Clicked A Few Times
CentOS 5.x summary
I have some more data regarding the issues on CentOS 5.x. Hopefully this table will help:

CentOS 5.x with OpenMPI 1.6.4, Nwchem-6.5.revision26243-src.2014-09-10


Compiler version
- gcc 4.4: gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
- ifort 2013: ifort (IFORT) 13.0.1 20121010
- ifort 2013 w/ hondo fix: ifort (IFORT) 13.0.1 20121010
- ifort 2013 SP1: ifort (IFORT) 14.0.2 20140120

Passes most tests
[per-build marks not preserved]

Failure description
- gcc 4.4: N/A
- ifort 2013: Wildly incorrect isotropic and anisotropy values (e.g. in prop_ch3f)
- ifort 2013 w/ hondo fix: N/A
- ifort 2013 SP1: TCE jobs seg fault (e.g. tce_cr_eom_t_ch_rohf)

Segmentation fault output (ifort 2013 SP1):
========================================= Excited-state calculation ( b2 symmetry)
========================================= Dim. of EOMCC iter. space 500
2:Segmentation Violation error, status=: 11
(rank:2 hostname:fermi11 pid:10806):ARMCI DASSERT fail. ../../ga-5-3/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
1:Segmentation Violation error, status=: 11


Clicked A Few Times
CentOS 6.x summary
I also have some more information regarding NWChem on CentOS 6.x (Stampede). It currently looks like the problem is related to the parallelization.

I was able to reproduce the problems using the hess_h2o QA test using the default compiled 6.3 version on the system. The error/symptom is identical to that observed with 6.5.

I tried a number of different MPI and Intel compiler versions, all with the same problem; however, it looks like the problem is related to the parallelization and to the number and distribution of cores.

I'm still working through the scenarios. The original tests were run on 3 nodes, 16 ppn. They failed with the odd frequency values. But it looks like parallelizing the test using 24 cores (either 3:ppn=8 or 4:ppn=6) fails, but 2:ppn=12 works. However, 4:ppn=16 works so it doesn't look to be an upper bound issue.

So, it looks like there is a working version of 6.5 on the system; however, the accuracy of the frequency values depends upon the parallelization.
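Listing the reported layouts with their total core counts makes the pattern easier to see. This is a small sketch using only the layouts and outcomes mentioned in this thread; the function name `summarize_layouts` is hypothetical.

```shell
#!/bin/sh
# Layouts reported in this thread for the frequency QA tests,
# written as "nodes ppn outcome".
layouts="3 16 fail
3 8 fail
4 6 fail
2 12 pass
4 16 pass"

# Print the total core count next to each outcome.
summarize_layouts() {
    echo "$layouts" | while read -r nodes ppn outcome; do
        echo "${nodes}x${ppn} = $((nodes * ppn)) cores: $outcome"
    done
}

summarize_layouts
```

The 24-core runs both fail (3x8, 4x6) and pass (2x12), and 48 cores fail while 64 pass, so neither the total core count nor an upper bound explains the failures on its own.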

Forum Vet
Quote:Statics Sep 22nd 10:21 am
I have some more data regarding the issues on CentOS 5.x. Hopefully this table will help:

Thanks for the detailed report

I was able to reproduce the segv with ifort 14.0.2.

Forum Vet
Here is the fix for the Intel 14.0.2 SegV you reported

cd $NWCHEM_TOP/src
wget http://nwchemgit.github.io/images/Hbar.patch.gz
gzip -d Hbar.patch.gz
patch -p0 < Hbar.patch
cd tce
make FC=ifort
cd ..
make FC=ifort link
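Before applying a patch to a source tree, it can be checked with a dry run. The sketch below is a minimal example, assuming GNU patch, which provides the `--dry-run` option; the function name `apply_patch_checked` is hypothetical and not part of NWChem.

```shell
#!/bin/sh
# Hypothetical helper: apply a patch only if a dry run succeeds.
# Assumes GNU patch, which provides --dry-run.
# Arguments: $1 = patch file, applied with -p0 from the current directory.
apply_patch_checked() {
    if patch -p0 --dry-run < "$1" >/dev/null 2>&1; then
        patch -p0 < "$1"
    else
        echo "patch does not apply cleanly: $1" >&2
        return 1
    fi
}
```

Run from $NWCHEM_TOP/src, `apply_patch_checked Hbar.patch` would take the place of the plain `patch -p0 < Hbar.patch` step and leave the tree untouched if the patch does not match.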

Thanks again for the detailed and useful bug report

Clicked A Few Times
Updated CentOS 5.x summary
Thanks for the patch. Intel 2013 SP1 now passes most tests and is about 8% faster than the gcc44 version, based on the wall clock time of the many small QA tests in the md and qm-fast set.


Compiler version
- gcc 4.4: gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
- ifort 2013: ifort (IFORT) 13.0.1 20121010
- ifort 2013 w/ hondo fix: ifort (IFORT) 13.0.1 20121010
- ifort 2013 SP1: ifort (IFORT) 14.0.2 20140120
- ifort 2013 SP1 w/ tce fix: ifort (IFORT) 14.0.2 20140120

Passes most tests
[per-build marks not preserved]

Failure description
- gcc 4.4: N/A
- ifort 2013: Wildly incorrect isotropic and anisotropy values (e.g. in prop_ch3f)
- ifort 2013 w/ hondo fix: N/A
- ifort 2013 SP1: TCE jobs seg fault (e.g. tce_cr_eom_t_ch_rohf)
- ifort 2013 SP1 w/ tce fix: N/A

Segmentation fault output (ifort 2013 SP1):
========================================= Excited-state calculation ( b2 symmetry)
========================================= Dim. of EOMCC iter. space 500
2:Segmentation Violation error, status=: 11
(rank:2 hostname:fermi11 pid:10806):ARMCI DASSERT fail. ../../ga-5-3/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
1:Segmentation Violation error, status=: 11

Forum Vet
You might want to add the patch mentioned at
http://nwchemgit.github.io/Special_AWCforum/sp/id5149

Clicked A Few Times
Quote:Edoapra Sep 23rd 6:03 pm
You might want to add the patch mentioned at
http://nwchemgit.github.io/Special_AWCforum/sp/id5149


Yes, I'll look at that and the other patches available.

Thanks.

