QA tests and verification for nwchem 6.5 and hess/vib modules


Clicked A Few Times
I've been compiling nwchem 6.5 (Nwchem-6.5.revision26243-src.2014-09-10) on a number of different platforms:

- CentOS 5.x with OpenMPI 1.6.4 and IB and gcc 4.4
- CentOS 5.x with OpenMPI 1.6.4 and IB and Intel compilers 2013 and 2013_SP1
- CentOS 6.x with MVAPICH2 1.9 and IB (Stampede at TACC) and gcc default
- CentOS 6.x with MVAPICH2 1.9 and IB (Stampede at TACC) and Intel compilers 2013

and I've noticed some failures with the following patterns:

On the CentOS 5.x platform, compiling with the Intel 2013 compilers results in errors in the properties module. Here is a diff from the QA tests:

http://pastebin.com/ivr85r5f

Most other tests completed successfully. The errors appear to be limited to the properties module tests. GCC44 and Intel 2013_SP1 work fine on this platform.

On the CentOS 6.x platform, running the QA tests on a single node with 16 processors works for all QA tests with both compilers. Running them on 3 nodes with 16 processors each results in the following frequency variations for both compilers:

http://pastebin.com/XFegynU3

It looks like there could be some bugs in the compiling/linking step for the hessian/vibration portions of the code.

Is there a way to selectively control the compiling flags for specific modules? What additional information is needed to address these issues?

Thanks.

Forum Vet
Statistics
Thank you very much for the feedback.
Could you please provide more details about your installation?
For example
Value of ARMCI_NETWORK variable
Detailed compiler version (output of ifort -V or gfortran -v)
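
A small sketch for collecting those details in one go. The function name report_build_env is made up for illustration, and it assumes ARMCI_NETWORK is exported in the same shell that was used for the build:

```shell
#!/bin/sh
# Print the build details requested above in one block.
# Assumption: ARMCI_NETWORK is set in the shell used for the build.
report_build_env() {
    echo "ARMCI_NETWORK=${ARMCI_NETWORK:-not set}"
    # Compiler version banners may be written to stderr,
    # so stderr is merged into stdout here.
    for cmd in "ifort -V" "gfortran -v"; do
        tool=${cmd%% *}
        if command -v "$tool" >/dev/null 2>&1; then
            echo "--- $cmd"
            $cmd 2>&1
        else
            echo "--- $tool not found in PATH"
        fi
    done
}

report_build_env
```

The output can be pasted into a reply as is.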

Forum Vet
Quote:Statics Sep 19th 10:02 am

Most other tests completed successfully. The errors look to be limited to the property module tests. GCC44 and Intel 2013_SP1 works file on this platform.


Do you mean: " ... GCC44 and Intel 2013_SP1 works fine on this platform.." ?
Are you stating that all the failures occur with Intel compilers 2013, while 2013_SP1 works fine?

I am not quite sure what you see with CentOS 6.x on TACC Stampede ... do you have any compiler version that seems to work?

Once again, for any case please provide a detailed compiler version.

Cheers, Edo

Clicked A Few Times
Quote:Edoapra Sep 19th 6:48 pm
Quote:Statics Sep 19th 10:02 am

Yes, sorry for the typo, I meant fine. Here are the compiler versions:

CentOS 5.x:
Working:
gcc 4.4: gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
ifort 2013 SP1: ifort (IFORT) 14.0.2 20140120

Not working:
ifort 2013: ifort (IFORT) 13.0.1 20121010

CentOS 6.x (works on a single node, fails on multiple nodes; no compiler/MPI library combination has passed the multi-node frequency QA tests):
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
ifort (IFORT) 13.1.0 20130121

Clicked A Few Times
Quote:Edoapra Sep 19th 6:39 pm
Statistics
Thank you very much for the feedback.
Could you please provide more details about your installation?
For example
Value of ARMCI_NETWORK variable
Detailed compiler version (output of ifort -V or gfortran -v)


ARMCI_NETWORK is set to OPENIB for all builds. Would it help to have the full build environment on pastebin?

Forum Vet
ARMCI_OPENIB_DEVICE=mlx4_0
By the way, when you run nwchem on stampede do you set the environmental variable
ARMCI_OPENIB_DEVICE equal to mlx4_0?

Forum Vet
Please try the following

cd $NWCHEM_TOP/src/NWints/hondo
touch hnd_giaxyz.F
make FC=ifort FOPTIMIZE="-O0 -g" FDEBUG="-O0 -g"
cd ../..
make FC=ifort link

This should fix the prop_ch3f problem
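
The same steps can be wrapped in a small helper to try on other files. This is only a sketch: the names rebuild_unoptimized, run and DRY_RUN are made up for illustration, the fallback NWCHEM_TOP path is hypothetical, and it is an untested assumption that the FOPTIMIZE/FDEBUG overrides behave the same way in directories other than NWints/hondo.

```shell
#!/bin/sh
# Recompile one source file with optimization disabled, then relink.
# Usage: rebuild_unoptimized <dir below $NWCHEM_TOP/src> <source file>
# With DRY_RUN=1 the commands are printed instead of executed.
FC=${FC:-ifort}

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rebuild_unoptimized() {
    (   # subshell, so the cd calls do not leak into the caller
        run cd "$NWCHEM_TOP/src/$1" || exit 1
        run touch "$2" || exit 1
        run make FC="$FC" FOPTIMIZE="-O0 -g" FDEBUG="-O0 -g" || exit 1
        run cd "$NWCHEM_TOP/src" || exit 1
        run make FC="$FC" link
    )
}

# Example: print the commands for the hondo fix without running them.
# The fallback path below is a placeholder, not a real install location.
NWCHEM_TOP=${NWCHEM_TOP:-$HOME/nwchem-6.5}
DRY_RUN=1
rebuild_unoptimized NWints/hondo hnd_giaxyz.F
```

Checking the printed commands first, then setting DRY_RUN=0, avoids touching the build tree by accident.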

Clicked A Few Times
Quote:Edoapra Sep 19th 11:21 pm
By the way, when you run nwchem on stampede do you set the environmental variable
ARMCI_OPENIB_DEVICE equal to mlx4_0?


Yes, I do set that variable.

Clicked A Few Times
CentOS 5.x summary
I have some more data regarding the issues on CentOS 5.x. Hopefully this table will help:

CentOS 5.x with OpenMPI 1.6.4, Nwchem-6.5.revision26243-src.2014-09-10


gcc 4.4
  Compiler version: gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
  Passes most tests: yes
  Failure description: N/A

ifort 2013
  Compiler version: ifort (IFORT) 13.0.1 20121010
  Passes most tests: no
  Failure description: Wildly incorrect isotropic and anisotropy values (e.g. in prop_ch3f)

ifort 2013 w/ hondo fix
  Compiler version: ifort (IFORT) 13.0.1 20121010
  Passes most tests: yes
  Failure description: N/A

ifort 2013 SP1
  Compiler version: ifort (IFORT) 14.0.2 20140120
  Passes most tests: no
  Failure description: TCE jobs seg fault (e.g. tce_cr_eom_t_ch_rohf)
  Segmentation fault output:
    ========================================= Excited-state calculation ( b2 symmetry)
    ========================================= Dim. of EOMCC iter. space 500
    2:Segmentation Violation error, status=: 11
    (rank:2 hostname:fermi11 pid:10806):ARMCI DASSERT fail. ../../ga-5-3/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
    1:Segmentation Violation error, status=: 11


Clicked A Few Times
CentOS 6.x summary
I also have some more information regarding NWChem on CentOS 6.x (stampede). It currently looks like the problem is related to the parallelization.

I was able to reproduce the problems using the hess_h2o QA test using the default compiled 6.3 version on the system. The error/symptom is identical to that observed with 6.5.

I tried a number of different MPI and Intel compiler versions, all with the same problem, so it looks like the problem is related to the parallelization, specifically the number and distribution of cores.

I'm still working through the scenarios. The original tests were run on 3 nodes with 16 ppn and failed with the odd frequency values. On 24 cores the test fails as 3:ppn=8 or 4:ppn=6 but works as 2:ppn=12. However, 4:ppn=16 also works, so it doesn't look like an upper-bound issue.

So, it looks like there is a working version of 6.5 on the system; however, the accuracy of the frequency values depends upon the parallelization.
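
One way to make the layout dependence concrete is to compare the frequency lists from two runs numerically. A minimal sketch, assuming the frequencies have already been extracted into plain text files with one value per line and the same number of lines in each; the function name compare_freqs, the file names in the usage comment and the default tolerance of 0.5 are all hypothetical:

```shell
#!/bin/sh
# compare_freqs REF TEST [TOL]
# Prints every line where |ref - test| > TOL.
# Exit status is 1 if any value differs by more than TOL, else 0.
compare_freqs() {
    paste "$1" "$2" | awk -v tol="${3:-0.5}" '
        {
            d = $1 - $2
            if (d < 0) d = -d
            if (d > tol) {
                printf "mode %d: ref=%s test=%s diff=%.3f\n", NR, $1, $2, d
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage (file names are placeholders):
#   compare_freqs freq_2x12.txt freq_3x8.txt 0.5
```

Running it for each node/ppn combination against the single-node result gives a quick pass/fail per layout.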

Forum Vet
Quote:Statics Sep 22nd 10:21 am

Thanks for the detailed report

I was able to reproduce the segv with ifort 14.0.2.

Forum Vet
Here is the fix for the Intel 14.0.2 SegV you reported

cd $NWCHEM_TOP/src
wget http://nwchemgit.github.io/images/Hbar.patch.gz
gzip -d Hbar.patch
patch -p0 < Hbar.patch
cd tce
make FC=ifort
cd ..
make FC=ifort link

Thanks again for the detailed and useful bug report

Clicked A Few Times
Updated CentOS 5.x summary. Thanks for the patch. Intel 2013SP1 now passes most tests and is about 8% faster than the gcc44 version based on the wall clock time of the many small QA tests in md and qm-fast set.


gcc 4.4
  Compiler version: gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
  Passes most tests: yes
  Failure description: N/A

ifort 2013
  Compiler version: ifort (IFORT) 13.0.1 20121010
  Passes most tests: no
  Failure description: Wildly incorrect isotropic and anisotropy values (e.g. in prop_ch3f)

ifort 2013 w/ hondo fix
  Compiler version: ifort (IFORT) 13.0.1 20121010
  Passes most tests: yes
  Failure description: N/A

ifort 2013 SP1
  Compiler version: ifort (IFORT) 14.0.2 20140120
  Passes most tests: no
  Failure description: TCE jobs seg fault (e.g. tce_cr_eom_t_ch_rohf)
  Segmentation fault output:
    ========================================= Excited-state calculation ( b2 symmetry)
    ========================================= Dim. of EOMCC iter. space 500
    2:Segmentation Violation error, status=: 11
    (rank:2 hostname:fermi11 pid:10806):ARMCI DASSERT fail. ../../ga-5-3/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
    1:Segmentation Violation error, status=: 11

ifort 2013 SP1 w/ tce fix
  Compiler version: ifort (IFORT) 14.0.2 20140120
  Passes most tests: yes
  Failure description: N/A

Forum Vet
You might want to add the patch mentioned at
http://nwchemgit.github.io/Special_AWCforum/sp/id5149

Clicked A Few Times
Quote:Edoapra Sep 23rd 6:03 pm
You might want to add the patch mentioned at
http://nwchemgit.github.io/Special_AWCforum/sp/id5149


Yes, I'll look at that and the other patches available.

Thanks.

Clicked A Few Times
Hello

I am trying to get the vibrational modes for the following system using DFT with NWChem 6.5, but the calculation fails. I am not sure whether the patch given on this page, or the one at http://nwchemgit.github.io/Special_AWCforum/sp/id5149, applies to my system too, and whether those are the only fixes available.


echo
title "cluster"
memory total 2000 mb #I thought segmentation fault was due to some memory issues, so increase total from 400 to 2000mb.

geometry autosym
Pd -3.78493565 1.99241247 -0.78722757
Pd -3.14879170 -0.73451269 -0.78726064
Pd -2.51264775 -3.46143787 -0.78729369
Pd -1.74142116 3.90679187 -0.78723053
Pd -1.11988907 1.19421949 -0.88090579
Pd -0.47503225 -1.56660404 -0.88041514
Pd 0.16701070 -4.27398363 -0.78732972
Pd 0.93823729 3.09424611 -0.78726656
Pd 1.59464768 0.37329742 -0.88143236
Pd 2.21052519 -2.35960422 -0.78733268
Pd 3.61789574 2.28170036 -0.78730258
Pd 4.25403969 -0.44522482 -0.78733564
Pd -2.67964416 0.81256671 1.49905647
Pd -2.04350020 -1.91435845 1.49902340
Pd -0.63612966 2.72694612 1.49905350
Pd 0.00001429 0.00002095 1.49902044
Pd 0.63615824 -2.72690422 1.49898738
Pd 2.04352879 1.91440036 1.49901748
Pd 2.67967273 -0.81252482 1.49898442
C -0.00087599 -0.00094796 -2.10286302
O -0.00115233 -0.00065097 -3.29875807
end


basis "large" cartesian
Pd library lanl2dz_ecp file /usr/local/NWChem/data/libraries/
C library 6-311G** file /usr/local/NWChem/data/libraries/
O library 6-311G** file /usr/local/NWChem/data/libraries/
H library 6-311G** file /usr/local/NWChem/data/libraries/
end

ecp
Pd library lanl2dz_ecp file /usr/local/NWChem/data/libraries/
end
set geometry:actlist 5 6 9 20 21
set "ao basis" "large"

dft
vectors input dft-freq.movecs output dft-freq-2.movecs
iterations 500
direct
mult 1
XC xpbe96 cpbe96
convergence ncyds 1000 damp 70 ncydp 100 diis 16 #default diis also gives the same error
smear 0.001
end

DRIVER
XYZ a.xyz
MAXITER 500
END

freq
temp 1 298
end
task dft freq


I am always getting the following error:

stpr_wrt_fd_from_sq: overwrite of existing file:./cluster.hess
stpr_wrt_fd_dipole: overwrite of existing file./cluster.fd_ddipole

HESSIAN: the one electron contributions are done in     134.5s

14:Segmentation Violation error, status=: 11
(rank:14 hostname:hpc-6 pid:22410):ARMCI DASSERT fail. ../../ga-5-3/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
rank 14 in job 10 hpc-6_59277 caused collective abort of all ranks
 exit status of rank 14: return code 11

Thanks for any help.
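
For a crash like this it can help to list which ranks and hosts hit the fault, for example to see whether the failures cluster on one node. A minimal sketch that pulls this out of the ARMCI DASSERT lines shown above; the function name segv_ranks and the log file name in the usage comment are hypothetical, and the pattern assumes exactly the line format seen in this output:

```shell
#!/bin/sh
# segv_ranks LOGFILE
# Lists each rank/host/pid reported on an "ARMCI DASSERT fail" line,
# one per line, with duplicates removed.
segv_ranks() {
    sed -n 's/^ *(rank:\([0-9][0-9]*\) hostname:\([^ ]*\) pid:\([0-9][0-9]*\)):ARMCI DASSERT fail.*/rank \1 on \2 (pid \3)/p' "$1" | sort -u
}

# Usage (log file name is a placeholder):
#   segv_ranks cluster.out
```

For the output above this would report rank 14 on host hpc-6.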

