QA tests failing


Just Got Here
Hi,

I recently compiled NWChem 6.3 on an AMD Opteron 6378 CPU cluster using the following build settings:

setenv LARGE_FILES TRUE
setenv USE_NOFSCHECK TRUE
setenv TCGRSH /usr/bin/ssh
setenv NWCHEM_TOP /home/berraas/nwchem-6.3
setenv NWCHEM_TARGET LINUX64
setenv NWCHEM_MODULES all
setenv USE_MPI Y
setenv USE_MPIF y
setenv MPI_LOC /opt/openmpi
setenv MPI_LIB $MPI_LOC/lib
setenv MPI_INCLUDE $MPI_LOC/include
setenv LIBMPI "-lmpi_f90 -lmpi_f77 -lmpi -lrt -lnsl -lutil -ldl -lm -Wl,--export-dynamic -lrt -lnsl -lutil"
setenv MSG_COMMS MPI
setenv BLASOPT "-L/share/apps/atlas-3.10/lib -lf77blas -lcblas -latlas -llapack"

Once the compilation finished successfully, I ran the QA tests. I tried both doqmtests.mpi and doqmtests. Most of the jobs were OK; however, the following failed:

autosym
dft_s12gh
cosmo_trichloroethene
bsse_dft_trimer
cosmo_h3co
cosmo_h3co_gp
h2o_diag_to_cg_ub3lyp
oh2
dft_cr2
dft_x
dft_ozone
hess_nh3_ub3lyp
pspw
pspw_SiC
pspw_md
paw
pspw_polarizability
pspw_stress
band
tddft_h2o_mxvc20 (NWChem execution failed)
tddft_h2o_uhf_mxvc20 (NWChem execution failed)
hi_zora_sf
o2_zora_so
qmmm_grad0 (NWChem execution failed)
lys_qmmm (NWChem execution failed)
ethane_qmmm (NWChem execution failed)
qmmm_opt0
prop_ch3f
ch3f-lc-wpbe
ch3f-lc-wpbeh
ch3radical_rot
ch3radical_unrot
cho_bp_props
prop_cg_nh3_b3lyp (NWChem execution failed)
acr-camb3lyp-cdfit
acr-camb3lyp-direct
acr_lcblyp
o2_bnl
disp_dimer_ch4
disp_dimer_ch4_cgmin
mep-test
k6h2o (NWChem execution failed)
sif_sodft
h2o_raman_3
h2o_raman_4
tropt-ch3nh2
h3_dirdyvtst
h2o_hcons
etf_hcons
cho_bp_props
(still running)...

Among the jobs that failed there are some whose execution never started; most of the others failed because of rounding errors. For instance:

1.- diff autosym.ok.out.nwparse autosym.out.nwparse gives

< Effective nuclear repulsion energy (a.u.) 4265.6221
---
> Effective nuclear repulsion energy (a.u.) 4265.6222

A closer look at autosym.out shows

Effective nuclear repulsion energy (a.u.)    4265.6222084805

while autosym.ok.out has

Effective nuclear repulsion energy (a.u.) 4265.6221237303

So what appears to be a mild rounding error in nwparse files is actually a noticeable difference of roughly 0.00008 a.u.
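
The gap between the two full-precision values is easy to confirm, e.g. in Python:

abs(4265.6222084805 - 4265.6221237303)   # about 8.5e-05, i.e. the ~0.00008 a.u. above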

2.- diff cosmo_trichloroethene.ok.out.nwparse cosmo_trichloroethene.out.nwparse
19c19
< Effective nuclear repulsion energy (a.u.) 311.2037
---
> Effective nuclear repulsion energy (a.u.) 311.2042
29c29
< C 0.0000 -0.0002 0.0000
---
> C -0.0001 -0.0002 0.0000
35,36c35,36
< Effective nuclear repulsion energy (a.u.) 311.2037
< Total DFT energy = -1457.33971
---
> Effective nuclear repulsion energy (a.u.) 311.2042
> Total DFT energy = -1457.33915

So here the errors are roughly an order of magnitude larger than in the previous case: ~0.00055.

3.-diff bsse_dft_trimer.ok.out.nwparse bsse_dft_trimer.out.nwparse
23c23
< Frequency 162 229 354 486 616 718
---
> Frequency 162 229 354 486 616 717
25c25
< The Zero-Point Energy (Kcal/mol) = 15.79786
---
> The Zero-Point Energy (Kcal/mol) = 15.79677
27c27
< P.Frequency 162 230 354 485 616 717
---
> P.Frequency 162 229 354 485 616 717

Going through all the failed jobs, the differences are similar to these three examples. Should I be concerned about this?

thanks,

Gets Around
I can't tell you whether or not the differences you observe are large enough to be considered "out of spec". I can tell you that the QM test scripts are kind of a mess. They include a bunch of jobs that are not considered reliable enough to run nightly, and the failure criteria are over-sensitive: you spend a lot of time wading through 6th-decimal-place differences to find the significant ones.

I wrote my own script to streamline the post-build QA process. It creates test-execution scripts for you, using only the subset of tests considered reliable enough for nightlies, then shows test results sorted by severity of deviation from references. It also allows you to control the time cost of tests you are willing to run, e.g. "no expense greater than 5000 core-seconds according to the reference output." A "core-second" is a convenient mishmash of units: it is the reference output job completion wall clock time in seconds multiplied by the number of processors used.
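
For concreteness, the cost bookkeeping is just this (a tiny Python sketch; the variable names are mine, not taken from qacheck.py):

# "core-seconds" for one reference job (illustrative numbers)
wall_clock_seconds = 1250.0        # wall clock time reported in the reference output
processors = 8                     # number of processors the reference job used
core_seconds = wall_clock_seconds * processors   # 10000 core-seconds
# this test would be kept by --cost 10000 but skipped by --cost 5000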

Here's the code: https://github.com/mattbernst/nwflask/blob/master/chemnw/qacheck.py

Suppose you download qacheck.py to your home directory and you have built NWChem under /opt/science/nwchem/Nwchem-dev.revision25890-src.2014-07-18 on a LINUX64 platform.

Then you would do this to run and check all tests that are stable enough for nightly use, allowing for an execution cost of up to 10000 core-seconds per test:

cd ~
cp -r /opt/science/nwchem/Nwchem-dev.revision25890-src.2014-07-18/QA .
cd QA
python ~/qacheck.py --top /opt/science/nwchem/Nwchem-dev.revision25890-src.2014-07-18 --cost 10000 --target LINUX64 --test-root .


That generates two scripts in the working directory, runmpi and runserial. I would try runmpi first and only drop back to runserial if there are unusual problems; I have never personally had serial execution work any better than parallel, at least for the jobs that make it into the nightly QA. The generated scripts assume that your just-built nwchem can be found in your PATH. If that is not the case, edit them and replace

setenv NWCHEM_EXECUTABLE `which nwchem`

with
setenv NWCHEM_EXECUTABLE /path/to/your/binary/nwchem


Here's how you would run the tests selected above with 4 cores and save the output to a log file:
./runmpi 4 | tee mpi.log


You'll wait a while. The tests are sorted from lowest to highest estimated cost, so they get slower as the script runs. You can look inside the script to see the estimated cost as a comment next to each test. Many tests will self-report failure, but most of those failures are trivial, as you'll see when you run the analysis phase. To analyze the generated mpi.log, do this:

python ~/qacheck.py -l mpi.log

Gets Around
(I had to split my post in two because otherwise the forum gave an error)

You will see a bunch of output like this, with the most severe failures at the bottom:

[{'basic_status': 'failed',
  'name': 'tce_mrcc_bwcc_subgroups',
  'reference': '/home/niels/QA/testoutputs/tce_mrcc_bwcc_subgroups.ok.out.nwparse',
  'score': (0, 3.000053538926295e-10),
  'trial': '/home/niels/QA/testoutputs/tce_mrcc_bwcc_subgroups.out.nwparse'},
 {'basic_status': 'failed',
  'name': 'pspw_md',
  'reference': '/home/niels/QA/testoutputs/pspw_md.ok.out.nwparse',
  'score': (0, 1.9999999999242846e-05),
  'trial': '/home/niels/QA/testoutputs/pspw_md.out.nwparse'},
 {'basic_status': 'failed',
  'name': 'sadsmall',
  'reference': '/home/niels/QA/testoutputs/sadsmall.ok.out.nwparse',
  'score': (0, 0.00010000000000001674),
  'trial': '/home/niels/QA/testoutputs/sadsmall.out.nwparse'},
 {'basic_status': 'failed',
  'name': 'autosym',
  'reference': '/home/niels/QA/testoutputs/autosym.ok.out.nwparse',
  'score': (0, 0.00010000000020227162),
  'trial': '/home/niels/QA/testoutputs/autosym.out.nwparse'},
 {'basic_status': 'failed',
  'name': 'ch3radical_rot',
  'reference': '/home/niels/QA/testoutputs/ch3radical_rot.ok.out.nwparse',
  'score': (0, 0.0009999999999763531),
  'trial': '/home/niels/QA/testoutputs/ch3radical_rot.out.nwparse'},
 {'basic_status': 'failed',
  'name': 'ch3radical_unrot',
  'reference': '/home/niels/QA/testoutputs/ch3radical_unrot.ok.out.nwparse',
  'score': (0, 0.0009999999999763531),
  'trial': '/home/niels/QA/testoutputs/ch3radical_unrot.out.nwparse'},
 {'basic_status': 'failed',
  'name': 'prop_ch3f',
  'reference': '/home/niels/QA/testoutputs/prop_ch3f.ok.out.nwparse',
  'score': (0, 0.0009999999999976694),
  'trial': '/home/niels/QA/testoutputs/prop_ch3f.out.nwparse'},
 {'basic_status': 'failed',
  'name': 'h2o-response',
  'reference': '/home/niels/QA/testoutputs/h2o-response.ok.out.nwparse',
  'score': (0, 0.01200000000000001),
  'trial': '/home/niels/QA/testoutputs/h2o-response.out.nwparse'}]
Total 140 passed 132 failed 8


The score for each failed test is a tuple:

(num_gross_failures, total_numeric_deviation)

Gross failures are rarer and bear more scrutiny. They indicate that e.g. reference and trial outputs had different numbers of numeric values in a line of output, or that an output file is entirely missing some section that belongs in the .nwparse. The numeric deviation part of the score is just a sum of all absolute numeric differences between the reference and trial nwparse files.
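
If it helps to have a mental model, the scoring amounts to something like the following sketch (an illustration of the idea only, not the actual qacheck.py code):

import re

def deviation_score(reference_lines, trial_lines):
    # Count lines whose numeric values cannot be paired up (gross failures)
    # and accumulate the absolute differences of the values that can.
    gross_failures = 0
    total_deviation = 0.0
    numbers = re.compile(r'-?\d+(?:\.\d+)?')
    for ref_line, trial_line in zip(reference_lines, trial_lines):
        ref_values = [float(v) for v in numbers.findall(ref_line)]
        trial_values = [float(v) for v in numbers.findall(trial_line)]
        if len(ref_values) != len(trial_values):
            gross_failures += 1
            continue
        total_deviation += sum(abs(r - t) for r, t in zip(ref_values, trial_values))
    return (gross_failures, total_deviation)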

Run 'diff' on the trial and reference files to see how bad the problem really is, e.g.:

niels@bohr:~/QA$ diff /home/niels/QA/testoutputs/prop_ch3f.ok.out.nwparse /home/niels/QA/testoutputs/prop_ch3f.out.nwparse
169c169
< anisotropy = 37.128
---
> anisotropy = 37.129


I don't think I am going to worry about such a minor difference.

This appears more serious:

niels@bohr:~/QA$ diff /home/niels/QA/testoutputs/h2o-response.ok.out.nwparse /home/niels/QA/testoutputs/h2o-response.out.nwparse
4c4
< Anisotropic = 2.693
---
> Anisotropic = 2.705


That looks significant to me. But the reference file was generated with NWChem 6.1, which is rather an old release. Has the code changed since 6.1 so that the reference value needs updating? Is there excessive numerical error in my result, due to how the code was built? Answering this latter question is particularly difficult now that there are no official binary builds to check against.

Just Got Here
Thanks for your reply, Mernst! I really appreciate it.

I copied your script to my home directory and followed your instructions to run it; however, I am getting the following error:

python ~/qacheck.py --top ~/nwchem-6.3 --target LINUX64 --cost 10000 --test-root .
File "/home/berraas/qacheck.py", line 154
tests[path[-1]] = {entry}
^
SyntaxError: invalid syntax

My version of Python is 2.6.6. Also, I noticed your script searches for "doNightly*", but I have no such files under the QA folder. Could that be the problem?

thanks again

Gets Around
Hi Eberrios,

I changed uses of
{foo}
to
set(foo)
for compatibility with Python 2.6. You will need to pull the code again. You will also need to manually install the argparse module to run this code on 2.6: http://stackoverflow.com/questions/15330175/how-can-i-get-argparse-in-python-2-6
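
For illustration, the incompatible construct is the set-literal syntax, which only exists in Python 2.7 and later (the actual change in the repository may be spelled slightly differently):

# Python 2.7+ set literal, as in the original line 154:
tests[path[-1]] = {entry}
# Python 2.6-compatible spelling; set() takes an iterable, so wrap the single element:
tests[path[-1]] = set([entry])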

I don't know if there are other missing modules or features that may prevent the code from running under 2.6. I only have 2.7 and 3.4 available to me.

Your NWChem 6.3 seems to be rather old. The doNightlyTests.mpi script under QA was added in July 2013. You will need a version that includes that script for my qacheck program to work.

