1:15:35 AM PDT - Tue, Oct 21st 2014
Quote:Edoapra Oct 20th 12:03 pm
Quote:Mpacey Oct 20th 1:24 am
Quote:Mpacey Oct 20th 2:21 am
I've been trying to start a new thread, but pressing Submit gives the error message:
The specified URL cannot be found
As I seem to be able to post here, I thought I'd cut and paste my error report into another reply - but I got the same error message. Is there some limit on long posts?
There should not be one.
Anyhow, please post your problems here.
I'll try to split my post in half:
I’ve built a very vanilla MPI version of Nwchem-6.5 from source on our local cluster, and I ran the doqmtests.mpi script in the QA directory to check the numerical accuracy. I stopped the run during tce_hyperpolar_ccsd_small after 12+ hours of running, but I’m already seeing several failures in earlier tests (details in the next post). My build process is this:
module add openmpi/1.8.1-gcc
export NWCHEM_TOP=/usr/shared_apps/packages/src/Nwchem-6.5
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=all
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/usr/shared_apps/packages/openmpi-1.8.1-gcc/bin/mpicc
export MPI_LIB=/usr/shared_apps/packages/openmpi-1.8.1-gcc/lib
export MPI_INCLUDE=/usr/shared_apps/packages/openmpi-1.8.1-gcc/include
export LIBMPI="-lmpi_usempi -lmpi_mpifh -lmpi -lpthread"
cd $NWCHEM_TOP/src
make nwchem_config
make
The gcc version is 4.4.7, and OpenMPI 1.8.1 was built with the same compiler. The build system is a 12-core Westmere server running Scientific Linux 6.5. The tests were run with 16 cores on a 16-core Ivy Bridge system with the same OS.
I’ve included a summary of failures at the bottom, with details manually extracted from the testoutputdir (which prompts the question: have I missed an automated tool to help me here?). If I understand correctly, the test script diffs $testname.ok.out.nwparse (the gold standard?) against $testname.out.nwparse (the job output), where the nwparse filename component indicates that the file has been passed through the nwparse.pl script to extract the relevant output lines. Is that right?
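For reference, the comparison as I understand it could be approximated by hand with something like the sketch below. The function name summarise_failures is made up for illustration, the directory to scan is passed as an argument, and the only assumption about the layout is the pair of file-name patterns described above. It does a plain diff, so it may not match exactly what the QA scripts do.

```shell
#!/bin/sh
# Hypothetical helper: report, per test, whether the parsed job output
# matches the parsed reference output. Assumes files are named
# <test>.ok.out.nwparse (reference) and <test>.out.nwparse (job output).
summarise_failures() {
    dir="$1"
    for ok in "$dir"/*.ok.out.nwparse; do
        [ -e "$ok" ] || continue            # glob matched nothing
        name=$(basename "$ok" .ok.out.nwparse)
        out="$dir/$name.out.nwparse"
        if [ ! -e "$out" ]; then
            echo "MISSING $name"            # test never produced output
        elif diff -q "$ok" "$out" >/dev/null 2>&1; then
            echo "OK $name"
        else
            echo "FAILED $name"
        fi
    done
}
```

Calling it as summarise_failures /path/to/test/output/dir and filtering out the lines starting with OK would give a quick list of the tests worth inspecting with a full diff.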
Most of the errors are down in the 4th sig fig, meaning that the relative error is low, but not being a chemist (I’m a sysadmin with a comp sci background) I’m not sure how significant such differences are, nor if they’re likely to propagate to larger errors in larger models. (And in one case, the answer is wrong in the first sig fig). I'd like to understand the implications of the test failures and possibly fix them before making this application generally available to my users.
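To put a number on "low relative error", a small helper along these lines could be used on any pair of reference and computed values pulled from the diffs. It is only a sketch: rel_diff is a hypothetical name, and it falls back to the absolute difference when the reference value is zero.

```shell
# Hypothetical helper: relative difference of a computed value against a
# reference value, printed in scientific notation.
rel_diff() {
    awk -v ref="$1" -v got="$2" 'BEGIN {
        d = got - ref; if (d < 0) d = -d     # |got - ref|
        r = ref;       if (r < 0) r = -r     # |ref|
        if (r == 0) { printf "%.3e\n", d }   # fall back to absolute difference
        else        { printf "%.3e\n", d / r }
    }'
}
```

For example, rel_diff 100 101 prints 1.000e-02. A change of one unit in the 4th significant figure corresponds to a relative difference somewhere between 1e-4 and 1e-3.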
I also have a follow-on question: once I do get the numerics right I’m looking to create an optimised version (e.g. using Intel’s MKL, and optimising for a more modern architecture than the build process’ default of Nocona). I note that the build process defaults to using the GNU compiler flag -ffast-math, which can produce results that are not IEEE 754 compliant. Are the 'gold standard' outputs produced using non-IEEE 754 compliant optimisation flags? My concern is that if I’m comparing an IEEE 754 compliant optimised build to a ‘gold standard’ output known not to be IEEE 754 compliant, I’m likely to see more test failures even if the numeric results are technically more accurate. I note from the FAQ that you’re understandably hesitant to assist in individual optimised builds, but I’m wondering if you have any general advice?
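One reason this matters: among other things, -ffast-math lets the compiler reassociate floating-point operations, and floating-point addition is not associative. The sketch below is not NWChem code, just an illustration in awk (which computes in double precision); show_reassociation is a hypothetical name.

```shell
# Sum the same three numbers with two different groupings and print both
# results to 17 significant digits.
show_reassociation() {
    awk 'BEGIN {
        a = 0.1; b = 0.2; c = 0.3
        printf "%.17g\n", (a + b) + c    # left-to-right grouping
        printf "%.17g\n", a + (b + c)    # regrouped, as a compiler might
    }'
}
```

On typical IEEE 754 double-precision hardware the two printed lines differ in their last digits. Differences of that size are far smaller than a 4th-sig-fig discrepancy on their own, but they can accumulate over long iterative calculations.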
Regards,
Mike.