Hardware recommendation


Clicked A Few Times
I am building a mini cluster for ab initio simulations (mainly NWChem).

1) For large simulations, huge scratch files are created and even the swap file gets used. So, is there any real performance gain in putting two SSDs in RAID 0? Or should I go with one fast SSD of double the capacity and save some money (and also worry less about a failure)? In other words, does a six-core CPU (5820K) read/write data faster than an SSD can provide? Are those reads/writes usually sequential or random?

2) Is Gigabit Ethernet generally fast enough for running a simulation in parallel on 4-8 nodes (each node will have a 5820K processor)? Is there any point in purchasing additional GbE cards for each node and bonding them (NIC bonding) with the motherboard's onboard GbE?

3) Should I put local disks in each node, or build a stateless cluster with fast SSDs only on the master node?

While I am aware that these questions are hard to answer in general, I am specifically interested in NWChem performance, especially CC (CCSD(T), CREOMCCSD(T), IPCCSD) and RT-TDDFT calculations.


Thanks in advance for any advice,

Kostas

Gets Around
I think that the coupled-cluster methods you are interested in will require a lot of RAM. Have you managed to get them working with disk-based IO schemes? I couldn't, and I was told that the disk-based IO schemes should be considered unsupported: http://nwchemgit.github.io/Special_AWCforum/st/id1360/Floating_Point_Exception_usi...

If you are going to frequently run coupled cluster jobs, I would skip spending money on fast disks for each node and shift the money to RAM. For TDDFT I don't know.

I would guess that NWChem file access patterns are largely sequential for big files, but there is no need to guess. You can run a trial job under strace and look at the file access calls to see the access patterns.
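
If you want to turn that trace into numbers, a small script can tally the calls. This is only a minimal sketch, not anything NWChem-specific: the log name nwchem.strace and the trace flags are placeholder assumptions, and it simply counts calls and bytes. Many lseek calls relative to read/write hint at random access; few hint at mostly sequential streaming.

#!/usr/bin/env python3
"""Summarize read/write/lseek calls from an strace log.

Assumes (for illustration) the trial job was traced with something like
  strace -f -e trace=read,write,lseek -o nwchem.strace <job command>
"""
import re
import sys
from collections import Counter

# strace prints lines such as:  read(17, "..."..., 65536) = 65536
CALL_RE = re.compile(r'(read|write|lseek)\((\d+),.*\)\s*=\s*(-?\d+)')

def summarize(path):
    counts, nbytes = Counter(), Counter()
    with open(path) as fh:
        for line in fh:
            m = CALL_RE.search(line)
            if not m:
                continue
            call, ret = m.group(1), int(m.group(3))
            counts[call] += 1
            if call in ("read", "write") and ret > 0:
                nbytes[call] += ret
    return counts, nbytes

if __name__ == "__main__":
    counts, nbytes = summarize(sys.argv[1])
    for call in ("read", "write", "lseek"):
        print(f"{call:6s}: {counts[call]:10d} calls, {nbytes[call] / 2**30:8.2f} GiB")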

If you are an academic you should also get Intel's compilers and MKL and use them to build NWChem. They are free for academics and will give you a modest performance boost over the GNU tools. That should help you get the most performance per dollar out of your hardware budget.

Clicked A Few Times
Thank you, Mernst, for the strace tip; I will definitely try it.

I have not tried any disk-based IO scheme; as far as I understand, I am using the default GA algorithm. Yet large simulations produce huge files. I know storage media can never substitute for RAM, and I am planning to get as much memory as fits in the motherboard. I was thinking that fast disk access might still be somewhat useful, but I may be wrong.

As for the Intel compilers, are you sure they are free for academics? Their website lists Parallel Studio for C/C++ as an offer for students, but no Fortran compiler is mentioned.

Gets Around
If your jobs lead to large files then fast disks may well be worthwhile. I would say profile first: see how much data is cumulatively written/read during a trial job, and see how much more wall-clock time than CPU time your job takes. The surplus wall-clock time is mostly down to waiting on IO. If you have slow disks then your disk-heavy jobs run slower, but if you don't have enough RAM, larger coupled-cluster jobs can't run at all. In theory, direct methods that just recompute integrals on the fly are faster than disk-caching methods if you have a high enough ratio of CPU power to disk speed. In practice I have never seen direct methods run faster, even when I was using slow spinning disks in laptops.
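
One hedged way to put numbers on that wall-clock-versus-CPU-time comparison is sketched below. The mpirun command line is a placeholder, not a recommended invocation, and on a multi-node run it only accounts for the ranks started on the local machine.

#!/usr/bin/env python3
"""Run a trial job and compare wall-clock time with accumulated CPU time."""
import resource
import subprocess
import time

cmd = ["mpirun", "-np", "12", "nwchem", "trial.nw"]  # placeholder trial job
cores = 12                                           # cores actually used

t0 = time.monotonic()
subprocess.run(cmd, check=True)
wall = time.monotonic() - t0

# CPU time of all waited-for children (for multi-node jobs this only
# covers the local ranks).
usage = resource.getrusage(resource.RUSAGE_CHILDREN)
cpu = usage.ru_utime + usage.ru_stime

print(f"wall clock           : {wall:10.1f} s")
print(f"CPU time (all ranks) : {cpu:10.1f} s")
# If cpu is much less than cores * wall, the surplus wall-clock time
# was mostly spent waiting on IO or the network.
print(f"approx. waiting share: {max(0.0, 1 - cpu / (cores * wall)):.0%}")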

Intel Parallel Studio includes a Fortran compiler despite the misleading way it is named on some of their pages, so if you qualify for that offer you should be fine. It has been some years since I looked at Intel's free software; I had thought that the academic, open-source, and student offerings were basically the same last I looked, but the academic one currently appears more restricted than the others. For such a small compute cluster, if you can't get a free license, I don't think that buying Intel's software will provide a bigger speed boost than spending the same money on hardware.

Gets Around
Quote:Extremis May 22nd 12:27 pm
I am building a mini cluster for ab initio simulations (mainly NWChem).

1) For large simulations, huge scratch files are created and even the swap file gets used. So, is there any real performance gain in putting two SSDs in RAID 0? Or should I go with one fast SSD of double the capacity and save some money (and also worry less about a failure)? In other words, does a six-core CPU (5820K) read/write data faster than an SSD can provide? Are those reads/writes usually sequential or random?

2) Is Gigabit Ethernet generally fast enough for running a simulation in parallel on 4-8 nodes (each node will have a 5820K processor)? Is there any point in purchasing additional GbE cards for each node and bonding them (NIC bonding) with the motherboard's onboard GbE?

3) Should I put local disks in each node, or build a stateless cluster with fast SSDs only on the master node?

While I am aware that these questions are hard to answer in general, I am specifically interested in NWChem performance, especially CC (CCSD(T), CREOMCCSD(T), IPCCSD) and RT-TDDFT calculations.


Disclaimer: I work for Intel Corporation and thus I have an obvious conflict of interest with respect to processor choice, hence will not comment on that. Note also that Intel has multiple interconnect products, including True Scale InfiniBand, so I will refer to InfiniBand generically, without specifying an implementer. Finally, Intel also makes SSDs, but my comments on that topic are in no way implementation-specific.

First, know that PNNL builds a supercomputer to run NWChem effectively (among other goals, obviously, but NWChem is a big one). They chose the configuration described here: https://www.emsl.pnl.gov/emslweb/instruments/computing-cascade-atipa-1440-intel-xeon-phi-n.... I had nothing to do with the design or procurement of this machine.

Regarding your specific questions:

1) The TCE is designed to run in memory, not from disk. You can find many references to this issue on this site. The modules where fast disks help are semidirect MP2 and semidirect CCSD(T), the latter being the non-TCE implementation. (A back-of-the-envelope memory estimate is sketched after these answers.)

2) Gigabit Ethernet is not a good interconnect for Global Arrays. It is almost certainly the worst possible choice you could make in this respect. A network with one-sided capability, e.g. InfiniBand, is appropriate for Global Arrays. If you try to run NWChem with Gigabit Ethernet, the scaling will be poor, particularly for TCE jobs. (The sketch after these answers also compares rough transfer times over GbE and InfiniBand.)

3) Because of the answer given in response 1, you should be fine with a diskless cluster. I do almost all of my production work with NWChem on the Cray XC30 at NERSC ("Edison"), which is effectively diskless (there are disks but NERSC strongly discourages their use for scratch).
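
To make points 1 and 2 a little more concrete, here is a back-of-the-envelope sketch. The numbers are my own assumptions, not from this thread: no point-group symmetry, 8-byte reals, the example orbital counts, and assumed peak bandwidths of roughly 0.125 GB/s for GbE and 6 GB/s for FDR InfiniBand. It estimates the aggregate memory an in-core TCE CCSD(T) job needs and how long a single pass over that data would take on each network; the real killer on GbE is also the latency of the many small one-sided get/put operations Global Arrays issues, which this crude model ignores.

#!/usr/bin/env python3
"""Back-of-the-envelope CCSD(T) memory footprint and data-movement times.

Illustrative assumptions only: no point-group symmetry, 8-byte reals,
MO two-electron integrals ~ N^4 words, T2 amplitudes plus a residual
of the same size ~ 2 * o^2 * v^2 words.
"""

def ccsd_t_memory_gb(n_occ, n_virt, bytes_per_word=8):
    n = n_occ + n_virt
    integrals = n**4                   # MO integrals held in global arrays
    t2 = 2 * (n_occ**2) * (n_virt**2)  # doubles amplitudes + residual
    return (integrals + t2) * bytes_per_word / 1e9

def transfer_time_s(gigabytes, bandwidth_gb_per_s, latency_us, n_messages=1):
    return gigabytes / bandwidth_gb_per_s + n_messages * latency_us * 1e-6

if __name__ == "__main__":
    gb = ccsd_t_memory_gb(n_occ=20, n_virt=180)  # e.g. 20 correlated occupied, 180 virtual orbitals
    print(f"aggregate memory needed : ~{gb:.0f} GB (spread over all nodes)")
    # Assumed peak numbers: GbE ~0.125 GB/s, ~50 us latency;
    # FDR InfiniBand ~6 GB/s, ~2 us latency.
    print(f"one pass over that data on GbE        : ~{transfer_time_s(gb, 0.125, 50):.0f} s")
    print(f"one pass over that data on InfiniBand : ~{transfer_time_s(gb, 6.0, 2):.0f} s")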

