Hardware recommendation


Quote: Extremis, May 22nd 12:27 pm
I am building a mini cluster for ab initio simulations (mainly NWChem).

1) For large simulations, huge scratch files are created and even swap is used. So, is there any real performance gain in putting two SSDs in RAID 0? Or should I go with one fast SSD of double the capacity and save some money (and also worry less about a failure)? In other words, does a 6-core CPU (5820K) read/write data faster than an SSD can provide? Are those reads/writes usually sequential or random?

2) Is Gigabit Ethernet generally fast enough for running a simulation in parallel on 4-8 nodes (each node will have a 5820K processor)? Is there any point in purchasing additional GbE cards for each node and using them together (with NIC bonding) alongside the motherboard's onboard GbE?

3) Should I put local disks on each node, or build a stateless cluster with fast SSDs only on the master node?

While I am aware that these questions are not easily answered in general, I am specifically interested in NWChem performance, especially for CC (CCSD(T), CREOMCCSD(T), IPCCSD) and RT-TDDFT calculations.


Disclaimer: I work for Intel Corporation and thus have an obvious conflict of interest with respect to processor choice, so I will not comment on that. Note also that Intel has multiple interconnect products, including True Scale InfiniBand, so I will refer to InfiniBand generically, without specifying an implementer. Finally, Intel also makes SSDs, but my comments on that topic are in no way implementation-specific.

First, know that PNNL builds a supercomputer to run NWChem effectively (among other goals, obviously, but NWChem is a big one). They chose the configuration described here: https://www.emsl.pnl.gov/emslweb/instruments/computing-cascade-atipa-1440-intel-xeon-phi-n.... I had nothing to do with the design or procurement of this machine.

Regarding your specific questions:

1) The TCE is designed to run in memory, not on disk. You can find many discussions of this issue elsewhere on this site. The modules where fast disks help are semidirect MP2 and semidirect CCSD(T), the latter being the non-TCE implementation. A minimal input sketch illustrating this follows below.
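
For concreteness, here is a rough TCE CCSD(T) input sketch. The memory numbers, the scratch path, and the water/cc-pVDZ test case are placeholders of my own, not recommendations; size the memory line to your nodes.

    start tce_test
    # Hypothetical sketch: TCE data lives in Global Arrays, i.e. in RAM,
    # so the "global" memory allocation is the one that matters for TCE.
    memory stack 1500 mb heap 200 mb global 12000 mb
    # scratch_dir is mainly exercised by the semidirect (non-TCE) modules
    scratch_dir /scratch
    geometry units angstrom
      O  0.000  0.000  0.000
      H  0.757  0.586  0.000
      H -0.757  0.586  0.000
    end
    basis
      * library cc-pvdz
    end
    tce
      ccsd(t)
    end
    task tce energy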

2) Gigabit Ethernet is not a good interconnect for Global Arrays. It is almost certainly the worst possible choice you could make in this respect. A network with one-sided capability, e.g. InfiniBand, is appropriate for Global Arrays. If you try to run NWChem with Gigabit Ethernet, the scaling will be poor, particularly for TCE jobs.
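
The interconnect choice also shows up at build time, because the Global Arrays communication layer is selected via the ARMCI_NETWORK variable. This is only a sketch of the relevant environment, with a placeholder source path; check the build documentation for your NWChem release for the exact options it supports.

    # Assumed source location; adjust to wherever you unpacked NWChem
    export NWCHEM_TOP=/path/to/nwchem
    export NWCHEM_TARGET=LINUX64
    export USE_MPI=y
    # One-sided-capable network (InfiniBand verbs):
    export ARMCI_NETWORK=OPENIB
    # Alternatives: MPI-PR (portable, runs over any MPI) or SOCKETS
    # (plain TCP over GbE, where you should expect poor GA scaling).
    cd $NWCHEM_TOP/src
    make nwchem_config NWCHEM_MODULES=all
    make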

3) Because of the answer to question 1, you should be fine with a diskless cluster. I do almost all of my production work with NWChem on the Cray XC30 at NERSC ("Edison"), which is effectively diskless (there are disks, but NERSC strongly discourages their use for scratch).
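
If you do go diskless, the relevant input directives are scratch_dir and permanent_dir. The paths below are assumptions on my part (a RAM-backed tmpfs for node-local scratch and a shared filesystem for restart files), purely to show the idea.

    # Node-local, RAM-backed scratch (assumed tmpfs mount)
    scratch_dir /dev/shm/nwchem
    # Shared filesystem for restart/movecs files
    permanent_dir /home/user/project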