Running NWChem on 2 nodes takes more time than a single node


Clicked A Few Times
I am running Ubuntu 14.04.3 server with NWChem (6.3) installed from the repositories. Each node has a 6-core CPU, 64 GB RAM, 1 GbE and an SSD. Wall time for a test simulation is ~50% longer on 2 nodes than on a single node; CPU times are almost the same.

I didn't expect great scaling from a GbE network, but performance is rather disappointing even on just 2 nodes. I measured (with nload) a peak rate of 160 Mbit/s, which is way below the 1000 Mbit/s limit of the onboard card, so throughput does not seem to be the problem. I also noticed that some cores are not running at 100% all the time (especially on the second node). Is there any chance the limiting factor is the high latency of Ethernet networks? What could I do to run NWChem on a slow Ethernet network?


Thanks in advance,
Kostas

Gets Around
Is it a sufficiently demanding calculation that you would expect it to make efficient use of the assigned hardware resources? Really small calculations will show poor or even negative scaling at 12 cores even when all cores are on the same motherboard.

Try using tcpdump to see how many bytes and packets are transferred during your test run. If there are a lot of smallish messages, I think you are fundamentally limited by latency.

Apart from your scaling woes, you may wish to install version 6.6 from source. There have been a lot of bug fixes and enhancements since 6.3. The Ubuntu package won't be linked with a high performance BLAS either.

Forum Vet
Quote:Extremis Nov 5th 10:32 am
Is there any chance the limiting factor is the high latency of Ethernet networks? What could I do to run NWChem on a slow Ethernet network?


Thanks in advance,
Kostas


Yes, network latency can be a limiting factor for several NWChem modules.
We have done a bit of work trying to overcome this issue in NWChem 6.6.
One important step in addressing this performance issue is to adopt MPI-PR as the ARMCI_NETWORK setting.

My suggestion is to download NWChem 6.6 and compile it with ARMCI_NETWORK=MPI-PR.

Clicked A Few Times
Thank you both for your suggestions!

I measured ~20000 packets/sec exchanged between the 2 nodes during program execution, which I guess is too many; ping reports ~0.2 ms for small packets. I also tried a more demanding (Coupled Cluster) calculation; it didn't provide any speedup either, but this time the limiting factor was throughput rather than latency, as transfer rates were constantly >950 Mbit/s.
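A rough back-of-the-envelope check of the figures above, as a sketch only. It assumes the 160 Mbit/s peak and the ~20000 packets/sec were observed on the same test run, and that a process waits one full ping round trip per small message; neither assumption is confirmed in this thread.

```python
# Figures reported in this thread; treat them as rough measurements.
packets_per_sec = 20_000   # packets/s exchanged between the two nodes
peak_rate_mbit = 160       # Mbit/s, peak rate seen with nload
ping_rtt_ms = 0.2          # ms, ping round trip for small packets

# Average payload per packet if both figures describe the same run (assumption).
avg_packet_bytes = peak_rate_mbit * 1e6 / 8 / packets_per_sec

# Upper bound on strictly serial request/reply exchanges for one process
# that waits a full round trip each time (assumption).
max_serial_round_trips = 1000 / ping_rtt_ms

print(f"average packet size  ~ {avg_packet_bytes:.0f} bytes")
print(f"serial round trips/s <= {max_serial_round_trips:.0f} per process")
```

Under those assumptions the packets average about 1000 bytes and a single waiting process can complete at most about 5000 round trips per second, which is consistent with many smallish messages and idle cores rather than a saturated link.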

I have downloaded and compiled NWChem 6.6 with ARMCI_NETWORK=MPI-PR. When I run it (even with a single task) I always get the following error:

[0] Received an Error in Communication: (1) there must be at least two ranks per node
application called MPI_Abort(comm=0x84000000, 1) - process 0


Perhaps I should open a new thread under Compiling NWChem.


Thanks again,
Kostas

Forum Vet
Quote:Extremis Nov 9th 12:16 pm

I have downloaded and compiled NWChem 6.6 with ARMCI_NETWORK=MPI-PR. When I run it (even with a single task) I always get the following error:

[0] Received an Error in Communication: (1) there must be at least two ranks per node
application called MPI_Abort(comm=0x84000000, 1) - process 0



The error message is telling you that you need to use at least two processes per node, e.g. on a single node:

mpirun -np 2 ...

On two nodes:

mpirun -np 4 ...
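The rank counts above can be sketched as a small calculation. As far as I know, MPI-PR sets aside one rank on each node for communication progress, so only the remaining ranks do the actual computation. The helper name mpi_pr_layout is made up for this illustration and is not part of NWChem or MPI.

```python
def mpi_pr_layout(nodes: int, ranks_per_node: int) -> dict:
    """Hypothetical helper: total ranks to pass to `mpirun -np` and how many
    of them compute, assuming one progress rank per node under MPI-PR."""
    if ranks_per_node < 2:
        # Mirrors the runtime error quoted in this thread.
        raise ValueError("there must be at least two ranks per node")
    return {
        "np": nodes * ranks_per_node,             # value for mpirun -np
        "compute": nodes * (ranks_per_node - 1),  # ranks left for computing
    }

print(mpi_pr_layout(1, 2))  # {'np': 2, 'compute': 1}
print(mpi_pr_layout(2, 2))  # {'np': 4, 'compute': 2}
print(mpi_pr_layout(2, 6))  # {'np': 12, 'compute': 10}
```

With the 6-core nodes described at the top of the thread, launching 6 ranks per node would leave 5 per node for computation under this assumption.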

Clicked A Few Times
I have already tried that with mpirun, mpiexec and srun, but the problem still persists.

