NWChem over Gigabit Ethernet


  • Guest -
I use NWChem on two nodes, which are connected to each other over 4 bonded gigabit Ethernet ports. Bandwidth tests with iperf showed a usable bandwidth of 2.33 Gbit/s. Now I tried to start a job distributed across both nodes. During the SCF the scaling is very good. However, during the gradient evaluation the CPU usage drops to 30-50% on the second node and to 80-90% on the first node. The bandwidth usage never exceeds approx. 15% of the available 2.33 Gbit/s. So I would like to know whether it is possible to improve the bandwidth usage and the scaling performance.

Hardware:
node 1: AMD Phenom II 1090T (6 x 3.51 GHz), 8 GB RAM
node 2: AMD Phenom II 965 (4 x 3.4 GHz), 4 GB RAM

Software:
openSUSE 11.4 (Linux kernel 2.6.37)
NWChem (Apr 15, 2011 build)
/proc/sys/net/ipv4/tcp_low_latency set to 1
MTU set to 7200, which is the network driver's maximum
OpenMPI 1.4.3

Thanks

Forum Vet
I have not seen any data on the scaling performance, i.e. faster runs with more processors.

Are the nodes fully packed, i.e. are all cores in use? The nodes have different speeds, which will affect memory access and network access. They are probably different boards; do they even have the same memory bandwidth? The nodes also have different amounts of memory. Are you taxing node 2 more (some swapping)?
If there are bandwidth differences between the two nodes, then one of them could be waiting more than the other.

NWChem uses disk too; are the disk systems the same?

As to your questions:

Improving bandwidth usage: it might be that your molecule and data distribution simply don't need more. Much of the communication cost is actually driven by latency, as the messages are not too large.
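To put rough numbers on that (an estimate, not a measurement: it assumes a round-trip latency of about 0.1 ms, which is typical for gigabit Ethernet), the cost of moving a message of size n is approximately

    T(n) ≈ latency + n / bandwidth
    T(10 KB) ≈ 0.1 ms + 80,000 bits / 2.33 Gbit/s ≈ 0.1 ms + 0.034 ms

so for a 10 KB message roughly three quarters of the transfer time is latency, and extra bandwidth would barely help.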

Scaling performance: I don't see any data, so I can't comment.

Bert



  • Guest -
Hello and thanks for your reply.

I run 6 processes on node 1 and 4 processes on node 2. Both nodes use DDR3 memory with the same bandwidth. I never observed any swapping. Furthermore, I use the "direct" directive in all runs, so the integrals are recomputed on the fly instead of being written to and read from disk; the disk speed therefore cannot be limiting.

As mentioned, the scaling during the SCF is very good. I ran one job on node 1 alone and the same job on both nodes. With both nodes the wall clock time decreased by a factor of 1.6. Since both nodes together have (theoretically) 34.66 GHz, which is 1.65 times more than node 1 alone, this is almost perfect. However, during an optimization the total wall clock time decreases only by a factor of 1.3. I assume that this is due to the gradient evaluation and the drop in CPU usage that occurs there.
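A back-of-the-envelope Amdahl's-law estimate from those numbers (it assumes the non-scaling part gains nothing at all from the second node): if a fraction f of the run speeds up by the ideal factor of 1.65 and the rest not at all, then

    1 / ((1 - f) + f / 1.65) = 1.3  =>  f ≈ 0.59

i.e. roughly 40% of the optimization wall time is spent in the part that does not scale, consistent with the gradient evaluation being the bottleneck.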

The problem also exists if I run only 1 process per node, so maybe latency is the limiting factor. A ping request usually takes about 0.08 ms from one node to the other. Do you have any experience with whether this is too long, or how else I might check the latency? And do you have any ideas on how to decrease the impact of high latency?
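One way to measure the latency the MPI layer actually sees (rather than the ICMP ping time) is a small ping-pong between one rank on each node; NetPIPE or the OSU micro-benchmarks do the same more thoroughly. Below is a minimal sketch, assuming OpenMPI's mpicc wrapper; the hostnames in the launch command are placeholders. Compile with "mpicc pingpong.c -o pingpong" and run with "mpirun -np 2 -host node1,node2 ./pingpong":

    /* pingpong.c - round-trip time and bandwidth between two MPI ranks */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 1000;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* message sizes from 8 B (latency-dominated) to 512 KB (bandwidth-dominated) */
        for (int n = 8; n <= 512 * 1024; n *= 8) {
            char *buf = malloc(n);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < iters; i++) {
                if (rank == 0) {   /* rank 0 sends first and waits for the echo */
                    MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                } else {           /* rank 1 echoes everything back */
                    MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double rtt = (MPI_Wtime() - t0) / iters;  /* average round-trip time */
            if (rank == 0)
                printf("%7d bytes: %8.1f us RTT, %8.2f MB/s one-way\n",
                       n, rtt * 1e6, 2.0 * n / rtt / 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }

Half of the 8-byte round-trip time is the effective one-way MPI latency. For what it is worth, a ping of 0.08 ms is quite normal for gigabit Ethernet, so there is probably little latency headroom left in this hardware; lower latency generally means a different interconnect rather than more tuning.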

Thanks.

Forum Vet
There is a lot more communication happening in the gradient evaluation, so that will be an important factor. I don't know how to reduce latency; I'm not a hardware/computer scientist, so you may want to ask one in your department. It's GigE; there are better networks available. You are also not running the same number of processes on each node. Depending on how the data is laid out in memory (i.e. distributed over the nodes), you may be creating an asymmetric communication pattern (one node needing to do a lot more than the other).

Bert

