5:35:33 AM PDT - Fri, Jul 15th 2011
Hello and thanks for your reply.
I run 6 processes on node 1 and 4 processes on node 2. Both nodes use DDR3 memory with the same bandwidth. I have never observed any swapping. Furthermore, I use the "direct" directive in all runs, so I think disk speed cannot be the limiting factor.
As mentioned, scaling during the SCF is very good. I ran one job on node 1 alone and the same job on both nodes. On both nodes the wall clock time decreased by a factor of 1.6. Since both nodes together have a (theoretical) 34.66 GHz, which is 1.65 times that of node 1 alone, this is almost perfect. However, during an optimization the total wall clock time decreases only by a factor of 1.3. I assume that this is due to the gradient evaluation and the drop in CPU usage that occurs during it.
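To put the two speedups side by side, here is a small sketch of the arithmetic. It assumes that the ideal speedup is simply the ratio of aggregate clock rates (1.65), which ignores any difference in per-core performance between the nodes; the function name is only for illustration.

```python
def parallel_efficiency(observed_speedup, ideal_speedup):
    """Fraction of the ideal speedup that was actually achieved."""
    return observed_speedup / ideal_speedup

# Assumption: ideal speedup = aggregate GHz of both nodes / GHz of node 1.
IDEAL_SPEEDUP = 1.65

scf = parallel_efficiency(1.6, IDEAL_SPEEDUP)   # SCF only
opt = parallel_efficiency(1.3, IDEAL_SPEEDUP)   # full optimization

print(f"SCF efficiency:          {scf:.0%}")   # 97%
print(f"Optimization efficiency: {opt:.0%}")   # 79%
```

So by this rough measure the SCF runs at about 97% of the ideal, while the optimization as a whole reaches only about 79%.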
The problem also exists if I run only 1 process per node, so maybe latency is the limiting factor. A ping request usually takes about 0.08 ms from one node to the other. Do you have any experience with whether this is too long, or how else I might check the latency? Or do you have any ideas on how to decrease the impact of high latency?
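One check I could imagine besides ping is a small TCP round-trip test between the nodes. The following is only a sketch under several assumptions: the helper names `start_echo_server` and `measure_rtt` are made up for illustration, the demo runs over loopback (for a real test the server would have to run on node 2 and the client on node 1), and the latency seen by the parallel run may differ from plain TCP depending on which transport is actually used.

```python
import socket
import threading
import time

def start_echo_server(host="127.0.0.1", port=0):
    """Start a single-connection TCP echo server in a background thread.

    Returns the port it listens on (port=0 lets the OS pick a free one).
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    bound_port = srv.getsockname()[1]

    def serve():
        conn, _ = srv.accept()
        with conn:
            # Disable Nagle's algorithm so small messages are sent immediately.
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            while True:
                data = conn.recv(64)
                if not data:
                    break
                conn.sendall(data)
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return bound_port

def measure_rtt(host, port, rounds=1000):
    """Return (minimum, mean) round-trip time in ms for 1-byte messages."""
    samples = []
    with socket.create_connection((host, port)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(rounds):
            t0 = time.perf_counter()
            sock.sendall(b"x")
            sock.recv(1)  # wait for the echoed byte
            samples.append((time.perf_counter() - t0) * 1e3)
    return min(samples), sum(samples) / len(samples)

if __name__ == "__main__":
    # Loopback demo only; this does not measure the link between the nodes.
    port = start_echo_server()
    lo, mean = measure_rtt("127.0.0.1", port, rounds=1000)
    print(f"min RTT {lo:.3f} ms, mean RTT {mean:.3f} ms")
```

The minimum over many rounds is probably the more useful number, since the mean also picks up scheduling noise on either node.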
Thanks.