More Nodes or a Higher-Performance Interconnect?

If you want to boost the performance of your cluster, you typically have two choices: buy more nodes or get a higher-performance interconnect. However, interconnect cost often does not scale linearly with the number of ports.

Smaller systems can be built with tightly integrated, unmanaged switches that cost less than $100 per port. However, these inexpensive switches are not suitable for building larger networks. If the cluster size dictates managed Layer 3 switches, the cost jumps to $500-$1,000 per port. Switches often provide 2^n or 2^m + 2^n ports, and once a switch enclosure is fully populated, expanding beyond it raises the cost per port. For example, if a single switch enclosure supports 64 ports, you will need six fully populated enclosures to support 128 ports while maintaining full bisection bandwidth -- a three-fold per-port price premium. In other words, the last 64 ports will cost five times as much as the first 64. (See Figure 3.)
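To make the arithmetic concrete, here is a rough sketch of that cost curve in Python. It assumes a non-blocking two-level fat tree built from 64-port enclosures at a flat, hypothetical price per enclosure; the figures are illustrative, not vendor pricing.

    import math

    def enclosures_needed(ports, ports_per_switch=64):
        """Enclosures for a non-blocking two-level fat tree (hypothetical model).

        A single enclosure connects everything directly; beyond that, each
        leaf switch gives half its ports to nodes and half to uplinks, and
        spine switches terminate the uplinks.
        """
        if ports <= ports_per_switch:
            return 1
        down_per_leaf = ports_per_switch // 2
        leaves = math.ceil(ports / down_per_leaf)
        spines = math.ceil(leaves * down_per_leaf / ports_per_switch)
        return leaves + spines

    price_per_enclosure = 32_000  # hypothetical: 64 ports at ~$500 per port
    for ports in (64, 128, 256):
        n = enclosures_needed(ports)
        print(f"{ports:4d} ports: {n:2d} enclosures, "
              f"${n * price_per_enclosure / ports:,.0f} per port")

Under those assumptions, the step from 64 to 128 ports reproduces the three-fold per-port premium described above, because five additional enclosures buy only 64 additional usable ports.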

Applications can also be very sensitive to latency, and that sensitivity grows with the processing power available in the cluster: the faster the nodes compute, the more the cost of communication stands out. This is particularly true when the system is scaled. What is the overhead for the application to send or receive a message, and how long are packets in transit? How do these numbers change as the cluster grows? It is this limitation that ultimately constrains the scalability of any application. At some point, it is not cost-effective to add more hardware to support the application.
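A simple way to frame those questions is the usual latency-plus-bandwidth model: the time to deliver a message is a fixed per-message overhead plus its size divided by the link bandwidth. The sketch below uses assumed figures (50 microseconds of overhead, 100 MB/s of bandwidth, roughly a commodity interconnect of the day) rather than measurements of any real network:

    def message_time(size_bytes, latency_us=50.0, bandwidth_mb_s=100.0):
        """Delivery time for one message: fixed overhead plus size / bandwidth.

        The default latency and bandwidth are illustrative assumptions, not
        measurements of any particular interconnect.
        """
        return latency_us * 1e-6 + size_bytes / (bandwidth_mb_s * 1e6)

    for size in (64, 1024, 1024 * 1024):
        t = message_time(size)
        on_wire = size / (100.0 * 1e6)   # time actually spent moving bytes
        print(f"{size:8d} bytes: {t * 1e6:9.1f} us, "
              f"{100 * on_wire / t:5.1f}% of it on the wire")

For small messages, almost all of the time is overhead -- and that is exactly the cost that is paid more often as the cluster grows.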

However, larger clusters do offer two benefits. First, systems administration costs can be reduced by running fewer but larger clusters. Second, if the cluster is used primarily for throughput-oriented work, several instances of the application can run on different parts of the cluster at the same time.

That said, different applications have different requirements for the interconnect; they can be sensitive to latency, bandwidth, or both. How you scale the application therefore matters. Scaling the system to reduce turnaround time often changes the application's communication pattern: on a few nodes, the application sends a few large messages infrequently, whereas on many nodes (hundreds or thousands) it typically sends short messages at much tighter intervals. Frequent exchange of small messages puts a far greater burden on the interconnect than infrequent exchange of large messages.
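That shift can be sketched with the same message-time model. Assume, purely for illustration, a fixed 64 MB of data that must be exchanged every step and is divided evenly among the nodes, each node sending its share as one message:

    def per_message_profile(nodes, total_mb=64.0, latency_us=50.0,
                            bandwidth_mb_s=100.0):
        """Message size and latency share for a hypothetical fixed-size exchange."""
        size = total_mb * 1e6 / nodes                        # bytes per message
        per_msg = latency_us * 1e-6 + size / (bandwidth_mb_s * 1e6)
        return size, (latency_us * 1e-6) / per_msg

    for nodes in (8, 128, 2048):
        size, latency_share = per_message_profile(nodes)
        print(f"{nodes:5d} nodes: {size / 1e3:8.1f} kB per message, "
              f"{latency_share:6.2%} of each message is pure latency")

The same data costs more to move as the cluster grows, because the fixed per-message overhead is paid more and more often relative to the bytes it carries.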

This restriction on the scalability of the application is described by Amdahl's law. For systems administrators, Amdahl's law implies that the behavior of an application on a small cluster cannot simply be extrapolated to a larger one. An application that appears to run well over a legacy interconnect on a handful of nodes may well require an exotic, low-latency interconnect to take advantage of a larger cluster.
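For reference, Amdahl's law bounds the speedup on p nodes at 1 / ((1 - f) + f/p), where f is the fraction of the work that can be parallelized. The values of f below are chosen purely for illustration:

    def amdahl_speedup(parallel_fraction, nodes):
        """Upper bound on speedup when a (1 - parallel_fraction) share stays serial."""
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / nodes)

    for nodes in (8, 64, 512):
        print(f"{nodes:4d} nodes: "
              f"f=0.95 -> {amdahl_speedup(0.95, nodes):5.1f}x, "
              f"f=0.99 -> {amdahl_speedup(0.99, nodes):5.1f}x")

An application that is 95 percent parallel looks fine on 8 nodes but tops out near 20x no matter how many nodes you add, and the communication overhead discussed above effectively shrinks f as the cluster grows.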

You can also scale the system to solve a larger problem. In that case, findings from a small cluster can often be carried over to a larger one, particularly for the computational phase. On the other hand, an application's input and output phases may be handled sequentially; special care is then needed to avoid a bottleneck that restricts the performance gain offered by the larger cluster.
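The effect of a sequential I/O phase can be sketched the same way. Assume, hypothetically, a job with 100 hours of perfectly scalable computation and 2 hours of I/O that remains serial:

    def job_time(nodes, compute_hours=100.0, serial_io_hours=2.0):
        """Hypothetical job: computation scales with nodes, I/O stays sequential."""
        return serial_io_hours + compute_hours / nodes

    single_node = job_time(1)
    for nodes in (16, 128, 1024):
        t = job_time(nodes)
        print(f"{nodes:5d} nodes: {t:6.2f} h, speedup {single_node / t:5.1f}x")

However large the cluster, the speedup in this example can never exceed 51x; parallelizing or overlapping the I/O phase is what lifts that ceiling.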