
Dissecting PC Server Performance

Bryan J. Smith

Servers are all about I/O. Commodity PC servers have rarely offered high-performance I/O, until now. In this article, I'll dissect PC server design and discuss how to evaluate the performance of AMD Opteron solutions.

All about I/O

"But it's dual processor? And it has a high-speed Front Side Bus?" This has been a frequent response to my recommendations that a system be upgraded. In many cases, replacing not the whole system but just the mainboard will suffice.

Input/Output (I/O) interconnect between disparate components like network, storage, CPU, and memory continues to be one of the most misunderstood concepts in PC servers. It perplexes an increasing number of IT professionals as commodity, I/O-conscious solutions like AMD's Opteron become reality. Basic concepts surrounding how systems interconnect CPU, memory, and I/O must be understood to make effective purchasing decisions. In this article, I'll discuss these foundations and provide real-world examples, challenging common PC technical assumptions with diagrams of typical Intel and AMD PC server solutions.

Interconnect Is Everything

The means by which components in a system connect to each other -- at speeds ranging from megabytes per second (MBps) to gigabytes per second (GBps) -- is generally known as interconnect. There are two major categories of components that are interconnected in a system. These are:

1. System components, such as CPU, memory, bridges.

2. Peripheral (I/O) components, including audio, network, storage, video.

End-user PC technicians are familiar with the Peripheral Component Interconnect (PCI), its higher speed PCI-X variant, and Intel's graphics variant known as the Accelerated Graphics Port (AGP). By the time this article is published, the new serial PCI-Express standard from the PCI Special Interest Group (PCI-SIG) will also be available. PCI continues to form the backbone of the overwhelming majority of PC solutions. The peripheral component interconnect, hereafter referred to as the I/O interconnect, can be a bottleneck on its own, especially for servers. But even a good I/O interconnect, like multiple PCI-X channels, is nothing without a good system component interconnect.

The system component interconnect is more of an issue for engineers. Unlike PCI, PCI-X, AGP, PCI-Express, and other end-user-pluggable components, system components are fixed parts with fixed traces (wires) in a mainboard (a.k.a. motherboard) printed circuit board (PCB). These system components include the CPU(s), memory, and any bridges or chips that translate signaling between CPU(s), memory, and I/O.

Until recently, unlike I/O interconnects, there were few standards when it came to system interconnects. The system was always a collection of "glue logic" between disparate system components (e.g., CPU and memory) as well as to I/O. There was no universal system interconnect, and every platform had its own implementation. This caused increased engineering costs, poor designs in cheaper components, and often an overuse of slower I/O approaches for faster system components. The industry longed for a universal, "glue-less" system component interconnect.

The Traditional PC Chipset

In the PC world, cost is always a factor, which in the past resulted in a heavily duplicated, poor interconnect approach that plagued even small- to medium-business (SMB) servers. Only a few system designers such as ServerWorks (formerly known as Reliance Computer Corporation, RCC) attempted improved system I/O designs for PC OEMs and, even more rarely, for the distributor shelf. These designs were costly and never well understood, so most integrators never saw the total cost benefit. Prior to the 21st century, improved designs were largely limited to RISC/Unix platforms, where cost was not as much of a concern.

The traditional PC chipset is the most well known among PC technicians. The chipset is a collection of bridges for system interconnect, connecting CPU(s), memory, and I/O. The most common PC approach, even for PC servers, was to build a two-chip bridge chipset. The bridge for system interconnect was called the northbridge, and the bridge for I/O interconnect was known as the southbridge. The two were typically connected to each other over an I/O interconnect like PCI or PCI-X. An example of the Athlon, Pentium III, and even some Pentium 4 Xeon system interconnects is illustrated in Figure 1. Contention for the system interconnect -- CPU to/from memory to/from I/O access -- in the traditional chipset design has always plagued the PC.

The Front Side Bottleneck and Shared I/O

A major focal point in the problem of the traditional chipset design begins with the term known as the Front Side Bus (FSB). The FSB implies that there is only a single point of system interconnect between the CPU and the rest of the system (with the Back Side Bus, BSB, connecting the CPU to its cache on the chip itself). The CPU connects to the northbridge over a single bus shared with all other components, which means only two system components can talk to each other at any time. Furthermore, it means that the CPU, memory, and bridges run synchronously, possibly with associated timing and stability issues. In server space, this is nothing less than a "Front Side Bottleneck".

From the I/O perspective, shared I/O buses are always an issue. Today's servers include both 100+ MBps (0.1+ GBps) RAID storage and 100+ MBps (0.1+ GBps) Gigabit Ethernet (GbE), yet the majority of PC systems ship with only a single, shared, 133 MBps (0.125 GBps) PCI bus. Such implementations are instantly saturated and suffer a massive amount of I/O contention between storage and network access. The southbridge often contains onboard ATA and NIC logic, which are usually implemented as PCI masters connected to the same, shared 0.125 GBps (32-bit x 33 MHz) PCI bus as all other peripherals. Some newer southbridges now use a proprietary NIC-only interconnect to remove some contention, but there is still a great amount of contention for just I/O. This is true even before addressing the fact that all CPU-to-memory-to-I/O communication is over the same, shared system interconnect -- be it [A]GTL+ (Pentium Pro-Pentium 4/Xeon) or EV6 (Alpha, Athlon).
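To put rough numbers on this contention, here is a minimal back-of-the-envelope sketch in Python (the bus widths and clocks are the nominal peak figures cited above, used as assumptions; sustained throughput in practice is lower):

  # Theoretical peak bandwidth of a parallel bus, in GBps (1 GBps = 1000 MBps):
  # bytes per transfer (width/8) times transfers per second (clock).
  def bus_peak_gbps(width_bits, clock_mhz):
      return (width_bits / 8) * clock_mhz / 1000.0

  pci_legacy = bus_peak_gbps(32, 33)    # ~0.133 GBps, the single shared PCI bus
  pci_x_133  = bus_peak_gbps(64, 133)   # ~1.066 GBps, one PCI-X channel

  # Approximate sustained demand from the devices discussed above.
  gbe_demand  = 0.125   # Gigabit Ethernet at wire speed, one direction
  raid_demand = 0.100   # a modest hardware RAID array

  print(f"Legacy PCI peak: {pci_legacy:.3f} GBps")
  print(f"PCI-X 133 peak:  {pci_x_133:.3f} GBps")
  print(f"GbE + RAID need: {gbe_demand + raid_demand:.3f} GBps "
        f"(saturates legacy PCI: {gbe_demand + raid_demand > pci_legacy})")

The arithmetic alone shows why a single shared PCI bus cannot carry modern storage and network traffic at the same time, before system interconnect contention is even considered.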

Widening the Bus

Now let's move beyond the $100-$200 PC mainboard. The way makers of most $300-$1000+ PC server mainboards achieved improved performance was to widen the CPU and memory connections to the northbridge. When these connections were widened, a new, more expensive chipset was required. But this still did not solve the issue that only two system components could talk at any time. When a CPU is using the system interconnect, I/O cannot, and vice versa. Worse yet, PCI signaling, which is clearly an I/O interconnect, is still used heavily throughout the system interconnect, even in the latest Intel/ServerWorks chipsets.

At this higher price point, most chipset vendors then bridged additional I/O channels with one or more 266-1066 MBps (0.25-1.0 GBps) 64-bit PCI-X buses, in addition to the 133 MBps (0.125 GBps) PCI/legacy PC bus. But even in this configuration on the latest Intel/ServerWorks chipsets, a transfer still ties up the entire interconnect in the traditional PC chipset design (see Figure 2). In a nutshell, everything still goes through the northbridge (or Memory Controller Hub (MCH) in Intel nomenclature), even for most four-way Intel Xeon MP solutions.

Non-Uniform Memory Access (NUMA)

One reason that several RISC/Unix platforms (as well as the costly new Intel IA-64 Itanium) have remained popular as server platforms is that they do not implement a shared system interconnect like the traditional PC northbridge design. While some I/O is still shared, at least CPUs have direct connections to memory. This Non-Uniform Memory Access (NUMA) approach localizes memory to each CPU, so there is no contention over a shared CPU bus to memory.

Aside from the cost, the only downside to NUMA is that a memory location needed by a program may not be local to the CPU on which it is executing. Therefore, for the best performance, a NUMA-aware OS allocates memory for a program, or any of its threads, on the same CPU on which it is executing. This is known as processor affinity. The result is an OS-hardware combination that scales far better (increasing in performance almost linearly) beyond dual-processor than Xeon MP could.
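Processor affinity can also be requested explicitly by an application or an administrator. Here is a minimal sketch (Linux-specific, using Python's standard wrapper around the kernel's affinity call; the CPU number chosen is only an example):

  import os

  pid = 0  # 0 means "the calling process"

  # Show which CPUs the scheduler may currently run this process on.
  print("Current affinity:", os.sched_getaffinity(pid))

  # Pin the process to CPU 0.  On a NUMA system, a NUMA-aware kernel will
  # then tend to satisfy this process's memory allocations from the node
  # local to CPU 0, keeping memory accesses off the CPU-to-CPU links.
  os.sched_setaffinity(pid, {0})
  print("New affinity:", os.sched_getaffinity(pid))

The same effect can be had without code changes through tools such as taskset(1).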

A select handful of PC OEMs then either designed or licensed their own chipset-mainboard designs with NUMA for Xeon MP. Although proprietary and quite costly, 4-way Xeon MP systems implementing NUMA scaled far better than non-NUMA 4-way Xeon MP systems. They performed closer to 4-way RISC or Itanium but were not as costly. An example NUMA Xeon MP system is illustrated in Figure 3 and shows the reduction in contention when memory and I/O access are segmented.

NUMA Xeon MP systems addressed several details, but other issues remained:

1. They required non-commodity system interconnect (support) chips, increasing cost.

2. The Xeon, system interconnect, and memory were typically on a daughter card, adding more cost.

3. The CPU itself still had a single FSB, so each CPU could still do only one transfer at a time.

4. CPU-to-CPU and CPU-to-I/O transfers were still over a shared, "old-style" FSB.

HyperTransport: Commodity System Interconnect

Xerox PARC provided much of the innovation for the 1980s. Digital Semiconductor did much of the same for the 1990s. From PCI logic to custom bridges to endless Alpha chipset approaches, Digital Semiconductor led much of the charge in high-throughput system designs. The switched, up-to-16-node Digital EV6 northbridge design of Slot-A, Slot-B, and Socket-462 for the Digital Alpha 21264 (and the less expensive AMD Athlon MP) was an improvement over the traditional shared-bus northbridge of Intel PC systems. However, it still did not address the issue of a single point of contention -- the northbridge. Upon the sell-a-thon of former CEO Palmer, Digital saw a mass exodus of its Semiconductor Division's engineers to two organizations -- Alpha Processor Inc. (API, later known as API Networks and recently purchased outright by AMD) and AMD itself.

It was no secret that API Networks and AMD were working on a new way to approach interconnect -- one to compete with Intel's IA-64 Itanium. The original Lightning Data Transport (now known as HyperTransport) was designed to address:

  • A universal, "glue-less" interconnect for CPUs.
  • A scalable-speed, "glue-less" interconnect for bridges.

HyperTransport significantly reduces engineering design effort and cost, turning what was previously a custom, high-end, redesign-intensive concept for engineers into a glue-less, place-and-route, platform-agnostic technology. This approach is so universal and ideal for system interconnect that nVidia and SiS use it in their chipsets for Intel Pentium 4 and Xeon platforms instead of the more I/O-centric PCI signaling. A HyperTransport bridge or tunnel can be attached anywhere HyperTransport is offered. Adding AGP, PCI-X, PCI, or even the forthcoming PCI-Express to any mainboard is literally a snap for engineers, and it actually reduces some of the pin and wire-trace count versus equivalent, legacy approaches in RISC or high-end Xeon systems.

In fact, the traditional PC print and Internet media continues to trip over itself trying to predict whether and when AMD's x86-64 will be adopted by Apple, IBM, Sun, and other hardware vendors. This is a testament to how wholly undervalued interconnect is: Apple, IBM, Sun, and many other vendors have licensed HyperTransport for their non-x86 designs, not so much x86-64 itself (although Sun is porting Solaris to x86-64). HyperTransport solves issues for non-PC platforms as much as for the PC.

But while HyperTransport addresses design cost, it does not solve system and I/O interconnect on its own. HyperTransport needed a flagship platform. Not surprisingly, AMD had one in mind.

Opteron: NUMA and HyperTransport

There is no such thing as an FSB in the Athlon 64, much less the Opteron. With the introduction of the Athlon 64/Opteron, FSB becomes an extremely poor measure of interconnect. All AMD x86-64 processors have no fewer than two (and up to five) independent system interconnects on each individual CPU. The points of system interconnect on Socket-754, 939, and 940, as well as Socket-478, 603/604, and LGA-775, are illustrated in Figure 4.

The Athlon 64 and Opteron move the traditional concept of a northbridge entirely into the CPU itself. Any PC-enthusiast review referencing a 600 (1200 mega-transfer, 1200 MT), 800 (1600 MT), or 1000 MHz (2000 MT) "FSB" in an Athlon 64 configuration is merely referencing one of at least two points of connection on the Athlon 64/Opteron. Even single-processor Athlon 64 processors have one or more dedicated memory channels separate from the HyperTransport I/O.

The key to Opteron 200/800 scalability is in the additional HyperTransport channels for CPU-to-CPU interconnect. It is not merely a switched system interconnect versus a bus (as Alpha/Athlon EV6 is to Intel [A]GTL+); each CPU has direct connections to its own memory, to other CPUs, and then to I/O. To draw an analogy to Ethernet: AGTL+ (Xeon MP) systems connect to each other, to memory, and to I/O all through the same Ethernet hub; EV6 (Athlon MP) is like having an Ethernet switch; and NUMA/HyperTransport (Opteron) is like having direct Ethernet connections between nodes (no hub or switch at all).

The memory connections are glue-less, 184-pin DDR SDRAM channels connected directly to the chip. There is one DDR channel in Socket-754 and two DDR channels in Socket-939/940 (noting that 754 + 184 = 938, there are two true, full, non-multiplexed DDR buses in the case of Socket-939/940). There are several advantages to this. One is the fact that memory speed no longer limits the performance of any other interconnect. In other words, the memory clock is its own, independent divider of the local CPU clock -- separate from all others, including I/O. Using DDR266/PC2100 (2.1 GBps/channel) instead of DDR400/PC3200 (3.2 GBps/channel) has no impact on the CPU speed or HyperTransport clocks. And, it gets better.
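The per-channel figures above fall straight out of the DDR arithmetic: 64 data bits per channel and two transfers per clock. A quick sketch:

  # Peak bandwidth of one 64-bit DDR channel: 8 bytes x 2 transfers per clock.
  def ddr_channel_gbps(base_clock_mhz):
      return 8 * 2 * base_clock_mhz / 1000.0

  print("DDR266/PC2100:", ddr_channel_gbps(133), "GBps per channel")   # ~2.1
  print("DDR400/PC3200:", ddr_channel_gbps(200), "GBps per channel")   # 3.2
  print("Dual-channel Socket-940:", 2 * ddr_channel_gbps(200), "GBps") # 6.4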

In proprietary NUMA Xeon MP, if a program on one CPU needs to access memory located on another, it must use the shared, old-style FSB between CPUs. Not only can just one CPU access that bus at a time, but the transfer prevents any other transfer -- CPU to/from memory to/from I/O -- from occurring. In HyperTransport Opteron, each CPU has a dedicated connection to another CPU (or is within one hop of another, in the case of Opteron 800), so there is no (or limited) contention with other memory, CPU, or I/O transfers (see Figure 5).

Opteron: I/O MMU

The story of the Opteron's I/O Memory Management Unit (MMU) begins with EV6 and the 32-bit Athlon MP. EV6, designed for the 64-bit Alpha, uses 40-bit physical addressing. Unlike Intel AGTL+, which is a "bus", EV6 is a point-to-point (up to 16 node) interconnect. As such, to maintain consistency with select I/O devices (e.g., AGP), AMD had to move I/O memory management out of the northbridge and onto each processor. The result was that each 32-bit Athlon could theoretically (and, as implemented by a few mainboard BIOSes with a "Linux" memory access option, safely) handle not only memory addressing but memory-mapped I/O, up to 40 bits (1 TB).

Opteron merely extends this to 48-bit, the largest size compatible with the 256 TB address space of the i486's Translation Lookaside Buffer (TLB), with a formal I/O MMU on each Opteron. Opterons directly talk to each other over dedicated, HyperTransport links with all memory or I/O localized to each Opteron. So maintaining consistency with the cache, memory, or memory-mapped I/O of one Opteron is inherently accommodated in the access from another. Best of all, any transfers from any I/O to memory are local to the specific Opteron to which they are connected. Thus, the concept of "processor affinity" under an Opteron is used not only for executing processes and memory, but for localizing I/O transfers between memory and the memory mapped I/O -- the ultimate in server I/O efficiency and performance.
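The address-space figures quoted above are simple powers of two, which a few lines confirm: 32 bits covers 4 GB, 40 bits covers 1 TB, and 48 bits covers 256 TB of combined memory and memory-mapped I/O:

  GB = 2 ** 30
  TB = 2 ** 40

  print("32-bit:", 2 ** 32 // GB, "GB")                   # 4 GB
  print("40-bit (EV6/Athlon MP):", 2 ** 40 // TB, "TB")   # 1 TB
  print("48-bit (Opteron):", 2 ** 48 // TB, "TB")         # 256 TB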

Intel AGTL+, on the other hand, was designed only for 32-bit processors with 32-bit addressing, or 36-bit addressing using Physical Address Extension (PAE). Some AGTL+ platforms do not even offer 36-bit PAE and are physically limited to 32-bit. In IA-32e processors (Intel's 64-bit extensions), memory-mapped I/O addressing is only protected and guaranteed in hardware below 4 GB. IA-32e also lacks a true I/O MMU, so it does not implement those x86-64 instructions found in AMD's 64-bit processors. Under Linux/x86-64 (among other x86-64 kernels), a "software I/O MMU" has been adopted for IA-32e platforms to make up for this deficiency versus the AMD x86-64 implementation.

Until IA-32e switches away from Socket-604 (which Intel has hinted at doing by merging with IA-64/Itanium's Infiniband interconnect in 2006+), this is a hard, physical limitation of the platform itself. Thus, all current and near-future IA-32e servers should not be considered for I/O-intensive applications when the solution calls for more than 4 GB of physical RAM (combined system and memory-mapped I/O).

Evaluating Opteron Solutions

Now that I've shown that the Opteron is the greatest thing since sliced bread, it's time to talk about evaluating Opteron solutions. OEMs often cut corners in products, and the Opteron is no exception. In fact, the Opteron is so flexible, it can actually be designed to perform worse than Intel solutions. Figure 6 shows an example of a very common, low-cost, dual-Opteron mainboard where the OEM has cut corners.

Why might an OEM do this? Traces on a mainboard cost money. Since each DDR memory channel is no less than 184 pins, each individual Opteron requires 368 additional traces just for memory (not including I/O or connections to other CPUs). Opterons can use memory from other Opterons in the same system, so it's possible for one Opteron to have no memory attached and to access the memory connected to another Opteron. Mainboard designs as illustrated in Figure 6 are more common than you might think.
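On a running Linux system, one way to spot such a corner-cut board (or a box populated with DIMMs behind only one CPU) is to check how much memory each NUMA node reports. Here is a minimal sketch, assuming the sysfs NUMA layout exposed by 2.6-era and later Linux kernels:

  import glob
  import re

  # Each /sys/devices/system/node/nodeN entry corresponds to one NUMA node
  # (one Opteron and its local memory).  A node reporting zero MemTotal has
  # no local memory and must borrow from a neighbor over HyperTransport.
  for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
      try:
          with open(node + "/meminfo") as f:
              match = re.search(r"MemTotal:\s+(\d+)\s*kB", f.read())
          total_kb = int(match.group(1)) if match else 0
      except OSError:
          total_kb = 0
      print(node.rsplit("/", 1)[-1], ":", total_kb // 1024, "MB local memory")

A node showing 0 MB is exactly the Figure 6 situation: that Opteron pays a HyperTransport hop for every memory access.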

Another consideration, also illustrated, is the lack of an AMD8131/8132 dual-PCI-X 1.0/2.0 HyperTransport tunnel. Even with all that interconnect, most low-cost Opteron mainboards ship with just an AMD8111 PCI/LPC HyperTransport bridge, which offers only 0.125 GBps of I/O (excluding video). Most nVidia, SiS, and VIA chipsets have similar limitations in I/O, rendering the interconnect advantages of Opteron useless outside of memory performance. In these chipsets, even though the second Opteron 200 in a dual-processor system has an open and available HyperTransport link for I/O, adding the chip and traces for PCI-X would increase costs.

So, although the Opteron offers a commodity, high-performing I/O platform for server applications at a price point far lower than RISC, Itanium, and NUMA Xeon MP, the market is also flooded with cheap Opteron designs available for less than $200. System integrators and systems administrators alike must be careful to purchase Opteron 200/800 systems with quality mainboard implementations.

Here is a checklist that will help ensure you are purchasing a high-quality system:

  • Does each (not just one) Opteron have 144-bit (dual-DDR ECC) memory channels?
  • Does the system offer more than a single, shared, legacy PCI bus (e.g., at least one AMD8131/8132 HyperTransport dual-PCI-X 1.0/2.0 tunnel)? See the sketch after this checklist for one way to verify this on a running system.
  • Are all of the dual-PCI-X HyperTransport tunnels/bridges daisy-chained off a single CPU, or, in a 4-way system, is each tunnel connected to its own CPU?
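One way to verify the second checklist item on a live system is to look for the tunnel in the PCI device listing. Here is a minimal sketch that greps lspci output; the device-name strings are the ones lspci commonly prints for these parts and should be treated as assumptions rather than guaranteed matches:

  import subprocess

  # Names under which the AMD HyperTransport PCI-X tunnels and the legacy
  # PCI/LPC bridge commonly appear in lspci output (assumed strings).
  TUNNELS = ("AMD-8131", "AMD-8132")
  LEGACY_BRIDGE = "AMD-8111"

  lspci = subprocess.run(["lspci"], capture_output=True, text=True).stdout
  lines = lspci.splitlines()

  tunnel_lines = [l for l in lines if any(t in l for t in TUNNELS)]
  legacy_lines = [l for l in lines if LEGACY_BRIDGE in l]

  print("PCI-X tunnel functions found:", len(tunnel_lines))
  print("AMD-8111 legacy bridge present:", bool(legacy_lines))
  if not tunnel_lines:
      print("Warning: no AMD-8131/8132 tunnel; all I/O may share one legacy PCI bus.")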

The difference in design can mean the difference between having a system that performs like the AMD reference designs and a system that offers less than a quality Xeon MP system.

PCI-Express and DDR2

By the time this article is published, two new technologies will have arrived -- PCI-Express and DDR2 SDRAM. Both are being readily implemented on the single processor, consumer Intel LGA-775 platform, although Socket-604 Xeon MP adoption is forthcoming.

PCI-Express is both a graphics and generic I/O interconnect. It can co-exist with AGP or replace it. From the standpoint of storage and network connectivity, each x1 (and rarer x4) slot can provide 0.125 GBps+ of dedicated bandwidth. These are ideal for entry-level, single-processor servers versus today's single, shared, 0.125-GBps PCI systems. But multiple PCI-X channels are still recommended for higher-end systems and I/O. In fact, one major OEM has prototyped a dual-processor workstation/server that attaches an nVidia PCI-Express chipset (with x16, x1/x4 slots) to one Opteron, while attaching the AMD8131 dual-PCI-X tunnel to another. Such is the flexibility that HyperTransport affords system designers, at a price point that was previously unheard of in any platform (sub-$500 list).

Sun Microsystems is taking HyperTransport one step further in a new Opteron workstation line. They use a modular, 6.4-GBps HyperTransport connector between the top and bottom portions of the mainboard. The end user can buy the bottom portion with AGP/PCI-X tunnels today and replace it with PCI-Express tunnels tomorrow. Meanwhile, Sun's Opteron 100/200 processor solution already offers as much I/O today (four PCI-X channels using two AMD8131 tunnels, in addition to the AGP and legacy PC I/O) as AMD's reference quad-Opteron 800 design.

DDR2 enthusiasts will be quick to note that Intel is offering 533 MHz and, soon, 667 MHz JEDEC DDR2 memory module compatibility. Although the throughput looks good on paper, AGTL+ is still a shared system interconnect with a synchronous clock. The result is limited potential benefit from the additional throughput of the new technology, even on Socket-604 with IA-32e. Although some reviews have argued that AMD is less flexible because the memory controller is on the CPU, and not the chipset, there is nothing stopping AMD from releasing Opterons with DDR2 memory support. AMD has hinted that it will do so if and when the demand is there, which seemingly means waiting for DDR2 to reach commodity pricing. Unlike shared AGTL+, the performance of the CPU or I/O interconnect in the Opteron is not tied to the memory clock and is far less affected by it.

Conclusion

AMD's Opteron opens up the previously exclusive realm of high-performing NUMA and I/O systems design to the masses at a price point far below a RISC or Itanium product. But, as with any commodity, OEMs will cut corners and offer less-than-ideal designs, because extra traces and HyperTransport bridge/tunnel chips increase mainboard manufacturing costs.

System integrators and systems administrators must be wary of these products and carefully review specification sheets to verify that solutions are designed well. In this article, I introduced these concepts along with a checklist of capabilities to look for in an Opteron solution, including considerations for forthcoming PCI-Express and DDR2 systems.

When all else fails, trusting a reputable OEM is always of significant value. Hewlett-Packard's (HP) ProLiant DL585 is one of the first Opteron 800 solutions to implement the full AMD reference design of two AMD8131/8132 dual-PCI-X HyperTransport tunnels. Even running legacy 32-bit Windows Server and applications, the DL585 is tearing up TPC-C, SAP, and MS Exchange benchmarks, outclassing other, far more costly 2-, 4-, and even 8-way systems. Various enthusiast sites have found similar performance in the scalability of 2- and 4-way Opteron 200/800 systems with 32-bit Windows Server. They've reported far higher capability with Linux/x86-64 and its full support for the I/O MMU, which Intel IA-32e does not offer in Socket-604.

I sincerely hope that the Front Side Bus (FSB) will be dropped as a measure of interconnect. It assumes only one point of connection from the CPU to the rest of the system, which is no longer true even in AMD's "new Duron" -- the Socket-754 Athlon 64. Therefore, FSB is not a valid reference even for consumer desktops, much less the enterprise Opteron 200/800. Thus, to end this article, I offer my new term of reference -- Aggregate Front Side Throughput (AFST), measured in GBps for the best of each individual processor type in Table 1.
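As a hedged illustration of how such an aggregate figure might be computed, the sketch below sums every independent link one Opteron 800 can drive concurrently. The per-link numbers are nominal figures for 16-bit, 800 MHz HyperTransport and DDR400, used here only as assumptions; Table 1 carries the actual comparison:

  # Illustrative Aggregate Front Side Throughput (AFST) for one Opteron 800.
  ht_link_gbps    = 6.4   # one 16-bit, 800 MHz HyperTransport link (both directions)
  ddr400_gbps     = 3.2   # one 64-bit DDR400 memory channel
  ht_links        = 3     # Opteron 800: up to three HyperTransport links per CPU
  memory_channels = 2     # Socket-940: dual DDR channels

  afst = ht_links * ht_link_gbps + memory_channels * ddr400_gbps
  print("Illustrative AFST per Opteron 800:", afst, "GBps")   # 25.6

  # Contrast: a shared 64-bit, 533 MT/s FSB is a single point of connection.
  fsb_gbps = 8 * 533 / 1000.0
  print("Shared 533 MT/s FSB:", round(fsb_gbps, 1), "GBps total, shared by everything")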

Bryan J. Smith has an educational background in computer architecture (BSCpE, UCF) and has spent much of his 12-year engineering/IT career applying systems design principles in the aerospace, civil, educational, financial, semiconductor, and software engineering industries. For the past 4 years, he has provided engineering, IT, and training services to a variety of clients. Bryan and his wife, Lourdes, live in Orlando. He can be contacted at: b.j.smith@ieee.org.