Dissecting
PC Server Performance
Bryan J. Smith
Servers are all about I/O. Commodity PC servers have rarely offered
high-performance I/O, until now. In this article, I'll dissect
PC server design and discuss how to evaluate the performance of
AMD Opteron solutions.
All about I/O
"But it's dual processor? And it has a high-speed Front
Side Bus?" This has been a frequent response to my recommendations
that a system be upgraded. In many cases, replacing not the whole
system but just the mainboard will suffice.
Input/Output (I/O) interconnect between disparate components like
network, storage, CPU, and memory continues to be one of the most
misunderstood concepts in PC servers. It is perplexing to an increasing
number of IT professionals as commodity, I/O-conscious solutions
like AMD's Opteron become reality. Basic concepts surrounding
how systems interconnect CPU, memory, and I/O must be understood
to make effective purchasing decisions. In this article, I'll
discuss these foundations and provide real-world examples by breaking
common PC technical thought with diagrams of common Intel and AMD
PC server solutions.
Interconnect Is Everything
The means by which components in a system connect to each other
-- from megabytes per second (MBps) to gigabytes per second
(GBps) -- is generally known as interconnect. There are two
major categories of components that are interconnected in a system.
These are:
1. System components, such as CPU, memory, bridges.
2. Peripheral (I/O) components, including audio, network, storage,
video.
End-user PC technicians are familiar with the Peripheral Component
Interconnect (PCI), its higher-speed PCI-X variant, and Intel's
graphics variant known as the Accelerated Graphics Port (AGP). By
the time this article is published, the new serial PCI-Express standard
from the PCI Special Interest Group (PCI-SIG) will also be available. PCI
continues to form the backbone of the overwhelming majority of PC
solutions. The peripheral component interconnect, hereafter referred
to as the I/O interconnect, can be a bottleneck on its own, especially
for servers. But even a good I/O interconnect, like multiple PCI-X
channels, is nothing without a good system component interconnect.
The system component interconnect is more of an issue for engineers.
Unlike PCI, PCI-X, AGP, PCI-Express, and other end-user/technical
pluggable components, system components are fixed components with
fixed traces (wires) in a mainboard (a.k.a. motherboard) printed
circuit board (PCB). These system components include the CPU(s),
memory, and any bridges or chips that translate the signaling of
traces, between CPU(s), memory, and I/O.
Until recently, unlike I/O interconnects, there were few standards
when it came to system interconnects. The system was always a collection
of "glue logic" between disparate system components (e.g.,
CPU and memory) as well as to I/O. There was no universal system
interconnect, and every platform had its own implementation. This
caused increased engineering costs, poor designs in cheaper components,
and often an overuse of slower I/O approaches for faster system
components. The industry longed for a universal, "glue-less"
system component interconnect.
The Traditional PC Chipset
In the PC world, cost is always a factor, which in the past resulted
in a heavily duplicated, poor interconnect approach that even plagued
small- to medium-business (SMB) servers. Only a few system designers
such as ServerWorks (formerly known as Reliance Computer Corporation,
RCC) attempted improved system I/O designs for PC OEMs and, even
more rarely, on the distributor shelf. These designs were costly
and never well understood, so most integrators never saw the total
cost benefit. Such designs were largely limited to RISC/Unix platforms,
where cost was not as much of a concern prior to the 21st century.
The traditional PC chipset is the most well known among PC technicians.
The chipset is a collection of bridges for system interconnect,
connecting CPU(s), memory, and I/O. The most common PC approach,
even used for PC servers, was to build a two-chip bridge chipset.
The bridge for system interconnect was called the northbridge, and
the bridge for I/O interconnect was known as the southbridge. The
two were connected to each other typically over an I/O interconnect
like PCI or PCI-X. An example of the Athlon, Pentium III, and even
some Pentium 4 Xeon system interconnects is illustrated in Figure
1. Contention for system interconnect, CPU to/from memory to/from
I/O access, in the traditional chipset design has always plagued
the PC.
The Front Side Bottleneck and Shared I/O
A major focal point of the problem with the traditional chipset
design is the term Front Side Bus (FSB). The name suggests that
there is only a single point of system interconnect between the
CPU and the rest of the system (the Back Side Bus, BSB, being on
the chip itself). The CPU connects to the northbridge over a single
bus shared with all other system components, which means only two
system components can talk to each other at any time. Furthermore,
it means that the CPU, memory, and bridges run synchronously, possibly
with associated timing and stability issues. In server space, this is
nothing less than a "Front Side Bottleneck".
From the I/O perspective, shared I/O buses are always an issue.
Today's servers include both 100+ MBps (0.1+ GBps) RAID storage
and 100+ MBps (0.1+ GBps) Gigabit Ethernet (GbE), yet the majority
of PC systems ship with only a single, shared, 133 MBps (0.125 GBps)
PCI bus. Such implementations are instantly saturated and suffer
a massive amount of I/O contention between storage and network access.
The southbridge often contains onboard ATA and NIC logic, which
are usually implemented as PCI masters connected to the same, shared
0.125 GBps (32-bit x 33 MHz) PCI bus as all other peripherals. Some
newer southbridges now use a proprietary NIC-only interconnect to
remove some contention, but there is still a great amount of contention
for just I/O. This is true even before addressing the fact that
all CPU-to-memory-to-I/O communication is over the same, shared
system interconnect -- be it [A]GTL+ (Pentium Pro-Pentium 4/Xeon)
or EV6 (Alpha, Athlon).
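To put that contention in perspective, here is a back-of-the-envelope
sketch using my own illustrative peak numbers (a single GbE NIC plus
a modest RAID array), not figures from any particular vendor:

/* Back-of-the-envelope arithmetic: how quickly a single, shared
 * 32-bit/33 MHz PCI bus saturates when a GbE NIC and a RAID
 * controller are both bus masters. Peak numbers only; sustained
 * PCI throughput is lower still. The 100 MBps array rate is an
 * assumption for illustration. */
#include <stdio.h>

int main(void)
{
    const double pci_mbps  = (32.0 / 8.0) * 33.0;  /* ~133 MBps, shared  */
    const double gbe_mbps  = 1000.0 / 8.0;         /* ~125 MBps per NIC  */
    const double raid_mbps = 100.0;                /* assumed array rate */

    double demand = gbe_mbps + raid_mbps;
    printf("Shared PCI bus capacity : %6.1f MBps\n", pci_mbps);
    printf("NIC + RAID peak demand  : %6.1f MBps\n", demand);
    printf("Oversubscription        : %6.2f x\n", demand / pci_mbps);
    return 0;
}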
Widening the Bus
Now let's move beyond the $100-$200 PC mainboard. The way
makers of most $300-$1000+ PC server mainboards achieved improved
performance was to widen the CPU and memory connections to the northbridge.
When these connections were widened, a new, more expensive chipset
was required. But this still did not solve the issue that only two
system components could talk at any time. When a CPU is using the
system interconnect, I/O cannot, and vice versa. Worse yet, PCI
signaling, which is clearly an I/O interconnect, is still used heavily
throughout the system interconnect, even in the latest Intel/ServerWorks
chipsets.
At this higher price point, most chipset vendors then bridged
additional I/O channels with one or more 266-1066 MBps (0.25-1.0
GBps) 64-bit PCI-X buses, in addition to the 133 MBps (0.125 GBps)
PCI/Legacy PC bus. But even in this configuration on the latest
Intel/ServerWorks chipsets, the transfer still ties up the entire
interconnect in the traditional PC chipset design (see Figure 2).
In a nutshell, everything still goes through the northbridge (or
Memory Controller Hub (MCH) in Intel nomenclature) even for most
four-way Intel Xeon MP solutions.
Non-Uniform Memory Access (NUMA)
One reason that several RISC/Unix platforms (as well as the costly
new Intel IA-64 Itanium) have remained popular as server platforms
is that they do not implement a shared system interconnect like
the traditional PC northbridge design. While some I/O is still shared,
at least CPUs have direct connections to memory. This Non-Uniform
Memory Access (NUMA) approach localizes memory to each CPU,
so there is no contention over a shared CPU bus to memory.
Aside from the cost, the only downside to NUMA is that a memory
location needed by a program may not be local to the CPU on which
it is executing. Therefore, for the best performance, a NUMA-aware
OS allocates memory for a program, or any of its threads, from memory
local to the CPU on which it is executing. This is known as processor affinity.
The result is an OS-hardware combination that scales far better
(increasing in performance almost linearly) beyond dual-processor
than Xeon MP could.
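To make processor affinity concrete, here is a minimal Linux sketch
(assuming libnuma is installed; build with -lnuma) that pins the calling
process to one CPU and allocates a buffer from that CPU's local memory
node. A NUMA-aware OS does this automatically for well-behaved programs;
the sketch only spells out the idea:

/* Pin to a CPU, then allocate from that CPU's local memory node.
 * Requires a NUMA-capable kernel and libnuma (link with -lnuma). */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "No NUMA support on this kernel/platform\n");
        return 1;
    }

    /* Keep this process on CPU 0 so it is not migrated elsewhere. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Allocate from the memory node local to the current CPU, so
     * CPU-to-memory traffic never crosses a CPU-to-CPU link. */
    size_t len = 64UL * 1024 * 1024;
    void *buf = numa_alloc_local(len);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_local failed\n");
        return 1;
    }

    printf("Pinned to CPU 0, %zu MB allocated from the local node\n",
           len >> 20);
    numa_free(buf, len);
    return 0;
}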
A select handful of PC OEMs then either designed or licensed their
own chipset-mainboard designs with NUMA for Xeon MP. Although proprietary
and quite costly, 4-way Xeon MP systems implementing NUMA scaled
far better than non-NUMA 4-way Xeon MP systems. They performed closer
to 4-way RISC or Itanium but were not as costly. An example NUMA
Xeon MP system is illustrated in Figure 3 and shows the reduction
in contention when memory and I/O access are segmented.
NUMA Xeon MP systems addressed several details, but other issues
remained:
1. They required non-commodity system interconnect support chips,
increasing cost.
2. The Xeon CPUs, system interconnect, and memory were typically on
a daughter card, adding more cost.
3. The CPU itself still had a single FSB, so the CPU still could
only do one transfer at a time.
4. CPU-to-CPU and CPU-to-I/O transfers were still over a shared,
"old-style" FSB.
HyperTransport: Commodity System Interconnect
Xerox PARC provided much of the innovation for the 1980s. Digital
Semiconductor did much of the same for the 1990s. From PCI logic
to custom bridges to endless Alpha chipset approaches, Digital Semiconductor
led much of the charge in high-throughput system designs. The up
to 16-node Digital EV6 switched northbridge design of Slot-A, Slot-B,
and Socket-462 for the Digital Alpha 21264 (and the less expensive
AMD Athlon MP) was an improvement over the traditional shared bus
northbridge of Intel PC systems. However, it still did not address
the issues of a single point of contention -- the northbridge.
Upon the sell-a-thon of former CEO Palmer, Digital saw a mass exodus
of its Semiconductor Division's engineers to two organizations
-- Alpha Processor Inc. (API, later known as API Networks and
recently purchased outright by AMD) and AMD itself.
It was no secret that API Networks and AMD were working on a new
way to approach interconnect -- one to compete with Intel's
IA-64 Itanium. The original Lightning Data Transport (now known
as HyperTransport) was designed to address:
- A universal, "glue-less" interconnect for CPUs.
- A scalable-speed, "glue-less" interconnect for bridges.
HyperTransport significantly reduces engineering design and cost
and turns a previous, custom, high-end, redesign-intensive concept
for engineers into a glue-less, place and route, platform-agnostic
technology. This approach is so universal and ideal for system interconnect
that nVidia and SiS use it in their chipsets for Intel Pentium 4
and Xeon platforms instead of the more I/O-centric PCI signaling.
A HyperTransport bridge or tunnel can be attached anywhere HyperTransport
is offered. Adding AGP, PCI-X, PCI, or even the forthcoming PCI-Express
to any mainboard is literally a snap for engineers and actually
reduces some of the pin- and wire-trace count over equivalent, legacy
approaches in RISC or high-end Xeon systems.
In fact, the traditional PC print and Internet media continues
to trip over itself trying to predict whether and when AMD's
x86-64 will be adopted by Apple, IBM, Sun, and other hardware vendors.
This is a testament to how interconnect is so wholly undervalued,
because Apple, IBM, Sun, and many other vendors licensed HyperTransport
for their non-x86 designs, not so much x86-64 itself (although Sun
is porting Solaris to x86-64). HyperTransport solves issues for
non-PC platforms as much as for the PC.
But while HyperTransport addresses design cost, it does not solve
system and I/O interconnect on its own. HyperTransport needed a
flagship platform. Not surprisingly, AMD had one in mind.
Opteron: NUMA and HyperTransport
There is no such thing as an FSB in Athlon 64, much less Opteron.
With the introduction of the Athlon64/Opteron, FSB becomes an extremely
poor measure of interconnect. All AMD x86-64 processors have no
fewer than two (and up to five) independent system interconnects on each
individual CPU. The points of system interconnect on Socket-754,
939, and 940, as well as Socket-478, 603/604, and LGA-775 are illustrated
in Figure 4.
The Athlon 64 and Opteron completely move the traditional concept
of a northbridge into the CPU itself. Any PC-enthusiast review referencing
a 600 (1200 mega-transfer, 1200MT), 800 (1600MT), or 1000MHz
(2000MT) "FSB" in an Athlon 64 configuration is merely
referencing one of the two or more points of connection on
the Athlon64/Opteron. Even single-processor Athlon 64 processors
have one or more dedicated memory channels separate from the HyperTransport
I/O.
The key to Opteron 200/800 scalability is in the additional HyperTransport
channels for CPU-to-CPU interconnect. It is not merely a switching
system interconnect versus a bus (as Alpha/Athlon EV6 is to Intel
[A]GTL+); each CPU has direct connections to its own memory, other
CPUs, and then I/O. To draw an analogy to Ethernet: in AGTL+ (Xeon
MP) systems, CPUs, memory, and I/O are all connected
through the same Ethernet hub; EV6 (Athlon MP) is like having an
Ethernet switch; and NUMA/HyperTransport (Opteron) is like having
direct Ethernet connections between nodes (no hub or switch at all).
The memory connections are glue-less, 184-pin DDR SDRAM
channels wired directly to the chip. There is one DDR channel
in Socket-754 and two DDR channels in Socket-939/940 (noting that
754+184=938, there are two true, full, non-multiplexed DDR buses
in the case of Socket-939/940). There are several advantages to
this. One is the fact that memory speed no longer limits the performance
of any other interconnect. In other words, the memory clock is its
own, independent divider of the local CPU clock -- separate
from all others, including I/O. Using DDR266/PC2100 (2.1 GBps/channel)
instead of DDR400/PC3200 (3.2 GBps/channel) has no impact on the
CPU speed or HyperTransport clocks. And, it gets better.
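As an aside, the per-channel figures quoted above are simple
width-times-clock arithmetic; this small sketch (peak numbers only,
since sustained throughput is always lower) reproduces them:

/* Peak DDR bandwidth = bus width in bytes x effective (DDR) clock. */
#include <stdio.h>

static double ddr_peak_mbps(int width_bits, int effective_mhz)
{
    return (width_bits / 8.0) * effective_mhz;   /* MBps per channel */
}

int main(void)
{
    printf("DDR266/PC2100: %4.0f MBps (~2.1 GBps) per 64-bit channel\n",
           ddr_peak_mbps(64, 266));
    printf("DDR400/PC3200: %4.0f MBps (~3.2 GBps) per 64-bit channel\n",
           ddr_peak_mbps(64, 400));
    printf("Dual-channel Socket-940 DDR400: ~%.1f GBps per Opteron\n",
           2 * ddr_peak_mbps(64, 400) / 1000.0);
    return 0;
}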
In proprietary NUMA Xeon MP, if a program on one CPU needs to
access the memory located on another, it must use the shared, old-style
FSB between CPUs. Not only can just one CPU access that bus at
a time, but the transfer prevents any other -- CPU to/from memory
to/from I/O -- from occurring. In HyperTransport Opteron, each
CPU has a dedicated connection to another CPU (or within one hop
of another, in the case of Opteron 800), so there is no (or limited)
contention with other memory, CPU, or I/O transfers (see Figure
5).
Opteron: I/O MMU
The story of the Opteron's I/O Memory Management Unit (MMU) begins
with EV6 and the 32-bit Athlon MP. EV6, designed for the 64-bit Alpha,
uses 40-bit physical addressing. Unlike Intel AGTL+, which is a "bus",
EV6 is a point-to-point (up to 16 node) interconnect. As such, to
maintain consistency with select I/O devices (e.g., AGP), AMD had
to move I/O memory management out of the northbridge and onto each
processor. The result was that each 32-bit Athlon could theoretically
and safely handle not only memory addressing but memory-mapped I/O
up to 40 bits (1 TB) -- a capability even implemented by a few
mainboard BIOSes with a "Linux" memory-access option.
Opteron merely extends this to 48-bit, the largest size compatible
with the 256 TB address space of the i486's Translation Lookaside
Buffer (TLB), with a formal I/O MMU on each Opteron. Opterons directly
talk to each other over dedicated, HyperTransport links with all
memory or I/O localized to each Opteron. So maintaining consistency
with the cache, memory, or memory-mapped I/O of one Opteron is inherently
accommodated in the access from another. Best of all, any transfers
from any I/O to memory are local to the specific Opteron to which
they are connected. Thus, the concept of "processor affinity"
under an Opteron is used not only for executing processes and memory,
but for localizing I/O transfers between memory and the memory mapped
I/O -- the ultimate in server I/O efficiency and performance.
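For reference, the addressing limits in this section are simple powers
of two, as this trivial check shows:

/* 40-bit physical addressing reaches 1 TB; 48-bit reaches 256 TB. */
#include <stdio.h>

int main(void)
{
    const unsigned long long TB = 1ULL << 40;
    printf("40-bit: %3llu TB\n", (1ULL << 40) / TB);   /*   1 TB */
    printf("48-bit: %3llu TB\n", (1ULL << 48) / TB);   /* 256 TB */
    return 0;
}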
Intel AGTL+, on the other hand, was designed only for 32-bit processors
with 32-bit addressing, or 36-bit addressing using Physical Address
Extension (PAE). Some AGTL+ platforms do not even offer 36-bit
PAE and are physically limited to 32-bit. In IA-32e processors (Intel's
64-bit extensions), memory-mapped I/O addressing is only protected
and guaranteed in hardware below 4 GB. IA-32e also
lacks a true I/O MMU, so it does not implement the I/O MMU capabilities
found in AMD's 64-bit processors. Under Linux/x86-64 (among
other x86-64 kernels), a "software I/O MMU" has been adopted
for IA-32e platforms to make up for this deficiency versus the AMD
x86-64 implementation.
Until IA-32e switches away from Socket-604 (which Intel has hinted
at doing by merging with IA-64/Itanium's Infiniband interconnect
in 2006+), this is a hard, physical limitation of the platform itself.
Thus, all current and near-future IA-32e servers should not be considered
for I/O-intensive applications when the solution calls for more
than 4 GB of physical RAM (combined system and memory-mapped I/O).
Evaluating Opteron Solutions
Now that I've shown that the Opteron is the greatest thing
since sliced bread, it's time to talk about evaluating Opteron
solutions. OEMs often cut corners in products, and the Opteron is
no exception. In fact, the Opteron is so flexible, it can actually
be designed to perform worse than Intel solutions. Figure 6 shows
an example of a very common, low-cost, dual-Opteron mainboard where
the OEM has cut corners.
Why might an OEM do this? Traces on a mainboard cost money. Since
each DDR memory channel is no less than 184 pins, each individual
Opteron requires 368 additional traces just for memory (not including
I/O or connections to other CPUs). Opterons can use memory from
other Opterons in the same system, so it's possible for one
Opteron to have no memory attached and to access the memory connected
to another Opteron. Mainboard designs as illustrated in Figure 6
are more common than you might think.
Another corner cut, also illustrated in Figure 6, is the lack of an
AMD8131/8132 dual-PCI-X 1.0/2.0 HyperTransport tunnel. Even with all
that interconnect, most low-cost Opteron mainboards ship with just an
AMD8111 PCI/LPC HyperTransport bridge, which offers only 0.125 GBps of
I/O (excluding video). Most nVidia, SiS, and VIA chipsets have similar
limitations in I/O, rendering the interconnect advantages of Opteron
useless outside of memory performance. In these designs, even though
the second Opteron 200 in a dual-processor system has an open and
available HyperTransport link for I/O, adding the PCI-X tunnel chip
and its traces would increase costs, so OEMs leave it out.
So, although the Opteron offers a commodity, high-performing I/O
platform for server applications at a price point far lower than
RISC, Itanium, and NUMA Xeon MP, the market is also flooded with
cheap Opteron designs available for less than $200. System integrators
and systems administrators alike must be careful to purchase Opteron
200/800 systems with quality mainboard implementations.
Here is a checklist that will help ensure you are purchasing a
high-quality system:
- Does each (not just one) Opteron have 144-bit (dual-DDR ECC)
memory channels? (A quick runtime check is sketched after this list.)
- Does the system offer more than just a single, shared, legacy PCI
bus (e.g., at least one AMD8131/8132 HyperTransport dual-PCI-X 1.0/2.0
tunnel)?
- Are all of the dual-PCI-X HyperTransport tunnels/bridges daisy-chained
off a single CPU, or is each tunnel connected to its own CPU (the
better layout in a 4-way system)?
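On a delivered system, a quick sanity check for the first item is
whether every NUMA node reports local memory. Here is a minimal Linux
sketch (assuming a NUMA-aware kernel and libnuma; build with -lnuma);
a node showing 0 MB either has no DIMMs installed or no memory channels
routed to that Opteron:

/* Report the memory attached to each NUMA node (i.e., each Opteron). */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "Kernel reports no NUMA support\n");
        return 1;
    }

    int max_node = numa_max_node();
    for (int node = 0; node <= max_node; node++) {
        long long free_bytes = 0;
        long long total = numa_node_size64(node, &free_bytes);
        printf("node %d: %lld MB total, %lld MB free\n",
               node, total >> 20, free_bytes >> 20);
    }
    return 0;
}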
The difference in design can mean the difference between having
a system that performs like the AMD reference designs and a system
that offers less than a quality Xeon MP system.
PCI-Express and DDR2
By the time this article is published, two new technologies will
have arrived -- PCI-Express and DDR2 SDRAM. Both are being readily
implemented on the single-processor, consumer Intel LGA-775 platform,
although Socket-604 Xeon MP adoption is forthcoming.
PCI-Express is both a graphics and generic I/O interconnect. It
can co-exist with AGP or replace it. From the standpoint of storage
and network connectivity, each x1 (and rarer x4) slot can provide
0.125 GBps+ of dedicated bandwidth. These are ideal for entry-level,
single-processor servers versus today's single, shared, 0.125-GBps
PCI systems. But multiple PCI-X channels are still recommended for
higher-end systems and I/O. In fact, one major OEM has prototyped
a dual-processor workstation/server that attaches an nVidia PCI-Express
chipset (with x16, x1/x4 slots) to one Opteron, while attaching
the AMD8131 dual-PCI-X tunnel to another. Such is the flexibility
that HyperTransport affords system designers, at a price point that
was previously unheard of in any platform (sub-$500 list).
Sun Microsystems is taking HyperTransport one step further in
a new Opteron workstation line. They use a modular, 6.4-GBps HyperTransport
connector between the top and bottom portions of the mainboard.
The end-user can buy the bottom portion with AGP/PCI-X tunnels today
and replace it with PCI-Express tunnels tomorrow. Meanwhile, Sun's
Opteron 100/200 processor solution already offers as much I/O today
(four PCI-X channels using two AMD8131 tunnels, in addition to the
AGP and legacy PC I/O) as AMD's reference Quad-Opteron 800
design.
DDR2 enthusiasts will be quick to note that Intel is offering
533MHz and, soon, 667MHz JEDEC DDR2 memory module compatibility.
Although the throughput looks good on paper, AGTL+ is still a shared
system interconnect with a synchronous clock. The result is limited
ability to exploit the additional throughput of the new technology,
even on Socket-604 with IA-32e. Although some reviews have argued
that AMD is less flexible because the memory controller is on the
CPU, and not the chipset, there is nothing stopping AMD from releasing
Opterons with DDR2 memory support. AMD has hinted that it will do
so if and when the demand is there -- seemingly, it is waiting for
DDR2 to reach commodity pricing. Unlike shared AGTL+, the CPU and
I/O interconnects in Opteron are not tied to, and are far less affected
by, the memory clock.
Conclusion
AMD's Opteron opens up the previously exclusive realm of
high-performing NUMA and I/O systems design to the masses at a price
point far below a RISC or Itanium product. But, as with any commodity,
OEMs will cut corners and offer less-than-ideal designs, because the
extra traces and additional HyperTransport bridge/tunnel chips increase
mainboard manufacturing costs.
System integrators and systems administrators must be wary of
these products and carefully review specification sheets to verify
that solutions are designed well. In this article, I introduced
these concepts along with a checklist of capabilities to look for
in an Opteron solution, including considerations for forthcoming
PCI-Express and DDR2 systems.
When all else fails, trusting a reputable OEM is always of significant
value. Hewlett-Packard's (HP) ProLiant DL585 is one of the
first Opteron 800 solutions to implement the full AMD reference
design of two AMD8131/8132 dual-PCI-X HyperTransport tunnels. Even
running legacy 32-bit Windows Server and applications, the DL585
is tearing up TPC-C, SAP, and MS Exchange benchmarks, outclassing
other, far more costly 2-, 4-, and even 8-way systems.
Various enthusiast sites have found similar scalability in
2- and 4-way Opteron 200/800 systems with 32-bit Windows Server.
They've reported far higher capabilities with Linux/x86-64,
which fully supports the Opteron's I/O MMU -- something Intel IA-32e
does not offer in Socket-604.
I sincerely hope that the Front Side Bus (FSB) will be dropped as a
measure of interconnect. It assumes only one point of connection
from the CPU to the rest of the system, which is no longer true
even in AMD's "new Duron" -- the Socket-754 Athlon64.
Therefore, FSB is not a valid reference even for consumer desktops,
much less the enterprise Opteron 200/800. Thus, to end this article,
I offer my new term of reference -- Aggregate Front Side Throughput
(AFST), measured in GBps for the best of each individual processor
type in Table 1.
Bryan J. Smith has an educational background in computer architecture
(BSCpE, UCF) and has spent much of his 12-year engineering/IT career
applying systems design principles in the aerospace, civil, educational,
financial, semiconductor, and software engineering industries. For
the past 4 years, he has provided engineering, IT, and training
services to a variety of clients. Bryan and his wife, Lourdes, live
in Orlando. He can be contacted at: b.j.smith@ieee.org. |