An Overview
of IBM p5 Virtualization Features
Ron Jachim
If your management is like mine, they want to see their expensive
computing resources utilized as fully as possible. That goal often
conflicts with the equally common expectation of being able to run
additional applications without purchasing additional hardware. The
management where I work
has requested that we show better utilization. Incorrect figures
abound as to our current utilization, but everyone believes it can
be improved.
One approach is to use VMware or Microsoft Virtual Server to run
multiple virtual servers on a single physical server. Although such
a setup can better utilize an Intel server, there is only so much
you can do on that platform. In this article, I'll describe
the virtualization features of IBM's new POWER5 architecture and
examine how they can help sys admins improve the utilization of
resources.
The POWER5 architecture takes virtualization capabilities previously
available only on mainframes and makes them available to AIX, Linux,
and even i5/OS on the latest pSeries hardware. The newest IBM
hardware is the p5xx series, built on the POWER5 chip. The slightly
older p6xx series hardware, such as the p670 and p690, runs the
POWER4 chip; the numbering is not a typographical error.
Mainframes have given us this sort of capability with logical
partitions (LPARs) for some time, but not all applications run on
z/OS. Modern Unix servers also provide some LPAR capabilities, but
IBM has set a new standard with its POWER5 architecture, making
many "mainframe" features available on its Unix servers and
dramatically improving the virtualization of computing resources,
including CPU, memory, network, and storage.
POWER5 Architecture
One interesting concept is the combination of two processor cores
and their shared 1.92-MB L2 cache onto a single POWER5 chip. The
chip then connects to an external 36-MB L3 cache. This arrangement
can be seen in Figures 1 and 2. AMD and Intel offer similar strategies.
AMD has several offerings of its dual-core Opteron, including models
165-175, 265-275, and 865-875. Intel offers the Intel Pentium Processor
Extreme Edition 840 at the time of this writing.
These units are then packaged in various ways to enable building
larger systems. The Multi-Chip Module (or MCM) combines four chips
(eight processor cores) and their L3 Cache (144 MB total L3 cache)
into a single unit. Two of these MCMs can be tightly coupled into
a book containing 16 processor cores (see Figure 1). This is the
basic building block for the enterprise class p590 and p595 systems.
For those needing less capacity, the POWER5 chip can be mounted
onto a processor card together with its L3 cache chip as a Dual-Chip
Module (or DCM). This technique is used in the entry-level and
midrange systems.
The DCM can also be mounted directly to the system planar (see Figure
2). This is the basic building block for the midrange p520, p550,
and p570 systems.
The p590 and p595 systems come with up to four I/O drawers; each
drawer has two planars (backplanes). Each planar has two U320 SCSI
buses to connect to internal storage as well as three separate PCI-X
buses with a total of ten PCI-X slots available. The two U320 SCSI
interfaces are also connected to the PCI-X bus. Please see Figure
3 for the I/O drawer arrangement. Each planar connects both to the
other planar within its I/O drawer and to other I/O drawers through
RIO (Remote I/O) links. The midrange systems combine processing and
I/O in a single enclosure; they may also add I/O drawers, although
these are scaled-down versions of what is found in the p590 and
p595.
Now that I've described the building blocks, I'll explain how
this architecture can be harnessed. Simultaneous multi-threading
(SMT) is supported along with single threading (ST). With SMT, each
core presents two hardware threads, so the operating system sees
four instruction queues (i.e., four logical processors) per chip.
However, the real advantage of the POWER5 lies in its virtualization
features. The flexibility
of this architecture allows the building of systems ranging from
a single processor to 64 processors. In the sections that follow,
I'll discuss various techniques for virtualization as they apply
to the POWER5 architecture.
In addition to the base POWER5 system, there is a Hardware Management
Console (HMC), running on an Intel-based server, that manages the
individual partitions and resources. It has two Ethernet adapters,
one for
communication with the p5 server and the other for communication
with the outside world.
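Besides its GUI, the HMC also provides a command-line interface,
reachable over ssh, that is handy for scripting. As a rough sketch
(the managed system name p570-prod is just a placeholder, and the
available fields vary slightly between HMC releases):

  # List the partitions on a managed system with their IDs and state
  lssyscfg -r lpar -m p570-prod -F name,lpar_id,state

  # Show the processor resources currently assigned to each partition
  lshwres -r proc -m p570-prod --level lpar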
Just as you have to pay extra for the optional towing package
on a car, IBM gives you the option of purchasing the Advanced POWER
Virtualization feature package. Most of the features below are a
part of this feature set. The Hypervisor is always available. Virtual
Ethernet is available without this feature package, as is the HMC.
SMT, mentioned previously, is also a part of the base hardware.
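On the AIX side, SMT can be examined and toggled with the smtctl
command. This is a quick sketch; the exact output wording varies
by AIX 5.3 maintenance level:

  # Display the current SMT mode and the logical processors it creates
  smtctl

  # Disable SMT immediately, without a reboot
  smtctl -m off -w now

  # Re-enable SMT as of the next boot; run bosboot afterward so the
  # setting is preserved in the boot image
  smtctl -m on -w boot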
The Partition Load Manager (PLM) is a software-based resource
manager. It monitors CPU and memory load and uses a set of threshold-based
policies to dynamically adjust CPU and memory allocations. The PLM
only manages AIX partitions.
POWER5 Hypervisor
The POWER5 Hypervisor is the mechanism that enables much of the
virtualization on p5 servers. It can be thought of as the firmware
that controls a POWER5 system, whether it is an entry-level,
single-CPU system or an enterprise-class 64-processor system. All of the
virtualization techniques discussed here are rooted in the Hypervisor.
The Hypervisor allows access to dynamic micro-partitioning, shared
processor pools, dynamic LPARs, Capacity on Demand (CoD), Virtual
I/O, and Virtual LAN. It provides a level of abstraction between
the processing part of the system (i.e., individual operating system
instances or partitions) and the I/O part of the system (i.e., the
physical hardware that the partitions are using, such as disks or
adapters).
The Hypervisor provides a special interrupt mechanism to allow
sharing of processors by multiple partitions and to allow the Hypervisor
itself to get the processor cycles it needs for its own use. A single
physical processor may be divided into multiple virtual processors,
each with its own instruction queue. The physical processor also
has a special register that accurately measures activity during
time slices, which allows the Hypervisor to decide how to allocate
physical processors to system partitions.
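From within an AIX partition, you can get a feel for how much time
is being spent in the Hypervisor with lparstat; the -h and -H flags
add Hypervisor statistics to its normal output:

  # Summary statistics every 2 seconds for 5 intervals, including the
  # percentage of time spent in the Hypervisor (%hypv)
  lparstat -h 2 5

  # Detailed per-call Hypervisor statistics (hcall counts and times)
  lparstat -H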
VIO Servers
The VIO servers enable I/O virtualization. They are specialized
partitions that run AIX; upon logging onto a VIO server, the
administrator is presented with a restricted version of the Korn
shell as the command-line interface. The admin has access to device
commands, network configuration, security, user management, etc.
Typically, one would want to configure two VIO servers for redundancy.
The VIO servers will be mentioned below where they play a role in
virtualization.
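A few commands from the restricted shell give a quick picture of
what a VIO server is hosting; these are standard VIOS commands,
although the output details depend on the ioslevel:

  # Show the Virtual I/O Server software level
  ioslevel

  # List the physical and virtual adapters this VIO server owns
  lsdev -type adapter
  lsdev -virtual

  # Show how virtual adapters map to physical devices
  lsmap -all          # virtual SCSI mappings
  lsmap -all -net     # shared Ethernet mappings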
Partitioning and Micro-partitioning
With the POWER4 architecture, each partition had one or more whole
CPUs dedicated to it. This allowed partitioning of a system but
also had several limitations. If you had two systems whose
utilization varied in a complementary way, say an OLTP system used
during business hours and a batch system running its major jobs
at midnight, it might make sense to allocate 95% of the CPU to the
OLTP system during the day and 95% to the batch system at night.
Whole-CPU partitions cannot express that kind of fractional, shifting
allocation.
The POWER5 allows you to set up processor pools that can be shared
among partitions. The smallest partition size is 1/10 of a physical
processor, but above that minimum, allocations can be tuned in
increments of 1/100 of a CPU. This is referred to as micro-partitioning.
In other words, the minimum partition size is 1/10 of a CPU, but
you may have one partition that is 11/100 of a CPU and another that
is 89/100 of a CPU. This allocation is performed by the Hypervisor,
which handles the time slice function and allocates the time slices
among the partitions. The HMC provides a GUI that allows you to
specify the Minimum, Maximum, and Desired CPU allocations.
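From inside an AIX partition, lparstat -i reports what the Hypervisor
has actually granted, which is a convenient way to confirm the
settings made on the HMC:

  # Report the partition configuration: entitled capacity, capped or
  # uncapped mode, online virtual CPUs, online memory, and so on
  lparstat -i

An Entitled Capacity of 0.50, for example, means the partition is
guaranteed half of one physical processor.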
You have the option of running in "capped" mode, where the CPU is
allocated as a fixed percentage. In this mode, if there is idle
capacity in one partition, other partitions are unable to use that
capacity if doing so would exceed their own entitlement. In uncapped
mode, that excess capacity can be allocated to other partitions
based upon the weights of the partitions. A weight is a number
between 0 and 255, with the default being 128. A weight of zero
effectively makes the partition capped.
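The weight can be adjusted on the fly from the HMC command line.
The exact syntax depends on the HMC release, but on the HMCs I have
used it looks roughly like this (the managed system name p570-prod
and partition name web01 are placeholders):

  # Raise the uncapped weight of partition web01 so it receives a
  # larger share of any excess capacity in the shared pool
  chhwres -r proc -m p570-prod -o s -p web01 -a "uncap_weight=200"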
Virtual processors are used to present a partition's processing
power to the operating system. In short, there is one virtual
processor for every whole processing unit or fraction thereof. There
is a maximum of
64 virtual processors per partition, though that may change over
time. You also have the option of dedicating whole processors to
specific partitions. Shared and dedicated CPUs may not be mixed
in a single partition.
It is important to note that the partitions might run AIX, SUSE
Linux Enterprise Server, Red Hat Enterprise Linux, or even i5/OS
(OS/400). This ability to partition a system exists at the hardware/firmware
level and is available when the partitions are initialized. Some
of the virtualization features that follow may be available in AIX,
but others transcend the operating system.
Dynamic LPARs were introduced with AIX 5L Version 5.2 and are
available in that version of AIX and later. Whereas the partitioning
described previously occurs at a hardware/firmware level with the
Hypervisor and is controlled by the HMC, dynamic LPARs can be
fine-tuned while running. A dynamic LPAR consists of one or more
dedicated processors, at least 256 MB of memory, and PCI adapter
slots. Dynamic LPARs may have their memory tuned on the fly in 16-MB
chunks. They can
also have both virtual adapters and physical I/O adapter slots added
and removed on the fly.
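These dynamic operations can be driven from the HMC command line
as well as its GUI. As a sketch (again, p570-prod and web01 are
placeholder names):

  # Add 256 MB of memory to partition web01 while it is running
  chhwres -r mem -m p570-prod -o a -p web01 -q 256

  # Remove the same amount again
  chhwres -r mem -m p570-prod -o r -p web01 -q 256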
As you might imagine, there are a number of limitations with partitioning.
There are physical limits as to the maximum number of partitions
(254), virtual processors in a partition (64), etc. These specifications
can be found in IBM Redbooks. More importantly, as with all performance
tuning there is a fine balancing act between the number of partitions
and their associated virtual processors and the performance to be
gained from partitioning. Virtual processors incur a slight dispatch
latency because they must be scheduled; the worst-case dispatch
latency is 18 ms. To reduce this, you can increase the processor
entitlement, which minimizes the number of virtual processors.
As can be inferred from the above discussion, increasing the utilization
of a system may increase the amount of systems administration (read:
labor) required to keep those systems running efficiently.
Virtual Ethernet
There are two important concepts in virtualizing network connections:
the virtual Ethernet adapter, discussed here, and the Shared Ethernet
Adapter, discussed in the next section.
Virtual Ethernet allows two partitions on the same system to communicate
at Gigabit Ethernet speeds without using a network adapter at all.
Let's say you have one Web server that talks with one WebSphere
(J2EE) Application Server. This WebSphere server talks with one
Oracle database server. If you're in a factory environment, for
example, and you want to support all three tiers on a single system,
you begin by subdividing the system into three partitions. Now the
Web server needs to talk to Web clients, so it needs external network
connectivity.
The WebSphere server, however, only talks to the Web server and
the database server; so, in theory, it does not need to be connected
to an external network. Instead, it can be configured with a virtual
Ethernet adapter. Similarly, the database server only talks to the
WebSphere server. Thus, it need not communicate outside the system
either and could also be handled with a virtual Ethernet adapter.
From a practical point of view, you still need the ability to log
into the servers, so an externally facing network adapter may still
be appropriate.
This functionality is handled by VLANs. In this example, a VLAN
would be set up between the virtual Ethernet adapter in the Web
server and one of the virtual Ethernet adapters in the WebSphere
server. A second VLAN would be set up between the other virtual
Ethernet adapter in the WebSphere server and the virtual Ethernet
adapter in the Oracle server. VLANs provide some measure of traffic
separation between the different communication channels.
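Once a virtual Ethernet adapter has been defined for a partition
on the HMC, it appears in AIX as an ordinary entX device and is
configured like any physical adapter. For example (the device name,
hostname, and addresses below are illustrative):

  # Confirm that ent1 is a virtual adapter
  lsdev -Cc adapter | grep ent
  entstat -d ent1 | grep -i virtual

  # Configure an IP address on the matching en1 interface
  mktcpip -h websphere01 -a 192.168.10.20 -m 255.255.255.0 -i en1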
There are several advantages to this sort of approach. You can
reduce the amount of network infrastructure necessary to support
your partitions. Rather than having separate physical NICs in each
partition, you can allocate the physical NICs as they are needed
and use virtual Ethernet adapters for everything else. Also, by
reducing the number of physical adapters, you reduce the number
of switch ports and fiber/copper runs in the data center, because
these are handled internally to the POWER5 system.
As with everything, there are tradeoffs. Because you reduce the
number of NICs, the work those adapters would have done is handled
in software instead, so you will encounter an increase in CPU
utilization with this setup. However, this is unlikely to be a
significant factor in most cases.
Shared Ethernet Adapter
In contrast with the virtual Ethernet adapter, the shared Ethernet
adapter provides the partitions with the means of sharing one physical
adapter to enable communication outside of the system. The primary
advantage here is that you don't need a physical adapter for every
partition.
This is accomplished by using a virtual Ethernet adapter on each
of the partitions that will share a physical Ethernet adapter. These
partitions communicate with one or two special partitions called
VIO servers. These VIO servers host the physical adapters and have
virtual Ethernet adapters to connect with each of the other partitions
using them. This technique still depends on virtual Ethernet adapters
in the partitions sharing an adapter. As a result, shared Ethernet
may also affect CPU utilization.
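On the VIO server, the Shared Ethernet Adapter is created by bridging
the physical adapter to the virtual trunk adapter defined on the
HMC. Assuming ent0 is the physical adapter and ent2 is the virtual
adapter (device names and the default VLAN ID will differ on your
system):

  # Bridge physical ent0 and virtual ent2 into a Shared Ethernet Adapter;
  # untagged traffic is carried on the default VLAN ID of 1
  mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1

  # Verify the new network mapping
  lsmap -all -net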
You can also use Link Aggregation (EtherChannel) to provide even
higher bandwidth between a system and a switch by sharing several
physical NICs in such a way that they appear to the system as a
single logical NIC.
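Link Aggregation is also configured on the VIO server, and the
aggregated device can then serve as the physical side of a Shared
Ethernet Adapter. A sketch, assuming ent0 and ent1 are the physical
adapters and the switch ports are configured for 802.3ad:

  # Combine ent0 and ent1 into a single logical adapter using 802.3ad LACP
  mkvdev -lnagg ent0,ent1 -attr mode=8023ad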
Virtual SCSI and Fibre Channel
The IBM documentation refers to virtual SCSI, but the same mechanism
also covers Fibre Channel-attached storage. In effect, virtual SCSI
is the same concept as the shared Ethernet adapter (not the virtual
Ethernet adapter): physical resources owned by a VIO server are
presented to client partitions.
In addition to the ability to share physical adapters, this feature
also allows non-AIX partitions (say Linux partitions) to communicate
with storage that would otherwise only support connectivity to AIX
systems. It also permits physical disks associated with a VIO server
to be accessed by multiple partitions, in whole or in part, through
the use of the logical volume manager.
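On the VIO server, a physical disk (or a logical volume carved from
one) is exported to a client partition by mapping it to the virtual
SCSI server adapter that faces that client. Assuming hdisk2 is the
disk and vhost0 is the server adapter (both names are placeholders):

  # Map all of hdisk2 to the client behind vhost0, naming the
  # virtual target device vtscsi0
  mkvdev -vdev hdisk2 -vadapter vhost0 -dev vtscsi0

  # Confirm the mapping; the client partition will see a new hdisk
  lsmap -vadapter vhost0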
This feature allows both disks and adapters to be shared. It offers
connectivity to parallel SCSI, SCSI RAID, or Fibre Channel disks,
but not to SSA, CD-ROM, or tape. The
SCSI protocol contains mandatory and optional commands. While all
mandatory commands are supported, the optional commands may not
be. As with shared and virtual Ethernet, performance will be a consideration.
Although boot disks and Web servers may be appropriate applications
because they tend to cache data, the heavy I/O demands of a database
server may make this an inappropriate solution. (See Figure 4 and
sidebar for diagram and additional explanation of virtualization
features.)
Conclusion
This has been a very quick introduction to the architecture and
virtualization features available on the IBM p5 line of servers.
Figure 4 gives a graphical summary of many of the virtualization
features discussed in this article. Obviously, I've provided just
a brief overview, but I hope this information will encourage you
to further explore the convergence of Unix and mainframe systems.
I've been in the computer field long enough that I learned IBM 360/370
assembly language programming in college, and I find it fascinating
that the two previously competing lines of hardware are converging.
You should not read this article and think that IBM is the only
company making strides in virtualization. AMD, HP, Intel, and Sun
are also continuing to make improvements in this area. The more
important point is that virtualization may offer you the ability
to better manage and utilize computing resources.
Resources
If you are unfamiliar with IBM's Redbooks, go to:
http://www.redbooks.ibm.com
I drew mostly from the first two resources listed below. I had a draft
of the book on Performance Considerations, but it was pulled from
the Web site so I was reluctant to use it in this article. The others
are about specific p5 server implementations:
Adra, Bill, Annika Blank, Mariusz Gieparda, Joachim Haust, Oliver
Stadler, and Doug Szerdi. October, 2004. Advanced POWER Virtualization
on IBM eServer p5 Servers: Introduction and Basic Configuration.
IBM Redbook. ISBN: 0738490814.
Anselmi, Giuliano, Gregor Linzmeier, Wolfgang Seiwald, Philippe
Vandamme, and Scott Vetter. October, 2004. IBM eServer p5 520
Technical Overview and Introduction. IBM Redbook: REDP-9111-01.
Anselmi, Giuliano, Gregor Linzmeier, Wolfgang Seiwald, Philippe
Vandamme, and Scott Vetter. October, 2004. IBM eServer p5 550
Technical Overview and Introduction. IBM Redbook: REDP-9113-01.
Anselmi, Giuliano, Gregor Linzmeier, Wolfgang Seiwald, and Philippe
Vandamme. July, 2004. IBM eServer p5 570 Technical Overview and
Introduction. IBM Redbook: REDP-9117-00.
Dornberg, Peter, Nia Kelley, TaiJung Kim, and Ding Wei. March,
2005. IBM eServer p5 590 and 595 System Handbook. IBM Redbook.
ISBN: 0738490547.
Gibbs, Ben, Frank Berres, Lancelot Castillo, Pedro Coelho, Cesar
Diniz Maciel, and Ravikiran Thirumalai. August, 2004 (Draft). IBM
eServer p5 Virtualization Performance Considerations. IBM Redbook.
Irving, Nic, Mathew Jenner, and Arsi Kortesniemi. February, 2005.
Partitioning Implementations for IBM eServer p5 Servers.
IBM Redbook. ISBN: 0738492140.
Ron Jachim has an M.S. in Computer Science and 20 years of
systems experience extending from PC Tech to DBA to Unix and network
administration. He worked in management of those areas for more
than 7 years. Ron is currently a systems engineer for Fast Switch
Ltd. on the Ford Motor Company account doing architectural work
in the Linux/AIX/Oracle/J2EE space. He is also Adjunct Faculty at
Wayne State University where he teaches Linux and Network Administration.
He enjoys the technical challenge of his current positions but hopes
to return to IT management again someday. He can be reached at:
Ron_Jachim@Hotmail.com. |