An Overview of IBM p5 Virtualization Features

Ron Jachim

If your management is like mine, they want to see their expensive computing resources utilized as fully as possible. This often contradicts the common expectation of being able to run additional applications without purchasing additional hardware. The management where I work has requested that we show better utilization. Incorrect figures abound as to our current utilization, but everyone believes it can be improved.

One approach is to use VMware or Microsoft Virtual Server to run multiple virtual servers on a single physical server. Although such a setup can better utilize an Intel server, there is only so much you can do with an Intel server. In this article, I'll describe the virtualization features of IBM's new POWER5 architecture and examine how they can help sys admins improve the utilization of resources.

The POWER5 architecture takes virtualization capabilities previously available only on mainframes and makes them available to AIX, Linux, and even i5/OS on the latest pSeries hardware. The latest series of IBM hardware is the p5xx series using the POWER5 chip. The slightly older p6xx series hardware, such as the p670 and p695, runs the POWER4 chip; this is not a typographical error.

Mainframes have given us this sort of capability with logical partitions (LPARs) for some time, but not all applications run on z/OS. Modern Unix servers also provide some LPAR capabilities, but IBM has set the new standard with its POWER5 architecture, which brings many "mainframe" features to Unix servers and dramatically improves virtualization of computing resources, including CPU, memory, network, and storage.

POWER5 Architecture

One interesting concept is the combination of two processor cores and their shared 1.92-MB L2 cache on a single POWER5 chip. The chip then connects to an external 36-MB L3 cache. This arrangement can be seen in Figures 1 and 2. AMD and Intel offer similar dual-core strategies: AMD has several dual-core Opteron offerings, including models 165-175, 265-275, and 865-875, and Intel offers the Pentium Processor Extreme Edition 840 at the time of this writing.

These units are then packaged in various ways to enable building larger systems. The Multi-Chip Module (or MCM) combines four chips (eight processor cores) and their L3 Cache (144 MB total L3 cache) into a single unit. Two of these MCMs can be tightly coupled into a book containing 16 processor cores (see Figure 1). This is the basic building block for the enterprise class p590 and p595 systems.

For those needing less capacity, the p5 processor can be mounted onto a processor card with its L3 cache as a Dual-Chip Module (or DCM). This technique is used in the entry and midrange systems. The DCM can also be mounted directly to the system planar (see Figure 2). This is the basic building block for the midrange p520, p550, and p570 systems.

The p590 and p595 systems come with up to four I/O drawers; each drawer has two planars (backplanes). Each planar has two U320 SCSI buses to connect to internal storage, as well as three separate PCI-X buses with a total of ten PCI-X slots available. The two U320 SCSI interfaces are also connected to the PCI-X bus. Please see Figure 3 for the I/O drawer arrangement. Each planar connects to the other planars, both within its own I/O drawer and in other I/O drawers, through Remote I/O (RIO) links. The midrange systems combine processing and I/O into a single enclosure and may also add I/O drawers, although these are scaled-down versions of what is found in the p590 and p595.

Now that I've described the building blocks, I'll explain how this architecture can be harnessed. Simultaneous multi-threading (SMT) is supported along with single threading (ST). With SMT, the operating system sees four instruction queues (i.e., four processors) per chip. However, the real advantage of this architecture comes from its virtualization features. The flexibility of this architecture allows the building of systems ranging from a single processor to 64 processors. In the sections that follow, I'll discuss various techniques for virtualization as they apply to the POWER5 architecture.
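
Before moving on, a rough illustration of how these building blocks and SMT multiply the processor count the operating system sees may help. The short Python sketch below only works through the arithmetic from the figures already given (two cores per chip, four instruction queues per chip with SMT, four chips per MCM, two MCMs per book, up to 64 processors per system); the constants and function names are mine, not IBM's.

# Sketch: logical processors visible to the OS, based on the figures above.
CORES_PER_CHIP = 2        # two cores share the L2 cache on one POWER5 chip
THREADS_PER_CORE = 2      # SMT presents two instruction queues per core (four per chip)
CHIPS_PER_MCM = 4         # Multi-Chip Module: four chips, eight cores
MCMS_PER_BOOK = 2         # a book couples two MCMs, for 16 cores

def logical_cpus(cores, smt_enabled=True):
    """Logical processors the OS sees for a given number of cores."""
    return cores * (THREADS_PER_CORE if smt_enabled else 1)

book_cores = CORES_PER_CHIP * CHIPS_PER_MCM * MCMS_PER_BOOK
print(logical_cpus(book_cores))   # 32 logical CPUs per 16-core book with SMT
print(logical_cpus(64))           # a 64-core enterprise system appears as 128 logical CPUs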

In addition to the base POWER5 system, there is a Hardware Management Console (HMC), running on an Intel server, that manages the individual partitions and resources. It has two Ethernet adapters, one for communication with the p5 server and the other for communication with the outside world.

Just as you have to pay extra for the optional towing package on a car, IBM gives you the option of purchasing the Advanced POWER Virtualization feature package. Most of the features below are a part of this feature set. The Hypervisor is always available. Virtual Ethernet is available without this feature package, as is the HMC. SMT, mentioned previously, is also a part of the base hardware.

The Partition Load Manager (PLM) is a software-based resource manager. It monitors CPU and memory load and uses a set of threshold-based policies to dynamically adjust CPU and memory allocations. The PLM only manages AIX partitions.
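
The Redbooks describe PLM's behavior only at the policy level, but the gist of a threshold-based policy can be sketched in a few lines of Python. The thresholds, partition names, and reallocation step below are hypothetical illustrations of the idea, not PLM's actual algorithm or interface.

# Hypothetical sketch of a threshold-based reallocation policy (not PLM's actual code).
partitions = {
    "oltp":  {"cpu_util": 0.92, "entitlement": 2.0},
    "batch": {"cpu_util": 0.15, "entitlement": 2.0},
}

HIGH, LOW, STEP = 0.80, 0.30, 0.1   # assumed thresholds and adjustment step

def rebalance(parts):
    """Move a small slice of entitlement from an idle donor to a busy recipient."""
    busy = [p for p in parts.values() if p["cpu_util"] > HIGH]
    idle = [p for p in parts.values() if p["cpu_util"] < LOW]
    for recipient, donor in zip(busy, idle):
        donor["entitlement"] -= STEP
        recipient["entitlement"] += STEP

rebalance(partitions)
print(partitions)   # oltp gains 0.1 processing units, batch gives up 0.1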

POWER5 Hypervisor

The POWER5 Hypervisor is the mechanism that enables much of the virtualization on p5 servers. It can be thought of as the firmware that controls a POWER5 system, whether it is an entry-level, single-CPU system or an enterprise-class, 64-processor system. All of the virtualization techniques discussed here are rooted in the Hypervisor.

The Hypervisor allows access to dynamic micro-partitioning, shared processor pools, dynamic LPARs, Capacity on Demand (CoD), Virtual I/O, and Virtual LAN. It provides a level of abstraction between the processing part of the system (i.e., individual operating system instances or partitions) and the I/O part of the system (i.e., the physical hardware that the partitions are using, such as disks or adapters).

The Hypervisor provides a special interrupt mechanism to allow sharing of processors by multiple partitions and to allow the Hypervisor itself to get the processor cycles it needs for its own use. A single physical processor may be divided into multiple virtual processors, each with its own instruction queue. The processor also has a special register to accurately measure activity during time slices, which allows the Hypervisor to decide how to allocate physical processors to partitions.

VIO Servers

The VIO servers enable I/O virtualization. They are specialized partitions that run AIX; upon logging onto a VIO server, the administrator is presented with a restricted version of the Korn shell as a command-line interface, with access to device commands, network configuration, security, user management, and so on. Typically, one would want to configure two VIO servers for redundancy. The VIO servers are mentioned below where they play a role in virtualization.

Partitioning and Micro-partitioning

With the POWER4 architecture, each partition had one or more whole CPUs dedicated to it. This allowed partitioning of a system but also had several limitations. If you had two systems whose utilization varied in a complementary way, say an OLTP system used during business hours and a batch system running its major jobs at midnight, it might make sense to allocate 95% of the CPU to the OLTP system during the day and 95% to the batch system at night. With whole CPUs dedicated to each partition, that kind of fractional, shifting allocation was not possible.

The POWER5 allows you to set up processor pools that can be shared among partitions. The smallest partition size is 1/10 of a physical processor, and above that minimum, allocations can be tuned in increments of 1/100 of a CPU. This is referred to as micro-partitioning. In other words, the minimum partition size is 1/10 of a CPU, but you may have a partition that is 11/100 of a CPU and another that is 89/100 of a CPU. This allocation is performed by the Hypervisor, which handles the time-slice function and allocates the time slices among virtual partitions. The HMC has a GUI that allows you to specify the Minimum, Maximum, and Desired CPU allocations.
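
The allocation rules just described (a 1/10-CPU minimum and 1/100-CPU granularity) are simple enough to check in a short sketch. The Python below is my own illustration of those rules, not an HMC interface.

# Illustration of the micro-partitioning rules described above (not an HMC API).
MIN_UNITS = 0.10        # smallest partition: 1/10 of a physical processor
GRANULARITY = 0.01      # allocations tuned in hundredths of a CPU

def valid_allocation(units):
    """True if a processing-unit allocation obeys the stated rules."""
    hundredths = round(units * 100, 9)
    return units >= MIN_UNITS and hundredths == int(hundredths)

for units in (0.11, 0.89, 0.05, 0.253):
    print(units, valid_allocation(units))
# 0.11 and 0.89 are legal (as in the example above); 0.05 is below the
# 1/10 minimum, and 0.253 is finer than the 1/100 granularity.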

You have the option of running in "capped" mode, where CPU is allocated in fixed percentages. In this mode, if there is idle capacity in one partition, other partitions cannot use that capacity beyond their own entitlements. In uncapped mode, excess capacity can be allocated to other partitions based upon the weights of those partitions. A weight is a number between 0 and 255, with the default being 128. A weight of zero effectively makes the partition capped.
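
The Redbooks describe weight-based sharing only in general terms, but the idea of dividing excess capacity in proportion to partition weights can be sketched as follows; the partition names and numbers are made up for illustration.

# Sketch: distributing idle capacity among uncapped partitions by weight.
# Weights range from 0 to 255 (default 128); a weight of 0 behaves as capped.
def share_excess(excess_units, weights):
    """Split excess processing units in proportion to each partition's weight."""
    uncapped = {name: w for name, w in weights.items() if w > 0}
    total = sum(uncapped.values())
    return {name: excess_units * w / total for name, w in uncapped.items()}

print(share_excess(1.5, {"web": 128, "app": 192, "db": 0}))
# web gets 0.6, app gets 0.9, and db (weight 0, i.e., capped) gets nothing.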

Virtual processors are used to spread the processing power across a partition. In short, there is one virtual processor for every whole processing unit or fraction thereof. There is a maximum of 64 virtual processors per partition, though that may change over time. You also have the option of dedicating whole processors to specific partitions, but shared and dedicated CPUs may not be mixed in a single partition.

It is important to note that the partitions might run AIX or SUSE Linux Enterprise Server or Red Hat Enterprise Linux or even i5/OS (OS/400). This ability to partition a system exists at the hardware/firmware level and is available when the partitions are initialized. Some of the virtualization features that follow may be available in AIX, but others transcend the operating system.

Dynamic LPARs were introduced with AIX 5L Version 5.2 and are available to that version of AIX and higher. Whereas the partitioning described previously occurs at the hardware/firmware level through the Hypervisor and is controlled by the HMC, dynamic LPARs can be fine-tuned while running. A dynamic LPAR consists of one or more dedicated processors, a minimum of 256 MB of memory, and PCI adapter slots. Dynamic LPARs may have their memory tuned on the fly in 16-MB chunks. They can also have both virtual adapters and physical I/O adapter slots added and removed on the fly.
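
Because memory is added and removed in 16-MB chunks, it is easy to check whether a requested change is legal. The helper below is a hypothetical illustration of that rule (and of the assumption that 256 MB is the partition minimum), not a dynamic LPAR interface.

# Hypothetical check of the 16-MB memory granularity described above.
CHUNK_MB = 16
MIN_PARTITION_MB = 256   # assumed partition minimum

def valid_memory_change(current_mb, delta_mb):
    """A change must be a multiple of 16 MB and leave at least 256 MB."""
    new_total = current_mb + delta_mb
    return delta_mb % CHUNK_MB == 0 and new_total >= MIN_PARTITION_MB

print(valid_memory_change(1024, 512))    # True: 512 is a multiple of 16
print(valid_memory_change(1024, 100))    # False: not a 16-MB multiple
print(valid_memory_change(256, -16))     # False: would drop below 256 MB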

As you might imagine, there are a number of limitations with partitioning. There are physical limits as to the maximum number of partitions (254), virtual processors in a partition (64), etc. These specifications can be found in IBM Redbooks. More importantly, as with all performance tuning there is a fine balancing act between the number of partitions and their associated virtual processors and the performance to be gained from partitioning. Virtual processors incur a slight dispatch latency because they are scheduled. The worst case dispatch latency is 18 ms. To avoid this, you can increase the processor entitlement, which will minimize the number of virtual processors.
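
A quick worked example shows why raising the entitlement per virtual processor reduces scheduling overhead: the same two processing units look very different to the Hypervisor's dispatcher when spread across many virtual processors rather than a few. The sketch below is only arithmetic, not a model of the dispatcher itself; the specific numbers are invented.

# Worked example: entitlement per virtual processor for the same 2.0 processing units.
def per_vp_entitlement(processing_units, virtual_processors):
    """Processing units each virtual processor receives, on average."""
    return processing_units / virtual_processors

print(per_vp_entitlement(2.0, 20))   # 0.10: twenty thinly provisioned VPs to dispatch
print(per_vp_entitlement(2.0, 4))    # 0.50: four better-fed VPs, fewer dispatch decisions
# Fewer, fatter virtual processors mean fewer opportunities to hit the
# worst-case 18-ms dispatch latency mentioned above.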

As can be inferred from the above discussion, increasing the utilization of a system may increase the amount of systems administration (read: labor) required to keep those systems running efficiently.

Virtual Ethernet

There are two important concepts in virtualizing network connections. Virtual Ethernet allows two partitions on the same system to communicate at Gigabit Ethernet speeds without using a network adapter at all. Let's say you have one Web server that talks with one WebSphere (J2EE) Application Server, which in turn talks with one Oracle database server. If you're in a factory environment, for example, and you want to support all three tiers on a single system, you begin by subdividing the system into three partitions. The Web server needs to talk to Web clients, so it needs external network connectivity.

The WebSphere server, however, only talks to the Web server and the database server; so, in theory, it does not need to be connected to an external network. Instead, it can be configured with a virtual Ethernet adapter. Similarly, the database server only talks to the WebSphere server. Thus, it need not communicate outside the system either and could also be handled with a virtual Ethernet adapter. From a practical point of view, you still need the ability to log into the servers, so an externally facing network adapter may still be appropriate.

This functionality is handled by VLANs. In this example, a VLAN would be set up between the virtual Ethernet adapter in the Web server and one of the virtual Ethernet adapters in the WebSphere server. A second VLAN would be set up between the other virtual Ethernet adapter in the WebSphere server and the virtual Ethernet adapter in the Oracle server. VLANs provide some measure of traffic separation between the different communication channels.
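
To make the wiring concrete, the three-tier example can be written down as a small topology description. The VLAN IDs, partition names, and adapter name below are invented for illustration; this is not a configuration format used by the HMC or VIO server.

# Hypothetical description of the three-tier virtual Ethernet layout above.
topology = {
    "web": {
        "physical": ["ent0"],           # externally facing adapter for Web clients
        "virtual":  [{"vlan": 10}],     # VLAN 10: Web server <-> WebSphere
    },
    "websphere": {
        "physical": [],                 # no external connectivity required
        "virtual":  [{"vlan": 10}, {"vlan": 20}],
    },
    "oracle": {
        "physical": [],
        "virtual":  [{"vlan": 20}],     # VLAN 20: WebSphere <-> database
    },
}

def peers(vlan_id):
    """Partitions that can talk to each other over a given VLAN."""
    return [name for name, cfg in topology.items()
            if any(v["vlan"] == vlan_id for v in cfg["virtual"])]

print(peers(10))   # ['web', 'websphere']
print(peers(20))   # ['websphere', 'oracle']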

There are several advantages to this sort of approach. You can reduce the amount of network infrastructure necessary to support your partitions. Rather than having separate physical NICs in each partition, you can allocate the physical NICs as they are needed and use virtual Ethernet adapters for everything else. Also, by reducing the number of physical adapters, you reduce the number of switch ports and fiber/copper runs in the data center, because these are handled internally to the POWER5 system.

As with everything, there are tradeoffs. Because you reduce the number of NICs with their associated processing capability, you will encounter an increase in CPU utilization with this setup. However, this is unlikely to be a significant factor in most cases.

Shared Ethernet Adapter

In contrast with the virtual Ethernet adapter, the shared Ethernet adapter provides the partitions with the means of sharing one physical adapter to enable communication outside of the system. The primary advantage here is that you don't need a physical adapter for every partition.

This is accomplished by using a virtual Ethernet adapter on each of the partitions that will share a physical Ethernet adapter. These partitions communicate with one or two special partitions called VIO servers. These VIO servers host the physical adapters and have virtual Ethernet adapters to connect with each of the other partitions using them. This technique still depends on virtual Ethernet adapters in the partitions sharing an adapter. As a result, shared Ethernet may also affect CPU utilization.
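
The mapping behind a shared Ethernet adapter can also be sketched as data: each client partition's virtual adapter is bridged through a VIO server onto the physical adapter it hosts. The structure and names below are a conceptual illustration only, not VIO server configuration syntax.

# Conceptual sketch of shared Ethernet adapter bridging (not VIO server syntax).
shared_ethernet = {
    "vio1": {"physical": "ent0", "bridged_clients": ["web", "app", "db"]},
    "vio2": {"physical": "ent0", "bridged_clients": ["web", "app", "db"]},  # redundant VIO server
}

def external_path(partition):
    """VIO servers that can carry a partition's traffic to the outside world."""
    return [vio for vio, cfg in shared_ethernet.items()
            if partition in cfg["bridged_clients"]]

print(external_path("app"))   # ['vio1', 'vio2']: one physical adapter per VIO server serves all three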

You can also use Link Aggregation (EtherChannel) to provide even higher bandwidth between a system and a switch by sharing several physical NICs in such a way that they appear to the system as a single logical NIC.

Virtual SCSI and Fibre Channel

The IBM documentation refers to virtual SCSI, but this also means virtual Fibre Channel. In effect, Virtual SCSI is the same concept as shared Ethernet adapters (not virtual Ethernet adapters).

In addition to the ability to share physical adapters, this feature also allows non-AIX partitions (say, Linux partitions) to communicate with storage that would otherwise only support connectivity to AIX systems. It also permits physical disks owned by a VIO server to be accessed by multiple partitions, in whole or in part, through the use of a logical volume manager.

This feature provides the ability to share both disks and adapters. It offers connectivity to parallel SCSI, SCSI RAID, or Fibre Channel disks, but not to SSA, CD-ROM, or tape. The SCSI protocol contains mandatory and optional commands; while all mandatory commands are supported, the optional commands may not be. As with shared and virtual Ethernet, performance is a consideration. Boot disks and Web servers may be appropriate applications because they tend to cache data, but the heavy I/O demands of a database server may make this an inappropriate solution. (See Figure 4 and the sidebar for a diagram and additional explanation of the virtualization features.)

Conclusion

This has been a very quick introduction to the architecture and virtualization features available on the IBM p5 line of servers. Figure 4 gives a graphical summary of many of the virtualization features discussed in this article. Obviously, I've provided just a brief overview, but I hope this information will encourage you to further explore the convergence of Unix and mainframe systems. I've been in the computer field long enough that I learned IBM 360/370 assembly language programming in college, and I find it fascinating that the two previously competing lines of hardware are converging.

You should not read this article and think that IBM is the only company making strides in virtualization. AMD, HP, Intel, and Sun are also continuing to make improvements in this area. The more important point is that virtualization may offer you the ability to better manage and utilize computing resources.

Resources

If you are unfamiliar with IBM's Redbooks, go to:

http://www.redbooks.ibm.com

I drew mostly from the first two resources listed below. I had a draft of the book on Performance Considerations, but it was pulled from the Web site so I was reluctant to use it in this article. The others are about specific p5 server implementations:

Adra, Bill, Annika Blank, Mariusz Gieparda, Joachim Haust, Oliver Stadler, and Doug Szerdi. October, 2004. Advanced POWER Virtualization on IBM eServer p5 Servers: Introduction and Basic Configuration. IBM Redbook. ISBN: 0738490814.

Anselmi, Giuliano, Gregor Linzmeier, Wolfgang Seiwald, Philippe Vandamme, and Scott Vetter. October, 2004. IBM eServer p5 520 Technical Overview and Introduction. IBM Redbook: REDP-9111-01.

Anselmi, Giuliano, Gregor Linzmeier, Wolfgang Seiwald, Philippe Vandamme, and Scott Vetter. October, 2004. IBM eServer p5 550 Technical Overview and Introduction. IBM Redbook: REDP-9113-01.

Anselmi, Giuliano, Gregor Linzmeier, Wolfgang Seiwald, and Philippe Vandamme. July, 2004. IBM eServer p5 570 Technical Overview and Introduction. IBM Redbook: REDP-9117-00.

Dornberg, Peter, Nia Kelley, TaiJung Kim, and Ding Wei. March, 2005. IBM eServer p5 590 and 595 System Handbook. IBM Redbook. ISBN: 0738490547.

Gibbs, Ben, Frank Berres, Lancelot Castillo, Pedro Coelho, Cesar Diniz Maciel, and Ravikiran Thirumalai. August, 2004 (Draft). IBM eServer p5 Virtualization Performance Considerations. IBM Redbook.

Irving, Nic, Mathew Jenner, and Arsi Kortesniemi. February, 2005. Partitioning Implementations for IBM eServer p5 Servers. IBM Redbook. ISBN: 0738492140.

Ron Jachim has an M.S. in Computer Science and 20 years of systems experience extending from PC Tech to DBA to Unix and network administration. He worked in management of those areas for more than 7 years. Ron is currently a systems engineer for Fast Switch Ltd. on the Ford Motor Company account doing architectural work in the Linux/AIX/Oracle/J2EE space. He is also Adjunct Faculty at Wayne State University where he teaches Linux and Network Administration. He enjoys the technical challenge of his current positions but hopes to return to IT management again someday. He can be reached at: Ron_Jachim@Hotmail.com.