Q&A with the Solaris™ 10 Engineers

Peter Baer Galvin

Recently, I had the unusual opportunity to ask Sun's operating system engineering team a variety of questions related to Solaris 10. I found the answers that they provided to be fascinating. I hope you do, too.

Thanks to:

  • Matthew Ahrens, Solaris Engineer
  • Bryan Cantrill, Solaris Kernel Development
  • Berny Goodheart, Lead Engineer for Project Janus
  • Adam Leventhal, Solaris Kernel Engineer
  • Mike Shapiro, Solaris Kernel Development
  • Sunay Tripathi, Solaris High Performance Networking
  • Andrew Tucker, Solaris Zones Architect

Q I think the world is getting the word about what a powerful tool DTrace is for debugging and performance tuning Solaris 10 and the applications running on it. What is the next big DTrace feature being planned?

A (Cantrill) Currently, DTrace does an excellent job instrumenting applications written in traditional programming languages like C and C++ -- but to use DTrace most effectively, one must often have at least passing familiarity with the application being instrumented. We have addressed this problem in the kernel by introducing providers with stable semantics: the "io" provider makes available probes relating to I/O, the "sched" provider makes available probes relating to CPU scheduling, and so on.
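
For illustration, one-liners like these use the io and sched providers without any knowledge of kernel internals -- the first counts block I/O requests by the process that issued them, the second counts how often the scheduler puts each process's threads on a CPU:

    # Count block I/O requests by issuing process (io provider)
    dtrace -n 'io:::start { @[execname] = count(); }'

    # Count on-CPU events by process (sched provider)
    dtrace -n 'sched:::on-cpu { @[execname] = count(); }'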

So, the most immediate future direction for DTrace is a mechanism that will allow applications to export their own providers with stable semantics. This will allow the system to be instrumented in ways that reflect the system's semantics, allowing one to tie together activity in an application (say, the start of a database transaction) with activity elsewhere in the system (say, the I/O induced by that transaction) without having to know the implementation of either the database or the operating system.
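
To sketch how that might look (the mydb provider and its probe names below are purely hypothetical -- no such provider ships today), a script could attribute disk I/O to in-flight database transactions roughly as follows:

    # Hypothetical: correlate database transactions with the I/O they induce
    dtrace -n '
      mydb*:::transaction-start { self->in_txn = 1; }
      io:::start /self->in_txn/ { @io[args[1]->dev_statname] = count(); }
      mydb*:::transaction-done  { self->in_txn = 0; }'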

Further in the future, we want to extend DTrace to be able to instrument dynamic languages like Java, Perl, Python, PHP, etc. This is not easy: these languages were not designed or implemented with dynamic instrumentation in mind, and the techniques required to instrument them are often very specific to a particular language and its run-time environment. Given that level of language specificity, any solution will likely provide a mechanism for the implementers of these languages to build their own DTrace providers.

Finally, we want to use the foundation that we have in DTrace to build novel system visualization tools. We view system visualization as an area that has been woefully underexplored -- in part for lack of the depth of instrumentation that DTrace provides. We believe that many people outside of Sun are going to have interesting ideas here, so our primary focus is to develop the bindings to allow DTrace consumers to be implemented in Java or Perl -- languages that allow for rapid implementation of visualization tools. As long as there is a portion of the system that cannot be instrumented, or as long as that instrumentation is thought of as overly tedious, specialized, or difficult, there is still work to be done on DTrace!

Q What was the motivation for N1 Grid Containers? What are the future features of N1 Grid Containers? For example, I think it would be especially useful to be able to store a container separately from the operating system disk, enabling the movement of containers between systems. Is such a feature possible and planned?

A (Tucker) The goal of containers is to make it easy to improve system utilization by running multiple applications on the same system, with each application isolated from the rest. Rather than dedicating a single machine for each application, customers can consolidate multiple workloads onto the same box and reduce hardware and administration costs. This also allows them to dynamically adjust resources to accommodate changes in application requirements and load. We're planning to enhance this feature by enabling containers to be migrated between systems, with application data stored on a file server or shared disk. This will make Solaris an even more flexible and powerful platform for application deployment.
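
A rough sketch of the administrative workflow (the zone name and path below are arbitrary examples): a container is configured with zonecfg, then installed and booted with zoneadm:

    zonecfg -z webzone
    zonecfg:webzone> create
    zonecfg:webzone> set zonepath=/zones/webzone
    zonecfg:webzone> commit
    zonecfg:webzone> exit
    zoneadm -z webzone install
    zoneadm -z webzone boot
    zlogin webzone          # log in to the running zone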

Q The user community is very excited about the ZFS feature set, but there seems to be some conflicting information. Perhaps you can straighten that out for us? Specifically, can ZFS be used as the boot disk file system? Is it expected that ZFS will replace UFS within Solaris in the long term? Is it reasonable to use ZFS for holding database files?

A (Ahrens) Yes, you will be able to use ZFS as the boot file system. This feature may not be available in the first release of ZFS, but Sun is aggressively working to make it available as soon as possible. We expect most customers will eventually transition to use ZFS as their general-purpose file system, and we're working to transition all file storage-related tools within Solaris (e.g., install, boot, system management tools) to work with ZFS. That said, we also expect to support UFS for the foreseeable future.
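
As an illustration of the administrative model (the pool name and device names below are placeholders), creating a pool and a general-purpose file system takes only a few commands:

    zpool create tank mirror c1t0d0 c1t1d0
    zfs create tank/home
    zfs set mountpoint=/export/home tank/home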

Additionally, ZFS will work well with database files. It has a number of features that simplify the administration of databases and enhance their performance. For example, ZFS has unlimited snapshots and clones, which are a form of writable snapshots. It has end-to-end data integrity using 64-bit checksums. It also supports different block sizes for each file, so it can match the file's block size to the database's block size. Because ZFS is a copy-on-write file system, it turns a database's random logical writes into sequential writes to disk.
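
A sketch of what that looks like administratively (dataset and snapshot names are illustrative): the record size can be matched to an 8K database block size, and snapshots and clones are single commands:

    zfs create tank/db
    zfs set recordsize=8k tank/db
    zfs snapshot tank/db@nightly
    zfs clone tank/db@nightly tank/db-test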

Q (follow-up) I've heard from other sources that ZFS may not perform as well as other file systems for database use. Care to comment on that?

A (Ahrens) We'll have to wait until ZFS is complete before making any direct performance comparisons. That said, excellent database performance is one of our design goals for ZFS. As a copy-on-write file system, ZFS turns writes to random logical offsets into physically contiguous writes. Transaction processing often requires databases to perform writes to random logical offsets, so in this situation ZFS will typically perform better than traditional, statically laid-out file systems.

Q Please explain how users will be able to run Linux binaries on Solaris 10. Will any special steps need to be taken, or can a binary simply be copied to the system and run? How compatible do you expect Linux binaries to be with this feature? What performance do you expect from a Linux binary running on Solaris 10?

A (Goodheart) The new Solaris 10 feature codenamed "Project Janus" implements the required elements of a Linux application's run-time environment within the Solaris kernel. As a result, Linux applications can run unmodified on Solaris. This is an integrated feature of Solaris 10, but it will be turned off by default. Users can turn on this feature and complete the required setup using the Sun-supplied configuration and install tools.

Once the Linux run-time environment is set up on Solaris, the user simply runs the application's installation program to install it on the system. This step is required because of the two types of binaries that exist -- statically linked and dynamically linked.

For statically linked binaries, the executable file is a standalone entity and would work just by copying the binary to Solaris. Dynamically linked binaries, which are more common, depend on numerous shared libraries that must be in place before the binary can run. The application's install process ensures that the required libraries are in place, so it should also be used to install the Linux application on Solaris.
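
A quick way to check which kind of binary you are dealing with is the file and ldd utilities (the path below is just an example):

    file /opt/app/bin/myapp     # reports "statically linked" or "dynamically linked"
    ldd /opt/app/bin/myapp      # for a dynamic binary, lists the shared libraries it needs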

Any Linux binary or application can be run, but one has to remember that Solaris is not Linux: if the application relies on Linux features that Solaris does not provide or support, it will fail for that reason.

The design goal in developing the Linux feature was to ensure that each of our implemented Linux system calls is no more than 5 percent slower than its Solaris counterpart. Overall, then, the Linux environment performs roughly on par with Solaris -- at worst no more than 5 percent slower -- and in cases where the mapped Solaris call is faster than the native Linux call, it actually performs better than Linux.

Q 10-Gb Ethernet seems to be the next generation networking solution. How does FireEngine position Solaris to be able to integrate with and take advantage of 10-Gb Ethernet?

A (Tripathi) We are able to virtualize a 10-Gb NIC into multiple virtual pipes (based on the number of Rx ring buffers) and then tie each of those smaller-bandwidth virtual NICs to a squeue/CPU pair. Each squeue/CPU pair is able to control the rate of interrupts and packet arrival based on its own backlog by dynamically switching between interrupt and polling mode. This allows us to effectively drive a 10-Gb NIC with lower CPU utilization while maintaining flow affinity to a CPU.

Q (follow-up) Can you provide any performance information about just how fast the new TCP stack can drive 1-Gb and 10-Gb interfaces?

A (Tripathi) We can drive a 1-Gb NIC with less than 10 percent of a single Opteron processor. We can drive a 10-Gb NIC at close to 7 Gbps (limited by the PCI-X bus) with 45 percent of a 2-CPU Sun Fire V20z. The 10-Gb work is still in progress, and we expect to drive a 10-Gb NIC with less than 30 percent of the same Sun Fire V20z by the time Solaris 10 ships.

Q Given the number of users who have taken advantage of Software Express for Solaris, do you agree that Solaris 10 is the most tested Sun operating system release ever? How do you think that will affect the rapidity of Solaris 10 field adoption?

A (Shapiro) I think it's the most beta-tested release ever of Solaris, without a doubt. This really is the result of two factors: the success of the Software Express continuous public early access program and the large number of compelling, innovative new features you can't find in any other OS. The success of Solaris Express as the Solaris 10 beta (more than 500,000 installs and counting) will undoubtedly speed Solaris 10 adoption because it has allowed customers to plan deployments using Solaris 10 features. It allowed customers to see, for real, that Solaris 10 isn't hype: these features are real, they are rock solid, and they deliver real innovation.

Q (follow-up) Can we assume Software Express will be used for future Solaris releases, as well as other Sun software releases? For example, will the next release after Solaris 10 start being available as soon as the Solaris 10 beta finishes?

A (Shapiro) Software Express is a mechanism for giving customers access to future Sun software products while they are under development. Our intent has always been to expand the program beyond Solaris and that is something you will see in the future. Shortly after the release of Solaris 10, Software Express for Solaris will be available based on the next Solaris release under development.

Q Solaris 10 has many great new features. My feeling is that Solaris 10 on Opteron will be a potent and compelling solution. My worry is that ISVs will be slow to support applications on Solaris 10 x86 and that this will limit its usability and market penetration. What steps is Sun taking to ensure application availability at the release of both the SPARC and x86 versions of Solaris 10?

A (Tripathi) On the contrary, ISVs and IHVs are working with Sun to support Solaris 10. Take 10-Gb NIC vendors, for instance -- because of FireEngine's unique architecture, there is a ton of interest in supporting Solaris 10 and having their drivers bundled with Solaris. For example, both S2io and Chelsio are working with Sun to optimize their TOE/RDMA technologies for the new high-performance TCP/IP architecture in Solaris, to enhance performance and scalability in intense compute/server environments.

Similarly, more than 100 ISVs have begun preparing and testing their applications to support Solaris 10 -- from BEA, BMC, and CA to Hyperion, Informatica, Sybase, VERITAS, and Oracle. Sun has an iForce Partner Program/Solaris 10 Early Adoption Program to help ISVs, IHVs, and development partners adopt the newest features and technologies in Solaris (http://iforce.sun.com/partners/solaris).

Q (follow-up) Does this interest from IHVs bode well for having a large set of hardware devices available for Solaris 10 on Opteron?

A (Tripathi) Absolutely. If you look at the networking space, we are seeing very heavy IHV interest in supporting 1-Gb and 10-Gb NICs, iSCSI cards, TOE cards, etc., on Solaris on Opteron. I think Solaris 10 on Opteron will have very good IHV support. Customers are demanding Solaris 10 on Opteron by name, which is driving tremendous interest, and IHVs are starting to see us as a volume play on the low end.

Q Clearly, Sun is adding more features to each Solaris release. What do you feel is the most important and useful new feature of Solaris 10?

A (Shapiro) I think it's somewhat unfair to pick just one feature. This release is unlike any other release of Solaris in that we have never had such a large collection of new innovations in one release -- DTrace, Containers, ZFS, Predictive Self-Healing, Process Rights/Least Privilege, Janus, FireEngine TCP/IP, and a 64-bit OS for x86/AMD systems. If you look at Solaris 2.0 to Solaris 9, any one of those features would have been the most important and useful thing to happen in any of those releases.

A (Tripathi) The answer to this question really depends on the customers' needs. FireEngine, the new thread library, and chip multi-threading technology are best suited to address price/performance or performance issues. For ISPs, N1 Grid Containers will help manage resources and increase security through isolation. Similarly, for application developers and ISVs, DTrace is a powerful diagnostic tool that helps reduce cost and complexity. I believe Solaris 10 offers something important for every user.

Q When I talk with Sun users about the new Solaris 10 features, almost universally they comment, "That's excellent, but why didn't it come out 2 years ago?" Can you discuss the effort that goes into making a new Solaris release and adding a major new feature to Solaris 10? Also, is the sheer number of new features in Solaris 10 an indicator that the rate of feature addition to Solaris is accelerating?

A (Shapiro) It's important to understand two things here: one is that innovation on the scale of Solaris 10 features requires more time than people think. Most of the major S10 features took around 2 to 3 years of design and engineering, and some of them were actually in the planning stages years before then. OS development is a long-term endeavor where we must build and re-build our world in an iterative fashion. Most of the technologies in Solaris 10 simply could not have been done without the work done in previous Solaris releases and without having a highly scalable, highly reliable core kernel infrastructure that we've been enhancing over the years.

There are different kinds of features that we add over time, varying between large new subsystems and tools and then a long series of extensions to them. In Solaris 10, you see a large number of new, innovative subsystems to address problems that we've been studying and thinking about for many years. But I don't think that means that we're now accelerating the number of giant subsystems customers will see in 2005-6. Rather, it means that we have put into place the next big pieces of the puzzle and, in doing so, provided the foundation that will allow a lot of rapid, smaller extensions in these areas to occur as new subsystems become richer and more interwoven. I think customers will see many new features following Solaris 10's initial release, but these will be more about extending and enriching the big pieces that come with Solaris 10.

Summary

Thanks again to the engineers who contributed this rich and interesting set of answers to my questions. Gaining such knowledge just whets my appetite to deploy Solaris 10 and to take advantage of its new features.

Peter Baer Galvin (http://www.petergalvin.info) is the Chief Technologist for Corporate Technologies (www.cptech.com), a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column, for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating System Concepts and Applied Operating System Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.