Q&A
with the Solaris 10 Engineers
Peter Baer Galvin
Recently, I had the unusual opportunity to ask Sun's operating
system engineering team a variety of questions related to Solaris
10. I found the answers that they provided to be fascinating. I
hope you do, too.
Thanks to:
- Matthew Ahrens, Solaris Engineer
- Bryan Cantrill, Solaris Kernel Development
- Berny Goodheart, Lead Engineer for Project Janus
- Adam Leventhal, Solaris Kernel Engineer
- Mike Shapiro, Solaris Kernel Development
- Sunay Tripathi, Solaris High Performance Networking
- Andrew Tucker, Solaris Zones Architect
Q I think the world is getting the word
about what a powerful tool DTrace is for debugging and performance
tuning Solaris 10 and applications running there. What is the next
big DTrace feature being planned?
A (Cantrill) Currently,
DTrace does an excellent job instrumenting applications written
in traditional programming languages like C and C++ -- but to use
DTrace most effectively, one must often have at least passing familiarity
with the application being instrumented. We have addressed this
problem in the kernel by introducing providers with stable semantics:
the "io" provider makes available probes relating to I/O, the "sched"
provider makes available probes relating to CPU scheduling, and
so on.
So, the most immediate future direction for DTrace is a mechanism
that will allow applications to export their own providers with
stable semantics. This will allow the system to be instrumented
in ways that reflect the system's semantics, allowing one to tie
together activity in an application (say, the start of a database
transaction) with activity elsewhere in the system (say, the I/O
induced by that transaction) without having to know the implementation
of either the database or the operating system.
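To make this concrete, here is a rough sketch (in C) of what exporting an application-level probe might look like, modeled on the user-level statically defined tracing (USDT) macros in <sys/sdt.h>. The "mydb" provider name, the probe names, and the build steps shown in the comments are illustrative assumptions, not the final interface.

/*
 * Hedged sketch: an application exporting its own DTrace probes via
 * user-level statically defined tracing (USDT). The provider name
 * "mydb" and the probe names are hypothetical. Building a program like
 * this also requires a small provider description (mydb.d) and a
 * "dtrace -G" link step, roughly:
 *
 *   provider mydb {
 *       probe transaction__start(int);
 *       probe transaction__done(int);
 *   };
 *
 *   cc -c myapp.c
 *   dtrace -G -s mydb.d myapp.o
 *   cc -o myapp myapp.o mydb.o
 */
#include <sys/sdt.h>

static void
run_transaction(int txid)
{
    /* Fires as the (hypothetical) database transaction begins. */
    DTRACE_PROBE1(mydb, transaction__start, txid);

    /* ... perform the transaction's reads and writes ... */

    /* Fires as the transaction commits. */
    DTRACE_PROBE1(mydb, transaction__done, txid);
}

int
main(void)
{
    run_transaction(42);
    return (0);
}

With probes like these in place, correlating a transaction's start with the io provider's probes becomes a few lines of D for the administrator, with no knowledge of the database's internals required.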
In the more indefinite future, we want to extend DTrace to be
able to instrument dynamic languages like Java, Perl, Python, PHP,
etc. This is not easy: these languages have not been designed or
implemented with dynamic instrumentation in mind, and the techniques
required to instrument them are often very specific to a particular
language and its run-time environment. Given the high level of language
specificity in any instrumentation, any solution will likely provide
a mechanism for the implementers of these languages to build their
own DTrace providers.
Finally, we want to use the foundation that we have in DTrace
to build novel system visualization tools. We view system visualization
as an area that has been woefully underexplored -- in part for
lack of the depth of instrumentation that DTrace provides. We believe
that many people outside of Sun are going to have interesting ideas
here, so our primary focus is to develop the bindings to allow DTrace
consumers to be implemented in Java or Perl -- languages that allow
for rapid implementation of visualization tools. As long as there
is a portion of the system that cannot be instrumented, or as long
as that instrumentation is thought of as overly tedious, specialized,
or difficult, there is still work to be done on DTrace!
Q What was the motivation for N1
Grid Containers? What are the future features of N1 Grid Containers?
For example, I think it would be especially useful to be able to
store a container separately from the operating system disk, enabling
the movement of containers between systems. Is such a feature possible
and planned?
A (Tucker) The goal of containers
is to make it easy to improve system utilization by running multiple
applications on the same system, with each application isolated
from the rest. Rather than dedicating a single machine for each
application, customers can consolidate multiple workloads onto the
same box and reduce hardware and administration costs. This also
allows them to dynamically adjust resources to accommodate changes
in application requirements and load. We're planning to enhance
this feature by enabling containers to be migrated between systems,
with application data stored on a file server or shared disk. This
will make Solaris an even more flexible and powerful platform for
application deployment.
Q The user community is very excited
about the ZFS feature set, but there seems to be some conflicting
information. Perhaps you can straighten that out for us? Specifically,
can ZFS be used as the boot disk file system? Is it expected that
ZFS will replace UFS within Solaris in the long term? Is it reasonable
to use ZFS for holding database files?
A (Ahrens) Yes, you will
be able to use ZFS as the boot file system. This feature may not
be available in the first release of ZFS, but Sun is aggressively
working to make it available as soon as possible. We expect most
customers will eventually transition to use ZFS as their general-purpose
file system, and we're working to transition all file storage-related
tools within Solaris (e.g., install, boot, system management tools)
to work with ZFS. That said, we also expect to support UFS for the
foreseeable future.
Additionally, ZFS will work well with database files. It has a
number of features that simplify the administration of databases
and enhance their performance. For example, ZFS has unlimited snapshots
and clones, which are a form of writable snapshots. It has end-to-end
data integrity using 64-bit checksums. It also supports different
block sizes for each file, so it can match the file's block size
to the database's block size. Because ZFS is a copy-on-write file
system, it turns a database's random logical writes into sequential
writes to disk.
Q (follow-up) I've heard
from other sources that ZFS may not perform as well as other
file systems for database use. Care to comment on that?
A (Ahrens) We'll have to
wait until ZFS is complete before making any direct performance
comparisons. That said, excellent database performance is one of
our design goals for ZFS. As a copy-on-write file system, ZFS turns
writes to random logical offsets into physically contiguous writes.
Transaction processing often requires databases to perform writes
to random logical offsets, so in this situation ZFS will typically
perform better than traditional, statically laid-out file systems.
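As a rough illustration of that argument, the toy model below (a hedged sketch, not ZFS code) shows how a copy-on-write allocator can turn writes to scattered logical block numbers into a single sequential run of physical blocks; the structures and sizes are invented for the example.

/*
 * Toy model of copy-on-write placement: a "write" to any logical block
 * is never done in place; the new copy of the block simply goes to the
 * next free physical location. A burst of random logical writes thus
 * lands sequentially on disk. Illustration only, not ZFS internals.
 */
#include <stdio.h>

#define NBLOCKS 16

static int blkmap[NBLOCKS];   /* logical block -> physical block */
static int next_free;         /* next unused physical block */

static void
cow_write(int logical)
{
    /* Allocate the next sequential physical block and repoint the
     * logical block at it, leaving the old copy untouched. */
    blkmap[logical] = next_free++;
}

int
main(void)
{
    int random_order[] = { 9, 2, 14, 5, 11 };  /* random logical offsets */
    int i;

    for (i = 0; i < 5; i++)
        cow_write(random_order[i]);

    /* Physical placement is sequential (0..4) despite the random
     * logical order above. */
    for (i = 0; i < 5; i++)
        printf("logical %2d -> physical %d\n",
            random_order[i], blkmap[random_order[i]]);
    return (0);
}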
Q Please explain how users will
be able to run Linux binaries on Solaris 10. Will any special steps
need to be taken, or can a binary be copied to the system and run?
How compatible do you expect Linux binaries to be with this feature?
What do you expect the performance of a Linux binary running on Solaris 10 to be?
A (Goodheart) The new Solaris
10 feature codenamed "Project Janus" implements the required elements
of a Linux application's run-time environment within the Solaris
kernel. As a result, Linux applications can run unmodified on Solaris.
This is an integrated feature of Solaris 10, but it will be turned
off by default. Users can turn on this feature and complete the
required setup using the Sun-supplied configuration and install
tools.
Once the Linux run-time environment is set up on Solaris, the
user simply runs the application installation program to install
it on the system. This step is required because of the two kinds of binaries that exist: statically linked and dynamically linked.
For a statically linked binary, the executable file is a standalone entity and will run simply by copying it to Solaris. Dynamically linked binaries, which are more common, depend on numerous shared libraries that must be in place before the binary can run. The application install process ensures that the required libraries are in place, and it should therefore be used to install Linux applications on Solaris as well.
Any Linux binary or application will run, but one has to remember that Solaris is not Linux. If an application relies on a Linux feature that Solaris does not provide or support, the application will fail for that reason.
The design goal for the Linux feature was to ensure that our implementations of Linux system calls are no more than 5 percent slower than their Solaris counterparts. Overall, then, the Linux feature performs roughly on par with Solaris -- in some cases no more than 5 percent slower -- and in cases where the mapped Solaris call is faster than the native Linux call, it actually performs better than Linux.
Q 10-Gb Ethernet seems to be the
next generation networking solution. How does FireEngine position
Solaris to be able to integrate with and take advantage of 10-Gb
Ethernet?
A (Tripathi) We are able to virtualize a 10-Gb NIC into multiple virtual pipes (based on the number of Rx ring buffers) and then tie each of those smaller-bandwidth virtual NICs to a squeue/CPU pair. Each squeue/CPU pair can control its rate of interrupts and packet arrival, based on its own backlog, by dynamically switching between interrupt and polling mode. This allows us to drive a 10-Gb NIC effectively with lower CPU utilization while maintaining flow affinity to a CPU.
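The heart of that description is the per-squeue decision to stop taking per-packet interrupts once a backlog builds and to fall back to interrupts once it drains. The fragment below is a simplified, hypothetical sketch of that policy in C; the structure, field names, and thresholds are invented for illustration and are not the Solaris implementation.

/*
 * Hypothetical sketch of backlog-driven switching between interrupt
 * and polling mode for one virtual Rx ring bound to a squeue/CPU pair.
 * Names and thresholds are illustrative, not Solaris source.
 */
#include <stdbool.h>

typedef struct squeue {
    int  backlog;   /* packets queued for this squeue/CPU pair */
    bool polling;   /* true: the bound CPU polls the Rx ring */
} squeue_t;

#define HIGH_WATER 128   /* switch to polling above this backlog */
#define LOW_WATER  8     /* switch back to interrupts below this */

static void
squeue_adjust_mode(squeue_t *sqp)
{
    if (!sqp->polling && sqp->backlog > HIGH_WATER) {
        /* Backlog is building: stop per-packet interrupts and let the
         * bound CPU pull packets from the ring at its own rate. */
        sqp->polling = true;
    } else if (sqp->polling && sqp->backlog < LOW_WATER) {
        /* Backlog has drained: re-enable interrupts so an idle ring
         * costs nothing. */
        sqp->polling = false;
    }
}

int
main(void)
{
    squeue_t sq = { 0, false };

    sq.backlog = 200;           /* heavy arrival rate */
    squeue_adjust_mode(&sq);    /* -> polling mode */

    sq.backlog = 2;             /* traffic subsides */
    squeue_adjust_mode(&sq);    /* -> interrupt mode */
    return (0);
}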
Q (follow-up) Can you provide
any performance information about just how fast the new TCP stack
can drive 1-Gb and 10-Gb interfaces?
A (Tripathi) We can drive a 1-Gb NIC with less than 10 percent of a single Opteron processor. We can drive a 10-Gb NIC at close to 7 Gbps (limited by the PCI-X bus) with 45 percent of a two-CPU Sun Fire V20z. The 10-Gb work is still in progress, and we expect to drive a 10-Gb NIC with less than 30 percent of the same Sun Fire V20z by the time Solaris 10 ships.
Q Given the number of users who
have taken advantage of Software Express for Solaris, do you agree
that Solaris 10 is the most tested Sun operating system release
ever? How do you think that will affect the rapidity of Solaris
10 field adoption?
A (Shapiro) I think it's
the most beta-tested release ever of Solaris, without a doubt. This
really is the result of two factors: the success of the Software
Express continuous public early access program and the large number
of compelling, innovative new features you can't find in any other
OS. The success of Solaris Express as the Solaris 10 beta (more
than 500,000 installs and counting) will undoubtedly speed Solaris
10 adoption because it has allowed customers to plan deployments
using Solaris 10 features. It allowed customers to see, for real,
that Solaris 10 isn't hype: these features are real; they are rock
solid, and they deliver real innovation.
Q (follow-up) Can we assume
Software Express will be used for future Solaris releases, as well
as other Sun software releases? For example, will the next release
after Solaris 10 start being available as soon as the Solaris 10
beta finishes?
A (Shapiro) Software Express
is a mechanism for giving customers access to future Sun software
products while they are under development. Our intent has always
been to expand the program beyond Solaris and that is something
you will see in the future. Shortly after the release of Solaris
10, Software Express for Solaris will be available based on the
next Solaris release under development.
Q Solaris 10 has many great new
features. My feeling is that Solaris 10 on Opteron will be a potent
and compelling solution. My worry is that ISVs will be slow to support applications on Solaris 10 x86 and that this will limit its usability and market penetration. What steps is Sun taking to ensure application availability at the Solaris 10 release for both the SPARC and x86 versions?
A (Tripathi) On the contrary,
ISVs and IHVs are working with Sun to support Solaris 10. Take 10-Gb
NIC vendors, for instance -- because of FireEngine's unique architecture, there is a great deal of interest in supporting Solaris 10 and in having their drivers bundled with Solaris. For example, both S2io and Chelsio are working with Sun to optimize their TOE/RDMA technologies for the new high-performance TCP/IP architecture in Solaris, enhancing performance and scalability in compute-intensive server environments.
Similarly, more than 100 ISVs have begun preparing and testing
their applications to support Solaris 10 -- from BEA, BMC, and CA
to Hyperion, Informatica, Sybase, VERITAS, and Oracle. Sun has an
iForce Partner Program/Solaris 10 Early Adoption Program to help
ISVs, IHVs, and development partners adopt the newest features and
technologies in Solaris (http://iforce.sun.com/partners/solaris).
Q (follow-up) Does this interest
from IHVs bode well for having a large set of hardware devices available
for Solaris 10 on Opteron?
A (Tripathi) Absolutely.
If you look at the networking space, we are seeing very heavy IHV interest in supporting 1-Gb and 10-Gb NICs, iSCSI cards, TOE cards, and so on, on Solaris on Opteron. I think Solaris 10 on Opteron will have very good IHV support. Customers are demanding Solaris 10 on Opteron by name, that demand is driving tremendous interest, and IHVs are starting to see us as a volume play on the low end.
Q Clearly, Sun is adding more features
to each Solaris release. What do you feel is the most important
and useful new feature of Solaris 10?
A (Shapiro) I think it's
somewhat unfair to pick just one feature. This release is unlike
any other release of Solaris in that we have never had such a large
collection of new innovations in one release -- DTrace, Containers,
ZFS, Predictive Self-Healing, Process Rights/Least Privilege, Janus,
FireEngine TCP/IP, and a 64-bit OS for x86/AMD systems. If you look
at Solaris 2.0 to Solaris 9, any one of those features would have
been the most important and useful thing to happen in any of those
releases.
A (Tripathi) The answer
to this question really depends on the customers' needs. FireEngine,
the new thread library, and chip multi-threading technology are best suited to addressing price/performance or performance issues. For ISPs,
N1 Grid containers will help manage resources and increase security
through isolation. Similarly for application developers and ISVs,
DTrace is a powerful diagnostic tool to help reduce cost and complexity.
I believe Solaris 10 offers something important for every user.
Q When I talk with Sun users about
the new Solaris 10 features, almost universally they comment "that's
excellent, but why didn't it come out 2 years ago?" Can you discuss
the effort that goes into making a new Solaris release and adding a major new feature to Solaris 10? Also, is the sheer number of
new features in Solaris 10 an indicator that the rate of feature
addition to Solaris is accelerating?
A (Shapiro) It's important
to understand two things here: one is that innovation on the scale
of Solaris 10 features requires more time than people think. Most
of the major S10 features took around 2 to 3 years of design and
engineering, and some of them were actually in the planning stages
years before then. OS development is a long-term endeavor where
we must build and re-build our world in an iterative fashion. Most
of the technologies in Solaris 10 simply could not have been done
without the work done in previous Solaris releases and without having
a highly scalable, highly reliable core kernel infrastructure that
we've been enhancing over the years.
The other is that there are different kinds of features we add over time, ranging from large new subsystems and tools to the long series of extensions that follow them. In Solaris 10, you see a large number of new,
innovative subsystems to address problems that we've been studying
and thinking about for many years. But I don't think that means
that we're now accelerating the number of giant subsystems customers
will see in 2005-6. Rather, it means that we have put into place
the next big pieces of the puzzle and, in doing so, provided the
foundation that will allow a lot of rapid, smaller extensions in
these areas to occur as new subsystems become richer and more interwoven.
I think customers will see many new features following Solaris 10's
initial release, but these will be more about extending and enriching
the big pieces that come with Solaris 10.
Summary
Thanks again to the engineers who contributed this rich and interesting
set of answers to my questions. Gaining such knowledge just whets
my appetite to deploy Solaris 10 and to take advantage of its new
features.
Peter Baer Galvin (http://www.petergalvin.info) is the
Chief Technologist for Corporate Technologies (www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and
previously wrote Pete's Wicked World, the security column, and Pete's
Super Systems, the systems management column for Unix Insider
(http://www.unixinsider.com). Peter is coauthor of the Operating System Concepts and Applied Operating System Concepts textbooks. As a consultant and trainer, Peter has taught tutorials
and given talks on security and systems administration worldwide. |