The
Strange New World of the Solaris 10 Service Management Facility
Peter Baer Galvin
Solaris 10 has many new and innovative features. The Service Management
Facility, however, is particularly different from previous Solaris
releases and is core to systems administration, so it deserves some
scrutiny and attention. The first hint that you are in a new world
is a glance at the /etc/rc* directories. The next clue is that killing
a process such as sendmail no longer works. Where are we and why
are we here? Let's take a look at the Solaris 10 Service Management
Facility (SMF).
The Problem
Before the advent of SMF, a booting Solaris system ran the init
daemon, which parsed the /etc/inittab file, which fired off a series
of run control (rc) scripts, depending on the run level the system
was trying to attain. The default run level was "3", multi-user
mode with networking. The inetd daemon spawned other daemons,
as necessary, to provide network services. And all was good. Or
was it?
Life with init, rc scripts, and inetd was less than
pleasant. To change the parameters of a daemon, for example, you
had to determine where the daemon was started and figure out how
to change the parameters associated with the start method. Changing
an rc script was fraught with peril -- one false move, and the system
would fail to boot properly or even hang during booting. Testing
the rc script change meant rebooting the system. Debugging problems
with rc scripts meant turning on debugging options (such as adding
set -x to the script) and rebooting, perhaps multiple times
as fixes were tried. Consider also that the system booted inefficiently
because it marched through the rc scripts sequentially, even if
some of the activities would have worked correctly if done in parallel.
But perhaps the most unappealing aspect of the whole mess was
the hand-created interdependencies and the ramifications if a dependency
failed. For example, the rc scripts had to start the proper components
in the proper order, such that the network interfaces were initialized
before the routing services started, and all that had to be done
before a network daemon started. If one of those components failed
while the system was running, the results were unpredictable and
the problem difficult to debug.
Overview of SMF
All of these issues drove Sun to design an entirely new service
management facility. SMF is part of the one-two punch of the new
Solaris 10 Predictive Self-Healing feature set. (The other component
is the Fault Management Architecture, which I hope to cover in a
future Solaris Companion article.) SMF understands daemons (or services)
and what to do with them. It understands how to start, stop, and
monitor these services. It understands their relations to one other,
which allows it to boot the operating system to a designated run
level much more efficiently. This understanding of dependency also
allows a new level of service functionality -- if a service fails,
SMF can restart that service and all of the services that depended
on it. Thus SMF can fully restore the system to a given run level,
even if a core service fails.
To provide all of these features, SMF needed to be significantly
different from the "olden days" of rc scripts and inetd daemons.
In the remainder of this column, I will delve into the details of
the new world of SMF. I think you'll agree by the end that although
it's a new world, and different, it's better, and it's worth the
effort of getting to know.
Utility
SMF is enabled by default on Solaris 10, so exploration is as
easy as booting an S10 machine. But, be aware that even boot is
affected by SMF. By default, logging during boot is now very quiet.
With the new boot -m verbose option, SMF outputs a line per
service that it's starting, which can help reassure those new to
S10 that everything is working. Gone, however, are the days of grepping
through /var/adm/messages in hopes of finding an error that it is
actually labeled with the name of the service that is having a problem.
Rather, each service has its own persistent log file. These are
in /var/svc/log for the most part, with pre-single-user milestone
service logs in /etc/svc/volatile. The system reaches the "login"
prompt much quicker now, as only the services depended on by login
need to start before login is started. This is just one example
of the advantages of SMF.
Even better than looking through dedicated log files is the ability
to ask SMF about the state of its world. For example, to get an
overview of all services running on the system:
# svcs
legacy_run 19:15:21 lrc:/etc/rcS_d/S50sk98sol
legacy_run 19:15:39 lrc:/etc/rc2_d/S10lu
legacy_run 19:15:40 lrc:/etc/rc2_d/S20sysetup
. . .
online 19:15:05 svc:/system/svc/restarter:default
online 19:15:10 svc:/milestone/name-services:default
online 19:15:11 svc:/network/pfil:default
online 19:15:12 svc:/network/loopback:default
online 19:15:12 svc:/system/filesystem/root:default
online 19:15:14 svc:/system/filesystem/usr:default
online 19:15:16 svc:/platform/i86pc/eeprom:default
online 19:15:16 svc:/system/keymap:default
online 19:15:16 svc:/system/device/local:default
online 19:15:16 svc:/milestone/devices:default
online 19:15:16 svc:/system/filesystem/minimal:default
online 19:15:18 svc:/system/coreadm:default
online 19:15:18 svc:/application/print/cleanup:default
. . .
offline 19:15:10 svc:/application/print/ipp-listener:default
offline 19:15:31 svc:/application/print/rfc1179:default
For information on services that have failed to start:
# svcs -x
svc:/application/print/server:default (LP print server)
State: disabled since Tue Mar 29 19:15:08 2005
Reason: Disabled by an administrator.
See: http://sun.com/msg/SMF-8000-05
See: lpsched(1M)
Impact: 2 dependent services are not running. (Use -v for list.)
And then for detailed information on the services impacted by that
failure:
# svcs -xv
svc:/application/print/server:default (LP print server)
State: disabled since Tue Mar 29 19:15:08 2005
Reason: Disabled by an administrator.
See: http://sun.com/msg/SMF-8000-05
See: man -M /usr/share/man -s 1M lpsched
Impact: 2 dependent services are not running:
svc:/application/print/rfc1179:default
svc:/application/print/ipp-listener:default
Note that services are described by a descriptor URI string called
a Fault Managed Resource Identifier (FMRI). To name a service, the
FMRI needs to be used. It's easiest to think of FMRIs as postfixes
of the full service name. For example, "svc://application/print/rfc1179"
and "rfc1179" are both legal FMRIs for the same service (noting that
the ":default" can be left off). In the previous example, services
"rfc1179" and "ipp-listener" failed to start because the parent process
was disabled.
Note also that SMF is not yet a complete solution. Those services
under state "legacy_run" are actually rc scripts that have been
started by SMF, but that aren't managed by SMF. If one of them fails,
it stays failed. In fact it might already have failed, but would
still be shown as state "legacy_run" by svcs. If one of the SMF-managed
services fails (for example, sendmail), the SMF restart will attempt
to start it, until it starts or until SMF gives up (based on configuration
information). This "legacy_run" provides backward compatibility
as well; any scripts showing up in any of the rc directories is
treated as it was before SMF. But the best way to integrate services
with Solaris 10 is to use SMF to manage them.
Likewise, inetd.conf is still supported for backward compatibility,
but most of those daemons have been converted to services and can
be managed via SMF. One difference is that any changes to inetd.conf
must be followed by the execution of inetconv to convert
the entry into an SMF service. SMF uses a repository stored under
/var/svc/manifest. Each service consists of a manifest, which is
a text file in XML format. The manifest describes everything that
SMF needs to know about the service (or milestone), such as dependencies,
permissions, and command-line options. Changing these files modifies
the behavior of the service (the next time it is restarted or refreshed).
Managing existing services is almost trivial. To disable sendmail,
for example:
# svcadm disable sendmail
# svcs -v sendmail
STATE NSTATE STIME CTID FMRI
disabled - 19:15:08 - svc:/network/smtp:sendmail
SMF changes also persist between reboots, so you no longer have to
rename rc scripts to disable services. If you want the SMF change
to be temporary, the -t option changes the current state but
not the persistent state.
Note that there are several states that a service can occupy.
"Online" means the service is up and running; "offline" means the
service has not yet start or has failed to start; "disabled" means
it is not eligible to run. To see all services, no matter the state,
svcs -a will do the trick.
Other useful commands include svcadm restart to restart
a service, and svcadm refresh, which causes the service to
reread its configuration file (the old kill -HUP). Details
about these states is presented in svcs(5). Using svcs
-d FMRI will show the named service and all of the services
on which it depends. Meanwhile, the -D option shows the services
that depend on the one named.
Milestones were mentioned above but need further description.
A milestone replaces the traditional "run level" in describing the
state of the system and allowing the state to be changed. For example,
the system by default boots to milestone "multi-user", and SMF knows
all of the services that must start for the system to be multi-user
ready. If the system cannot find those services, it does not reach
that milestone. svcadm milestone FMRI transitions the system
to the named milestone, starting or stopping services in the proper
order depending on the actions required to reach the new milestone
from the current one. The desired milestone can also be named via
the boot -m command
Conclusion
Although the new SMF is totally different from the previous boot
and daemon management within Solaris, it includes many welcome changes.
The system boots faster and can recover from errors, such as hardware
failures, that cause services to fail. It allows exact knowledge
of the state of the system and its services, and allows easy management
of those services. Overall, there is a lot to like, with only the
fear of learning something new standing in the way of progress.
Of course, if the new facility isn't learned, causing mayhem within
a Solaris 10 system is a likely outcome. So it's time to role up
your sleeves and make sure you understand the new world before you
are surprised by some new, unknown creature there.
Resources
For more information on SMF, there are documents and forums of
use. BigAdmin is probably the best place to start:
httpd://www.sun.com/bigadmin/content/selfheal
As usual, docs.sun.com contains a wealth of information as
well. Especially of interest is the "System Administration Guide:
Basic Administration". Other information is spread throughout the
systems administration guides.
Peter Baer Galvin (http://www.petergalvin.info) is the
Chief Technologist for Corporate Technologies (www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and
previously wrote Pete's Wicked World, the security column, and Pete's
Super Systems, the systems management column for Unix Insider
(http://www.unixinsider.com). Peter is coauthor of the
Operating Systems Concepts and Applied Operating Systems
Concepts textbooks. As a consultant and trainer, Peter has taught
tutorials and given talks on security and systems administration
worldwide. |