Cover V14, i06

The Strange New World of the Solaris 10 Service Management Facility

Peter Baer Galvin

Solaris 10 has many new and innovative features. The Service Management Facility, however, is particularly different from previous Solaris releases and is core to systems administration, so it deserves some scrutiny and attention. The first hint that you are in a new world is a glance at the /etc/rc* directories. The next clue is that killing a process such as sendmail no longer works. Where are we and why are we here? Let's take a look at the Solaris 10 Service Management Facility (SMF).

The Problem

Before the advent of SMF, a booting Solaris system ran the init daemon, which parsed the /etc/inittab file, which fired off a series of run control (rc) scripts, depending on the run level the system was trying to attain. The default run level was "3", multi-user mode with networking. The inetd daemon spawned other daemons, as necessary, to provide network services. And all was good. Or was it?

Life with init, rc scripts, and inetd was less than pleasant. To change the parameters of a daemon, for example, you had to determine where the daemon was started and figure out how to change the parameters associated with the start method. Changing an rc script was fraught with peril -- one false move, and the system would fail to boot properly or even hang during booting. Testing the rc script change meant rebooting the system. Debugging problems with rc scripts meant turning on debugging options (such as adding set -x to the script) and rebooting, perhaps multiple times as fixes were tried. Consider also that the system booted inefficiently because it marched through the rc scripts sequentially, even if some of the activities would have worked correctly if done in parallel.

But perhaps the most unappealing aspect of the whole mess was the hand-created interdependencies and the ramifications if a dependency failed. For example, the rc scripts had to start the proper components in the proper order, such that the network interfaces were initialized before the routing services started, and all that had to be done before a network daemon started. If one of those components failed while the system was running, the results were unpredictable and the problem difficult to debug.

Overview of SMF

All of these issues drove Sun to design an entirely new service management facility. SMF is part of the one-two punch of the new Solaris 10 Predictive Self-Healing feature set. (The other component is the Fault Management Architecture, which I hope to cover in a future Solaris Companion article.) SMF understands daemons (or services) and what to do with them. It understands how to start, stop, and monitor these services. It understands their relations to one another, which allows it to boot the operating system to a designated run level much more efficiently. This understanding of dependency also allows a new level of service functionality -- if a service fails, SMF can restart that service and all of the services that depend on it. Thus SMF can fully restore the system to a given run level, even if a core service fails.

To provide all of these features, SMF needed to be significantly different from the "olden days" of rc scripts and inetd daemons. In the remainder of this column, I will delve into the details of the new world of SMF. I think you'll agree by the end that although it's a new world, and different, it's better, and it's worth the effort of getting to know.

Utility

SMF is enabled by default on Solaris 10, so exploration is as easy as booting an S10 machine. But be aware that even boot is affected by SMF. By default, logging during boot is now very quiet. With the new boot -m verbose option, SMF outputs a line per service that it's starting, which can help reassure those new to S10 that everything is working. Gone, however, are the days of grepping through /var/adm/messages in hopes of finding an error that is actually labeled with the name of the service that is having a problem. Rather, each service has its own persistent log file. These are in /var/svc/log for the most part, with pre-single-user milestone service logs in /etc/svc/volatile. The system reaches the "login" prompt much more quickly now, as only the services depended on by login need to start before login is started. This is just one example of the advantages of SMF.
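
To make the per-service log files easier to find, here is a minimal sketch of a helper that derives a log path from a full FMRI. The function name smf_log_path is hypothetical, and the naming rule it encodes (drop the "svc:/" prefix, turn slashes into dashes, append ".log") is an assumption based on how the files under /var/svc/log are commonly named, so check the result against your own system.

```shell
#!/bin/sh
# Sketch: guess the log file path for an SMF service instance.
# Assumes a full FMRI of the form svc:/category/name:instance and
# the common /var/svc/log naming pattern (an assumption, not a spec).
smf_log_path() {
    fmri=$1
    name=${fmri#svc:/}                          # drop the "svc:/" prefix
    name=$(printf '%s\n' "$name" | tr '/' '-')  # slashes become dashes
    printf '/var/svc/log/%s.log\n' "$name"
}

smf_log_path svc:/network/smtp:sendmail
```

For the sendmail instance this prints /var/svc/log/network-smtp:sendmail.log, which you could then hand to tail or more.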

Even better than looking through dedicated log files is the ability to ask SMF about the state of its world. For example, to get an overview of all services running on the system:

# svcs
legacy_run     19:15:21 lrc:/etc/rcS_d/S50sk98sol
legacy_run     19:15:39 lrc:/etc/rc2_d/S10lu
legacy_run     19:15:40 lrc:/etc/rc2_d/S20sysetup
. . .
online         19:15:05 svc:/system/svc/restarter:default
online         19:15:10 svc:/milestone/name-services:default
online         19:15:11 svc:/network/pfil:default
online         19:15:12 svc:/network/loopback:default
online         19:15:12 svc:/system/filesystem/root:default
online         19:15:14 svc:/system/filesystem/usr:default
online         19:15:16 svc:/platform/i86pc/eeprom:default
online         19:15:16 svc:/system/keymap:default
online         19:15:16 svc:/system/device/local:default
online         19:15:16 svc:/milestone/devices:default
online         19:15:16 svc:/system/filesystem/minimal:default
online         19:15:18 svc:/system/coreadm:default
online         19:15:18 svc:/application/print/cleanup:default
. . .
offline        19:15:10 svc:/application/print/ipp-listener:default
offline        19:15:31 svc:/application/print/rfc1179:default
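
A listing like this gets long on a real machine, so a quick tally by state can be handy. The following is a rough sketch only: count_states is a hypothetical helper, and the sample lines are copied from the listing above so the script runs anywhere. On a live Solaris 10 system you would pipe svcs -a into it instead.

```shell
#!/bin/sh
# Sketch: tally services by state from svcs-style output on stdin.
# Only lines whose third field looks like an FMRI (svc: or lrc:) are
# counted, so header lines and ". . ." elisions are skipped.
count_states() {
    awk '$3 ~ /^(svc|lrc):/ { n[$1]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Sample input taken from the listing above.
count_states <<'EOF'
legacy_run     19:15:21 lrc:/etc/rcS_d/S50sk98sol
online         19:15:05 svc:/system/svc/restarter:default
online         19:15:11 svc:/network/pfil:default
offline        19:15:10 svc:/application/print/ipp-listener:default
EOF
```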

For information on services that have failed to start:

# svcs -x
svc:/application/print/server:default (LP print server)
 State: disabled since Tue Mar 29 19:15:08 2005
Reason: Disabled by an administrator.
   See: http://sun.com/msg/SMF-8000-05
   See: lpsched(1M)
Impact: 2 dependent services are not running.  (Use -v for list.)

And then for detailed information on the services impacted by that failure:

# svcs -xv
svc:/application/print/server:default (LP print server)
 State: disabled since Tue Mar 29 19:15:08 2005
Reason: Disabled by an administrator.
   See: http://sun.com/msg/SMF-8000-05
   See: man -M /usr/share/man -s 1M lpsched
Impact: 2 dependent services are not running:
        svc:/application/print/rfc1179:default
        svc:/application/print/ipp-listener:default

Note that services are named by a URI-style string called a Fault Management Resource Identifier (FMRI). To refer to a service, you use its FMRI. It's easiest to think of abbreviated FMRIs as trailing portions of the full service name. For example, "svc:/application/print/rfc1179:default" and "rfc1179" both name the same service (the ":default" instance can be left off). In the previous example, services "rfc1179" and "ipp-listener" failed to start because the service they depend on, the LP print server, was disabled.
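
To illustrate how abbreviated FMRIs line up with full ones, here is a small sketch that approximates the matching rule. It is not how the SMF tools are implemented; fmri_matches is a hypothetical function, and it assumes (per my reading of svcs(1)) that an abbreviation must be the instance name or a trailing portion of the service name that starts on a "/" boundary. It also assumes the full FMRI includes an instance.

```shell
#!/bin/sh
# Sketch: does an abbreviation plausibly name the given full FMRI?
# Approximates the abbreviation rules; the real tools do their own matching.
fmri_matches() {
    full=$1
    abbr=$2
    rest=${full#svc:/}      # e.g. network/smtp:sendmail
    service=${rest%%:*}     # e.g. network/smtp
    instance=${rest#*:}     # e.g. sendmail
    case $abbr in
        "$full"|"$rest"|"$instance"|":$instance") return 0 ;;
    esac
    # Trailing portion of the service name, on a component boundary.
    case "/$service" in
        */"$abbr") return 0 ;;
    esac
    # Same, with the instance name attached.
    case "/$service:$instance" in
        */"$abbr") return 0 ;;
    esac
    return 1
}

for abbr in sendmail smtp network/smtp mail; do
    if fmri_matches svc:/network/smtp:sendmail "$abbr"; then
        echo "$abbr: matches"
    else
        echo "$abbr: no match"
    fi
done
```

The partial component "mail" is rejected because it does not start on a boundary, while "smtp" and "network/smtp" are accepted.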

Note also that SMF is not yet a complete solution. Those services under state "legacy_run" are actually rc scripts that have been started by SMF, but that aren't managed by SMF. If one of them fails, it stays failed. In fact, it might already have failed, but would still be shown as state "legacy_run" by svcs. If one of the SMF-managed services fails (for example, sendmail), the SMF restarter will attempt to restart it, until it starts or until SMF gives up (based on configuration information). This "legacy_run" state provides backward compatibility as well; any script showing up in any of the rc directories is treated as it was before SMF. But the best way to integrate services with Solaris 10 is to use SMF to manage them.

Likewise, inetd.conf is still supported for backward compatibility, but most of those daemons have been converted to services and can be managed via SMF. One difference is that any changes to inetd.conf must be followed by the execution of inetconv to convert the entry into an SMF service. SMF keeps its service manifests under /var/svc/manifest and imports them into its configuration repository. Each service is described by a manifest, which is a text file in XML format. The manifest describes everything that SMF needs to know about the service (or milestone), such as dependencies, permissions, and command-line options. Changing a manifest modifies the behavior of the service once the change has been imported into the repository (for example, with svccfg import) and the service is next restarted or refreshed.

Managing existing services is almost trivial. To disable sendmail, for example:

# svcadm disable sendmail
# svcs -v sendmail
STATE          NSTATE      STIME    CTID   FMRI
disabled       -           19:15:08      - svc:/network/smtp:sendmail

SMF changes also persist between reboots, so you no longer have to rename rc scripts to disable services. If you want the SMF change to be temporary, the -t option changes the current state but not the persistent state.
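
The difference between a persistent and a temporary change comes down to one flag. As a hedged sketch, the wrapper below builds the svcadm command line either way; svc_toggle and the DRY_RUN variable are hypothetical names of my own, and dry-run mode is on by default so the script only prints what it would execute.

```shell
#!/bin/sh
# Sketch: build a svcadm enable/disable command, optionally temporary.
# With DRY_RUN=1 (the default here) the command is printed, not run.
DRY_RUN=${DRY_RUN:-1}

svc_toggle() {
    action=$1
    fmri=$2
    temporary=${3:-no}
    case $action in
        enable|disable) ;;
        *) echo "usage: svc_toggle enable|disable FMRI [temporary]" >&2
           return 2 ;;
    esac
    if [ "$temporary" = temporary ]; then
        set -- svcadm "$action" -t "$fmri"   # -t: current state only
    else
        set -- svcadm "$action" "$fmri"      # persists across reboots
    fi
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

svc_toggle disable sendmail temporary
svc_toggle enable sendmail
```

Setting DRY_RUN=0 on a Solaris 10 machine would run the commands for real, so try it first with a service you can afford to stop.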

Note that there are several states that a service can occupy. "Online" means the service is up and running; "offline" means the service has not yet started or cannot start (for example, because a service it depends on is not running); "disabled" means it is not eligible to run. To see all services, no matter the state, svcs -a will do the trick.

Other useful commands include svcadm restart, which restarts a service, and svcadm refresh, which causes the service to reread its configuration (the equivalent of the old kill -HUP). Details about service states are presented in smf(5). Using svcs -d FMRI shows the services on which the named service depends. Meanwhile, the -D option shows the services that depend on the one named.

Milestones were mentioned above but need further description. A milestone replaces the traditional "run level" in describing the state of the system and allowing the state to be changed. For example, the system by default boots to milestone "multi-user", and SMF knows all of the services that must start for the system to be multi-user ready. If the system cannot find those services, it does not reach that milestone. svcadm milestone FMRI transitions the system to the named milestone, starting or stopping services in the proper order depending on the actions required to reach the new milestone from the current one. The desired milestone can also be named at boot time via the boot -m option (for example, boot -m milestone=single-user).
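
For administrators who still think in run levels, a small lookup can help with the translation. This is a sketch under stated assumptions: runlevel_to_milestone is a hypothetical helper, and the mapping (S to single-user, 2 to multi-user, 3 to multi-user-server) reflects my understanding of the Solaris documentation rather than anything stated in this column. The script only prints the svcadm command it would suggest.

```shell
#!/bin/sh
# Sketch: translate a traditional run level into a milestone FMRI.
# The mapping is an assumption drawn from Solaris documentation.
runlevel_to_milestone() {
    case $1 in
        s|S) echo svc:/milestone/single-user:default ;;
        2)   echo svc:/milestone/multi-user:default ;;
        3)   echo svc:/milestone/multi-user-server:default ;;
        *)   echo "no milestone equivalent for run level $1" >&2
             return 1 ;;
    esac
}

# Print (rather than run) the command that would move the system there.
for level in S 2 3; do
    echo "run level $level: svcadm milestone $(runlevel_to_milestone "$level")"
done
```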

Conclusion

Although the new SMF is totally different from the previous boot and daemon management within Solaris, it includes many welcome changes. The system boots faster and can recover from errors, such as hardware failures, that cause services to fail. It allows exact knowledge of the state of the system and its services, and allows easy management of those services. Overall, there is a lot to like, with only the fear of learning something new standing in the way of progress. Of course, if the new facility isn't learned, causing mayhem within a Solaris 10 system is a likely outcome. So it's time to roll up your sleeves and make sure you understand the new world before you are surprised by some new, unknown creature there.

Resources

For more information on SMF, there are useful documents and forums. BigAdmin is probably the best place to start:

http://www.sun.com/bigadmin/content/selfheal

As usual, docs.sun.com contains a wealth of information as well. Especially of interest is the "System Administration Guide: Basic Administration". Other information is spread throughout the systems administration guides.

Peter Baer Galvin (http://www.petergalvin.info) is the Chief Technologist for Corporate Technologies (www.cptech.com), a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating System Concepts and Applied Operating System Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.