Solaris(TM) Resource Management -- The Fair Share Scheduler

Peter Baer Galvin

The previous Solaris(TM) Companion began coverage of the new Solaris 9 resource management features. This month continues the analysis by exploring the core of S9RM, the fair share scheduler. This scheduler provides precise control of CPU use, allowing optimum use of system resources without conflict between applications and workloads.

The fair share scheduler (FSS) was available as an unbundled product before Solaris 9, and is included in the full Solaris 9 release at no charge. The Solaris 9 Resource Management package includes resource limits (discussed here previously), the FSS, resource pools, and the Internet protocol quality of service (IPQoS) facility (to be covered in the future). This suite is one of the key features of Solaris 9, along with the inclusion of SunScreen, new device support, and performance and reliability improvements. Solaris 9 is in the early adoption phase at most sites, and I hope this column will help Sun sites determine if and when to move to Solaris 9, and which features to adopt as part of the move or later in the Solaris 9 lifecycle.

Overview

Fair share scheduling is a concept dating from before 1988, when research in this area resulted in a paper by J. Kay and P. Lauder published in the Communications of the ACM. That paper and other information are available from http://citeseer.nj.nec.com/kay88fair.html. Publishing a research paper is a far cry from use in commercial operating systems and, in fact, few fair share schedulers have been implemented. They are not a general replacement for time-sharing schedulers, which are demand-based. Time-sharing schedulers try to maximize the use of system resources, and tend to allocate CPU slices to anyone who wants them. If a process wants to be a "hog," so be it; at least the system is being used fully!

With the advent of utility computing and Sun's drive toward the N1 architecture supporting on-demand service creation and destruction, more refined control of CPU use is required. If a facility runs N database servers for M different projects, it must have fine-grained control of resource use available. Enter the FSS, which is another loadable scheduler class within Solaris. In fact, it can coexist with the time-sharing, real-time, interactive, and fixed-priority scheduler classes already available on Solaris. Details of mixing use of these classes are beyond the scope of this column, but are available in the Sun documentation.
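
As an aside, the scheduler classes currently configured on a system can be listed with dispadmin. The exact output varies by machine, and FSS appears only once that class has been loaded:

# dispadmin -l
CONFIGURED CLASSES
==================

SYS     (System Class)
TS      (Time Sharing)
IA      (Interactive)
FSS     (Fair Share)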

Functionality

In general, the FSS allows projects, tasks, and processes to be assigned "shares" of the system CPUs. There is no fixed total or maximum share value. Rather, shares are relative. If one project has a share of 6 and the only other project has a share of 2, the first project will receive 6/8ths of the CPU cycles, and the second will receive 2/8ths.
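
For illustration, a hypothetical /etc/project fragment implementing that 6-to-2 split (the project names and IDs here are invented; the file syntax is covered below) might look like:

projecta:100::::project.cpu-shares=(privileged,6,none)
projectb:101::::project.cpu-shares=(privileged,2,none)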

The reality of the FSS is that projects (which contain tasks, which contain processes) can be determined by the systems manager to deserve some relative amount of the CPU cycles of a system, and that resource availability is guaranteed. If a system is funded by multiple groups, for instance, those groups can receive the amount of CPU corresponding to their contribution.

Of course, there are other ways to limit the amount of CPU use by processes. Processor sets are an absolute method. A process is assigned to a set of CPUs, and cannot run on other CPUs; nor can other processes run on those CPUs. (As an aside, resource pools are a Solaris 9 augmentation of processor sets. If you are using sets, check out the pools.)
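
For comparison, here is a minimal processor set sketch (the processor IDs and PID are placeholders, and the set ID is assigned by the system):

# psrset -c 1 2
created processor set 1
# psrset -b 1 12345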

FSS is a little more flexible than that. If CPU resources are unused, any process can use them. However, if a certain project needs resources, and deserves those resources, the FSS will schedule that project onto the CPUs and disallow use by anything else while that project is using them. Thus, the FSS nicely combines maximizing system resource use with assuring that CPU resources are available in a pre-determined ratio when demand is high.

Domains, processor sets, and the FSS can be used in combination to slice and dice a system to exactly match the site's needs. Domains are separately booted environments within a Sun server and do not affect each other. Processor sets exist within one operating system image and create a wall preventing use of CPUs by disparate processes. FSS could be used within a domain or a processor set to further refine which processes gain priority over which others. This combination provides a very complete set of CPU-use management tools.
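
As a related aside, on a system using resource pools the scheduler class can also be set per pool. A minimal sketch against the default pool, assuming a pools configuration already exists at /etc/pooladm.conf:

# poolcfg -c 'modify pool pool_default (string pool.scheduler = "FSS")'
# pooladm -c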

Utility

First, the FSS scheduling class needs to be set as the default scheduler for the system:

# dispadmin -d FSS
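
Running dispadmin -d with no arguments should confirm the new default (the exact output format may vary slightly):

# dispadmin -d
FSS     (Fair Share)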

This change is permanent (until another class is declared to be the default): processes on the system will be placed into the FSS class from now on, including after each reboot. At this point, none of the existing processes are in the new class. Rather, they are still in their old classes, such as the Solaris default class of TS. To move all of the existing TS processes to FSS, use:

# priocntl -s -c FSS -i class TS

And now check the scheduling classes that are in use:

# ps -ef -o pset,class | grep -v CLS | sort | uniq
  -   IA
  -   TS
  -  FSS
  -  SYS

The use of the FSS is based on a few commands and configuration files. The project.cpu-shares property in the /etc/project file adds share information to the project information typically contained there. Projects not given shares there are assigned a default share value of 1. Note that any projects with shares of 0 are starved of CPU use while any projects with shares greater than 0 are running.

In the following example of /etc/project, any tasks or processes in testproject are assigned a share of 10:

testproject:100::::project.cpu-shares=(privileged,10,none)

Note that only processes started after this value is in place receive that share value. The projects file is only read when a project is instantiated or a process joins an existing project. To change the share of all processes in testproject to 3 while they execute, use prctl:

# prctl -r -n project.cpu-shares -v 3 -i project testproject

This command only has a temporary effect, as it does not change /etc/project. The value specified by the -v option can range from 0 to 65535. All system processes run in project 0, which is given the maximum share value.
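
The current value can be confirmed at any time with prctl, using the same idtype syntax as above:

# prctl -n project.cpu-shares -i project testproject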

How does the FSS perform in the real world? Consider a system with three new projects defined:

# more /etc/project
system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::
lowprioproject:11:For testing:pbg::project.cpu-shares=(privileged,0,none)
medprioproject:12:For testing:pbg::project.cpu-shares=(privileged,1,none)
highprioproject:13:For testing:pbg::project.cpu-shares=(privileged,10,none)

Before any user processes are started, prstat -J shows an unused system with only the standard projects in use:

PROJID    NPROC  SIZE   RSS MEMORY      TIME  CPU PROJECT
     1        4   13M   11M   2.3%   0:00:01 0.2% user.root
     3       33  178M   98M    20%   0:00:07 0.1% default
     0       42   98M   56M    11%   7:40:22 0.0% system

Now we start a lowprioproject task:

$ newtask -p lowprioproject /usr/tmp/cpuhog &

And view the prstat project information:

PROJID    NPROC  SIZE   RSS MEMORY      TIME  CPU PROJECT
    11        2 1984K 1192K   0.2%   0:02:10 100% lowprioproject
     1        4   13M   11M   2.3%   0:00:01 0.1% user.root
     3       33  178M   98M    20%   0:00:07 0.1% default
     0       42   98M   56M    11%   7:40:22 0.0% system

Total: 82 processes, 148 lwps, load averages: 0.91, 1.07, 1.91

Now we can start a medium-priority project process; because it has 1 share and the only competing project has 0 shares, it should get as much CPU as it desires:

 PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
14236 pbg       928K  416K run      1    0   0:02:27  99% cpuhog/1
14230 pbg       928K  416K run      0    0   0:04:18 0.0% cpuhog/1

PROJID    NPROC  SIZE   RSS MEMORY      TIME  CPU PROJECT
    12        1  928K  416K   0.1%   0:02:27  99% medprioproject
     1        5   13M   11M   2.3%   0:00:01 0.2% user.root
     0       42   98M   56M    11%   7:40:22 0.1% system
     3       33  178M   98M    20%   0:00:07 0.0% default
    11        2 1984K 1192K   0.2%   0:04:18 0.0% lowprioproject

Total: 84 processes, 150 lwps, load averages: 1.95, 1.44, 1.84

Note that the load average is approximately 2, because two threads are runnable. The FSS scheduler gives no time slices to the cpuhog running with a share of 0, however.

Next, we can start the high-priority project (with 10 shares). To begin, we confirm that it has 10 shares:

$ newtask -p highprioproject
$ prctl -n project.cpu-shares $$
14265:  sh
project.cpu-shares         [ no-basic no-local-action ]
                           10 privileged none
                        65535 system     deny           [ max ]

Then, we start another "cpuhog" and check the results. As expected, the high-priority project gets approximately 90% of the CPU (10 of the 11 active shares), the medium-priority project gets about 10% (1 of 11), and the low-priority project is starved:

$ /usr/tmp/cpuhog &
$ prstat -J -p 14267,14236,14230
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 14267 root      928K  416K run      2    0   0:05:11  90% cpuhog/1
 14236 pbg       928K  416K run      1    0   0:05:22 9.3% cpuhog/1
 14230 pbg       928K  416K run      0    0   0:04:18 0.0% cpuhog/1

PROJID    NPROC  SIZE   RSS MEMORY      TIME  CPU PROJECT
    13        1  928K  416K   0.1%   0:05:11  90% highprioproject
    12        1  928K  416K   0.1%   0:05:22 9.3% medprioproject
    11        1  928K  416K   0.1%   0:04:18 0.0% lowprioproject

Total: 3 processes, 3 lwps, load averages: 3.02, 2.59, 2.23
Acknowledgement

Special thanks to Andrei Dorofeev, Member of Technical Staff at Sun, for giving input to this column. He also showed me an interesting command for listing how many runnable threads are on a system, including any normally hidden kernel threads:

# mdb -k
Loading modules: [ unix krtld genunix ip usba s1394 ufs_log random nfs ptm cpc
 lofs ]
> ::cpuinfo -v
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 00001400000  1b    3    0  57   no    no t-1    300051d2fc0 mdb
                  |    |
       RUNNING <--+    +-->  PRI THREAD      PROC
         READY                14 30005176560 cpuhog
        EXISTS                 1 30005176fe0 cpuhog
        ENABLE                 0 300051d3260 cpuhog
> ^D

Summary

The Solaris Fair Share Scheduler is part of the Solaris 9 Resource Manager suite. It creates a new scheduler class, and can be used to provide strict control over how much CPU each project's processes receive compared to other projects in the FSS class. Together with processor sets, resource pools, and domains, minute management of CPU use is now possible. Along with the other resource manager features, the FSS allows control over the use of the Solaris environment that has never before been possible.

You can find out all about the FSS in Sun's documentation:

http://docs.sun.com/db/doc/806-4076/6jd6amqqo?a=view

Experimentation with the new FSS is low risk, but should be done in non-production environments first. Enabling and using the FSS does not even require a reboot, so excuses for not giving it a try are limited.
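
Backing the change out is just as simple; a quick sketch, assuming TS was the previous default class:

# dispadmin -d TS
# priocntl -s -c TS -i class FSS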

Peter Baer Galvin (http://www.petergalvin.org) is the Chief Technologist for Corporate Technologies (www.cptech.com), a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating System Concepts and Applied Operating System Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.