For example, Staircase's average result (Interactive Enabled, Compute Off) is 16.423 seconds, but this value falls inside the vanilla kernel's range. So, in the long run, this confirms the earlier statement that the schedulers perform equally well.

The Time Per Request (TPR) and Time Per Request across Concurrent Requests (TPRCR) values measure the time taken to complete a single request. Let's focus on TPRCR, as it indicates how concurrent requests are managed individually. Here, the O(1) scheduler completed individual requests faster, but the Staircase scheduler was closely competitive; the 1.23% difference shows that these two schedulers race tightly to maximize throughput.

Note that the Staircase scheduler gave the fastest runtime (TTF, TPR, and TPRCR) with both interactive mode and compute mode disabled. This shows that the "burst" bonus brought no significant advantage on Apache, which runs fine with constant priority degradation. However, since the difference between enabling and disabling interactive mode falls within the related standard deviation, we can safely use either of those Staircase modes without seeing a significant difference. Enabling compute mode, however, gave the worst Staircase results in the TTF, TPR, and TPRCR cases, so it is clear that compute mode should be avoided here.

Let's move to the MySQL test (see Table 2). Note that when compute mode was enabled, the MySQL test failed six or seven times with the error "failure to connect to MySQL socket", so we simply dropped compute mode from the benchmark. The major indicator of database performance is how many transactions can be executed within a certain timeframe; more is better. In this case, Tps (Transactions per second) is the statistic we're looking for. On average, the O(1) scheduler beat the Staircase scheduler by a slight margin (2.42%).
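As an aside, the two per-request metrics discussed above can be reproduced from the totals ab prints: TPR scales the mean by the concurrency level, while TPRCR does not. A minimal sketch (the sample figures are illustrative, not taken from the benchmark tables):

```python
# Sketch of how ApacheBench (ab) derives its two per-request metrics
# from total test time, request count, and concurrency level.
# The sample numbers below are hypothetical.

def time_per_request(total_time_s, requests, concurrency):
    """TPR: mean time per request as seen by one client, in ms."""
    return concurrency * total_time_s * 1000.0 / requests

def time_per_request_concurrent(total_time_s, requests):
    """TPRCR: mean time per request across all concurrent requests, in ms."""
    return total_time_s * 1000.0 / requests

total_time_s = 16.423   # e.g., a "Time taken for tests" figure
requests = 100000
concurrency = 4

tpr = time_per_request(total_time_s, requests, concurrency)
tprcr = time_per_request_concurrent(total_time_s, requests)
print(f"TPR   = {tpr:.4f} ms")
print(f"TPRCR = {tprcr:.4f} ms")
```

TPR is always the concurrency level times TPRCR, which is why TPRCR is the better view of how concurrent requests are handled as a whole.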
However, using the same reasoning as in the Apache case, we can say that the schedulers produced equal results in the long run. Staircase performed best when interactive mode was enabled. This is the opposite of the Apache test results, where disabling interactive mode worked best. Why?

MySQL sometimes needs to fetch a big block of data from disk. With one million rows being read, and on some occasions updated, several processes must wait until the data is brought into the buffer or actually committed to disk. Combined with the locking scheme for each transaction, this reduces concurrency to some degree. A client process that is unable to lock the needed rows (to safely perform its operations) goes to sleep. Here, the interactivity component plays a major role: it increases the burst bonus when the process awakens, causing it to run longer at the highest priority. This makes the currently running task that receives a burst bonus (e.g., one doing a read query) less likely to be preempted by another process. The result is a faster completion time.

As always, the total time to complete the whole job is important. The results show that O(1) completed the whole Sysbench run faster than Staircase: O(1) needed 137.0962 seconds, whereas Staircase (with interactive mode enabled) needed 140.4485 seconds. To better judge this timing issue, look at the value of AvTr (average time per request). On average, the O(1) scheduler took 109.6 ms per request, and Staircase took 112.3 ms. Assuming each task consumes a constant "x" ms of real work, the difference shows the extra time (overhead) taken for context switching, including how fast the scheduler picks the new process. Also keep in mind that a scheduler only picks tasks in the TASK_RUNNING state; tasks in TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state must first be awakened (back into TASK_RUNNING) and so need extra time before the scheduler picks them up.
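The overhead reasoning above can be checked for consistency: if the AvTr gap really is per-request overhead, multiplying it by the implied request count should roughly reproduce the gap in total runtime. A quick sketch using the figures quoted above (the request count is derived, not measured):

```python
# Rough consistency check: does the per-request AvTr gap explain
# the gap in total Sysbench runtime? Figures are from the text;
# the request count is inferred from them.

o1_total, sc_total = 137.0962, 140.4485   # total runtime, seconds
o1_avtr, sc_avtr = 109.6, 112.3           # average time per request, ms

# Implied number of requests (total time / per-request time)
n_requests = o1_total / (o1_avtr / 1000.0)

# Extra time Staircase spends per request, and projected total impact
per_request_gap_ms = sc_avtr - o1_avtr
projected_gap_s = n_requests * per_request_gap_ms / 1000.0

print(f"implied requests : {n_requests:.0f}")
print(f"projected gap    : {projected_gap_s:.2f} s")
print(f"measured gap     : {sc_total - o1_total:.2f} s")
```

The projected gap (about 3.4 s) lines up with the measured difference in total runtime, supporting the reading of the AvTr gap as scheduling overhead.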
A scheduler should balance these fairly to avoid heavy starvation among ready-to-run processes and recently awakened processes. Objectively, the O(1) scheduler managed this slightly more efficiently.

Latency

Faster runtime is everybody's concern, but latency must also be considered. As we know, a single CPU (or a single core on a multiple-core CPU) executes only one process at a time. If you have a dual processor, your computer can execute two processes at a time, and so on. But on a busy server, the running tasks might number in the hundreds, thousands, or more. So, the scheduler must pick one task and set the others aside for a certain time slice. When the time slice expires, another task is picked up and the previous one is paused. Thus, there is always waiting time between the triggering of a task and its actual execution; this waiting time also includes the interval between when a task is paused by the scheduler and when it is restarted. This waiting time is known as latency. To measure latency, the following indicators were used:

1. For the Apache case, connection time is the base indicator. The Apache benchmark (ab) provides this information in the "Connection Times" section. Here we pick the "Total" row and focus on the minimum (min) and maximum (max) times. "Total" time represents the time between when a client initiates a connection and when it receives the last byte of data from the server (thus closing the whole HTTP session). We also pick a number from the "Percentage of the requests..." section to represent the majority of total connection times; 95% was judged a fair number to represent a majority of the sampled data.

2. For the MySQL case, we again take data from the per-request statistics. As in the Apache case, we grab the minimum (min) time, the maximum (max) time, and the approximate 95% value. Again, 95% was considered representative of the majority of per-request runtimes.
In both cases, we calculated the following: the max-min difference (maximum time minus minimum time) and the major-min difference (95% time minus minimum time).
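Given a raw sample of per-request times, the two differences can be computed like this (a pure-Python sketch; the timings are made up, and the nearest-rank percentile is an assumption, since ab and Sysbench report the 95% value directly):

```python
import math

def latency_spreads(times_ms):
    """Return (max-min diff, major-min diff) for a sample of times."""
    s = sorted(times_ms)
    # Nearest-rank 95th percentile: the "majority" time that 95%
    # of requests stayed under.
    major = s[max(0, math.ceil(0.95 * len(s)) - 1)]
    return s[-1] - s[0], major - s[0]

sample = list(range(10, 28)) + [60, 480]   # 20 made-up timings, in ms
max_min, major_min = latency_spreads(sample)
print(f"max-min diff   = {max_min} ms")    # dominated by the one outlier
print(f"major-min diff = {major_min} ms")  # what most requests experienced
```

The gap between the two numbers shows why both are reported: a single outlier dominates max-min, while major-min tracks what a typical request sees.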
The above numbers are equally weighted. Ideally, there should be nearly zero difference between them; but since the maximum time is likely to be much higher than the "majority" time, always expect a big gap between the max-min diff and the major-min diff.

Let's observe Apache first (see Table 3). On latency handling, the Staircase scheduler appears to have improved. With interactive mode and compute mode disabled, the Staircase scheduler lowered the worst-case latency (max-min) value. The 31.73% (1718.91 ms) difference is somewhat surprising, but since Staircase focuses on interactivity, it proves that the algorithm truly works. The major-min difference is also interesting. Looking at that column alone, compute mode appears to be the best mode for Staircase; but the adjacent column shows that its max-min difference is very bad. Disabling both interactive mode and compute mode still gives better major-min latency. Overall, the O(1) scheduler gives the lowest major-min value.

Now let's look at the MySQL test (see Table 4). Again, we throw away compute mode, for the same reason it was dropped from the completion-time test. So, it is just the two remaining Staircase modes versus the O(1) scheduler. Once again, with interactive mode and compute mode disabled, the Staircase scheduler won on worst-case latency by 414.42 ms (24.5%). The difference shows that even on heavy operations, such as these MySQL transactions, Staircase kept latency low. Obviously, severe starvation is not something a user expects while awaiting SQL query results. This is not easy to solve because, again, locking (at either table or row level) can complicate latency, so don't expect such low latency when locks or long SQL transactions are involved. In the major-min column, the O(1) scheduler consistently won, just as in the Apache case.
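Each latency win above is quoted as both an absolute and a relative figure, and the pair implies the underlying table values, a handy sanity check when the tables aren't at hand. (The derived baselines below are inferred, not quoted from Tables 3 and 4.)

```python
# Derive the implied baseline (O(1)) max-min latency from the
# absolute and percentage differences quoted in the text.

def implied_baseline(diff_ms, diff_pct):
    base = diff_ms / (diff_pct / 100.0)
    return base, base - diff_ms          # (O(1) value, Staircase value)

apache = implied_baseline(1718.91, 31.73)  # Apache worst-case latency
mysql = implied_baseline(414.42, 24.5)     # MySQL worst-case latency
print(f"Apache: O(1) ~ {apache[0]:.1f} ms, Staircase ~ {apache[1]:.1f} ms")
print(f"MySQL : O(1) ~ {mysql[0]:.1f} ms, Staircase ~ {mysql[1]:.1f} ms")
```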
This brings us to an early conclusion: if you really care about worst-case latency, go with the Staircase scheduler. But if you care more about how fast your service serves clients, the O(1) scheduler might be the better choice.

Scalability

Now let's use different numbers of worker threads in the Apache and MySQL tests. This scalability test measures timing trends for each scheduler. The rules and configuration are the same as before, with some changes to note:

1. In both tests (Apache and MySQL), we pit the O(1) scheduler against only the non-compute modes of Staircase, to simplify analysis.

2. The test is executed using 8 different thread counts. I chose exponential thread counts (e.g., 2, 4, 8, and so on), but we are limited by the memory available to run a huge number of threads. Thus, this scalability test provides only a limited view of scalability, and we can't predict what happens beyond the tested thread counts. At each thread count (for example, 4 threads), I ran the test 5 times and calculated the average runtime.

3. We focus only on overall completion time, because the most relevant factor in a scalability test is throughput. This is narrowed further to the total time taken for all threads to complete, because analyzing individual process times would be overwhelming. For Apache, I used the "Time taken for tests" output of ab; for MySQL, the "Total time" output from Sysbench.

Note that scalability on a uniprocessor has vastly different requirements than scalability across increasing numbers of processors, so no conclusion about SMP scalability can be drawn from these results. As before, we first analyze the Apache results (see Table 5 and Figure 3) and then the MySQL results. To begin, we must pick a point as the base of the scale; "point" here refers to a number of running threads. Ideally, we would pick 4 threads as the base, because that's where we started.
Note, however, that from 4 to 16 threads, the Staircase scheduler shows a decreasing trend, which means a single CPU can still handle these loads without penalty. Then, from 16 to 256 threads, total runtime increases. With this in mind, I decided to measure scalability using 16 threads as the base and compare it with the results at 128 and 256 threads. The actual calculation is shown in Table 5. Note that we are looking for the lower percentage, not the higher one: a higher number means runtime is increasing, and we want faster (lower) runtime. Here is the general form of the scalability formula used:
Scalability of B-to-A = (runtime result B / runtime result A) / (B / A), where B and A are numbers of threads.
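The formula translates directly into code; the runtimes below are hypothetical, chosen only to show that a result under 1.0 means runtime grew more slowly than the thread count:

```python
def scalability(runtime_a, threads_a, runtime_b, threads_b):
    """Scalability of B-to-A per the formula above; lower is better."""
    return (runtime_b / runtime_a) / (threads_b / threads_a)

# Hypothetical: 30 s at the 16-thread base, 200 s at 128 threads.
s = scalability(30.0, 16, 200.0, 128)
print(f"16-to-128 scalability: {s:.4f}")  # < 1.0: sub-linear runtime growth
```

A value of exactly 1.0 would mean runtime grew in lockstep with the thread count; anything lower means the scheduler absorbed part of the extra load.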
In the 16-to-128 and 16-to-256 cases, Staircase showed better scalability with interactive mode disabled -- 0.34% lower than the O(1) scheduler. We can say, therefore, that they scale virtually equally. So, whenever we add 10, 20, or maybe 40 more threads, we can expect the O(1) scheduler and Staircase to grow similarly. Does this mean Staircase can outperform the O(1) scheduler with a larger number of threads? It is hard to say; looking at the table, the O(1) scheduler still performs better. However, it is interesting to note that at 64 threads, there is very little difference between them. Between the two Staircase modes, there is no significant difference (< 0.5%), so there is no significant growth or shrinkage when alternating between them.

The MySQL scalability results are shown in Table 6 and Figure 4. Using the same logic as in the Apache scalability test, we start by picking a base point. From 2 to 16 threads, all schedulers show decreasing trends; then, from 16 to 256 threads, total runtime increases. So we use 16 threads as the base and compare it with the results at 64 and 256 threads. The actual calculation is shown in Table 6, which shows that Staircase scaled slightly better (2%) than the O(1) scheduler. Since the growth is smaller, the Sysbench test ran faster at 256 and 128 threads using the Staircase scheduler. So, is the previous conclusion (from the MySQL completion-time section) wrong? Yes and no. Recall that locking is involved here, and a larger number of threads means serialization will slow everything down to maintain data consistency. Therefore, the scalability of a scheduler under database operation is somewhat unpredictable. As in the Apache case, there is no significant difference (< 0.7%) between the two Staircase modes, so there is no significant growth or shrinkage when alternating between them.

Conclusions

Picking an alternative scheduler is always tricky.
Remember, there is no single bull's-eye solution for scheduling. In this article, I analyzed the Staircase scheduler's performance under a server workload. The O(1) scheduler was chosen as a baseline because Staircase represents major modifications to it -- primarily the removal of the expired queue and a modified dynamic-priority calculation. Although the O(1) scheduler performed slightly better in both the Apache and MySQL tests, the Staircase scheduler shows great promise for server workloads (as shown in the latency test). Although the Staircase scheduler was developed primarily for desktop needs, we've seen that it can match the O(1) scheduler in certain cases. Therefore, I recommend testing the Staircase scheduler (and others) in your own configuration to determine which one best suits your needs. Note that I received a last-minute note from Con Kolivas saying that throughput is the primary area addressed after Staircase version 10.5, so we may expect better results from updated Staircase code in the near future. If you are interested in testing the Staircase scheduler with other useful patches, ready-to-use patch sets are available from http://kernel.kolivas.org -- one for desktop usage and another for server usage.

From the experiments, here are the important points about Staircase:

1. Interactive mode yields different results on different applications. The rule of thumb is to enable it for applications that are heavy on I/O and disable it for less I/O-intensive tasks. Memory-hogging tasks are not much influenced by interactive mode, but disk-I/O-intensive tasks benefit from it.

2. Compute mode is not suitable for applications such as databases, because it delays preemption and thus adds latency, a key factor in overall throughput. However, it might be useful for almost purely CPU-bound tasks, such as rendering, because compute mode gives a longer time slice, allowing the process to finish faster with less chance of being preempted.

3. If you care about worst-case latency, disabling both interactive mode and compute mode is the best way to go.

4. Disabling or enabling interactive mode makes the schedulers scale quite similarly. So, if you get equal results at a certain thread count, you will likely get equal results when increasing or decreasing it. Therefore, if you decide that a certain Staircase mode works best for you, stick with it unless the application characteristics (e.g., table type, amount of cache, HTTP timeout) or hardware specifications (e.g., RAM, L1/L2 cache, number of CPUs) change.

Remember that this test was conducted in single-user mode, which is not generally used in a production environment. A sysadmin usually picks run level 3 (multi-user mode with network support) and runs ad hoc services such as crond, xinetd, and others. This environment might affect how your main applications work. To lessen the effect, always set non-primary tasks to a lower priority (using "nice"). You can also explore the SCHED_BATCH policy implemented in the Con Kolivas patchset.

Acknowledgements

The author thanks God and the following people for the feedback and comments that made this article better (in no particular order): Con Kolivas, Jake Moilanen, Rick Lindsley, Mithlesh Thukral, Amit Shah, R. Krishnakumar, Baruch Evans, Nur Hussein, Josh Aas, Tony Vegan, Muli Ben Yehuda, Piotr Neuman, Philippe Reynes.

References

1. Brief explanation of how the Staircase scheduler works, by Con Kolivas -- http://lwn.net/Articles/87729/
2. Documentation of the Linux kernel scheduler (based on version 2.6.8.1) -- http://josh.trancesoftware.com/linux/
3. Documents inside the Linux kernel package explaining the O(1) scheduler -- Documentation/sched-design.txt, Documentation/sched-coding.txt
4. MySQL Reference Manual -- http://dev.mysql.com/doc/mysql/en/index.html
5. Schedtool, a tool to query or alter a process's scheduling policy under Linux -- http://freequaos.host.sk/schedtool/
6. Con Kolivas patchset -- http://kernel.kolivas.org
7. Backport of the Con Kolivas patchset to Linux kernel 2.4 -- http://www.plumlocosoft.com/kernel/
8. Using MySQL to benchmark OS performance -- http://software.newsforge.com/software/04/12/27/1238216.shtml?tid=72&tid=29
9. Comparing MySQL performance -- http://www.newsforge.com/article.pl?sid=04/12/27/1243207

Further Reading

1. Love, Robert M. 2005. Linux Kernel Development, 2nd ed. Novell Press.
2. Linux kernel analysis -- http://tldp.org/HOWTO/KernelAnalysis-HOWTO.html
3. Linux: Benchmarking the Staircase Scheduler -- http://kerneltrap.org/node/3589
4. Description of the Staircase scheduler -- http://lwn.net/Articles/87729/
5. Staircase vs. non-Staircase comparison -- http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html
6. OProfile kernel profiler -- http://oprofile.sourceforge.net/
7. Patch: CPU scheduler evaluation tool -- http://lwn.net/Articles/91455/
8. CPU scheduler evaluation patch -- https://sourceforge.net/projects/cpuse/; http://cpuse.sourceforge.net/
9. Linux Scheduler Stuff Page -- http://www.xmailserver.org/linux-patches/lnxsched.html

Mulyadi Santosa lives in Indonesia and works as a freelance writer and IT consultant. He received his bachelor's degree in informatics engineering. He specializes in clustering and Internet services. He can be contacted at mulyadi.santosa@gmail.com.
© 2005 CMP Media LLC. All Rights Reserved.