Cover V14, i08

aug2005.tar

Queuing Jobs with qjob

Ed Schaefer and John Spurgeon

Most systems administrators are familiar with using the cron facility or the at command to schedule jobs. Administrators often try to schedule jobs so that they don't conflict with one another. For example, it may be necessary to limit the number of resource-intensive jobs running at the same time to avoid overloading the system. Or you may need to prevent jobs from simultaneously accessing a shared resource. This can be challenging, especially if commands must be run frequently and the time they take to complete is long and variable.

One solution is to create a wrapper script that executes commands in series. This can solve the problem if there is a way to guarantee that only one instance of the wrapper script executes at a time. However, the situation becomes more complicated if some jobs need to be run more often than others.
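
As an illustration of the wrapper approach, here is a minimal sketch that uses a lock directory to guarantee a single running instance. It is not part of qjob; the names run_serially and LOCK_DIR, and the lock path under /tmp, are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical wrapper: run commands one after another, one wrapper at a time.
# mkdir is atomic, so only one process can create the lock directory.
LOCK_DIR=${LOCK_DIR:-/tmp/serial_wrapper.lock}

run_serially() {
    if ! mkdir "$LOCK_DIR" 2>/dev/null
    then
        echo "another instance is already running" >&2
        return 1
    fi
    status=0
    for cmd in "$@"
    do
        # Each argument is one complete command line.
        sh -c "$cmd" || status=$?
    done
    rmdir "$LOCK_DIR"
    return $status
}
```

A call such as run_serially "echo one" "echo two" runs the two commands in order. A production version would also need a trap that removes the lock directory if the wrapper is killed.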

We've developed a shell script called qjob (Listing 1), which places jobs in a FIFO queue and executes them when they are removed from the queue. This simplifies the scheduling problem and gives you more flexibility than a wrapper script. With qjob, you can configure how many jobs are allowed to run at once for a given queue. If only one job can run at a time, then the queue is analogous to a checkout lane at a supermarket where all the customers wait while a cashier services the person at the head of the line. If more than one job can run at once, then the queue behaves like a single line at a bank with multiple tellers servicing customers simultaneously.

Options

The following options are supported by qjob:

-j job_name -- job_name is an alias that identifies the command to be executed. If job_name is not specified, it defaults to the name of the command.

-c num_clones -- num_clones is the number of times the same job can be waiting in the queue or running. The default value is 1. In other words, if a job is currently running or waiting in the queue, then by default another instance of the same job will not be allowed in the queue. This can prevent the size of the queue from growing out of control in case jobs are entering the queue more rapidly than they are leaving it.

-q queue_name -- Qjob can manage multiple queues. queue_name identifies a particular queue. If queue_name is not specified, then "default_queue" is the name of the queue.

-n num_tellers -- num_tellers is the number of jobs that can run simultaneously for a given queue. The default value is 1.

-d -- Turns on debugging, which causes qjob to print informational messages to standard output.
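
Options like these can be parsed with getopts. The sketch below is an assumption about how such parsing might look, not the code from Listing 1; the function name parse_qjob_args and the debug variable are hypothetical, and the defaults follow the descriptions above.

```shell
# Hypothetical option parsing for the flags described above (not Listing 1).
parse_qjob_args() {
    job_name="" queue_name="default_queue" num_tellers=1 num_clones=1 debug=0
    OPTIND=1
    while getopts "j:c:q:n:d" opt
    do
        case $opt in
            j) job_name=$OPTARG ;;
            c) num_clones=$OPTARG ;;
            q) queue_name=$OPTARG ;;
            n) num_tellers=$OPTARG ;;
            d) debug=1 ;;
            *) return 1 ;;
        esac
    done
    shift $((OPTIND - 1))
    # Whatever remains is the command to run, with its arguments.
    command="$*"
    # With no -j, the job name defaults to the name of the command
    # (assumed here to be the first word of the command line).
    [ -n "$job_name" ] || job_name=${1:-}
    return 0
}
```

After parse_qjob_args -q demo_que -n 2 -d withdraw.ss 11, queue_name is demo_que, num_tellers is 2, num_clones keeps its default of 1, and job_name is withdraw.ss.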

Implementation

The main function in the qjob script is called queue_job. After qjob processes the command-line arguments, the queue_job function is called as follows:

queue_job $job_name $queue_name $num_tellers $num_clones "$command"

Within the queue_job function, semaphores are utilized to implement queues. (See "Implementing Semaphores in the Shell", Sys Admin, August, 2004.)

The heart of the queue_job function is two nested semaphores:

if semaphore -t 0 -I $num_clones -P ${job_name}_already_queued
then
    # ... then $job_name takes a number and waits to be serviced.
    # When a teller is free ...
    if semaphore -q -s 10 -I $num_tellers -P $queue_name
    then
    .
    .

The outer call uses a semaphore to control the number of clones allowed. The semaphore name is set by appending the string "_already_queued" to the job_name variable.

If the outer call is successful, the inner call uses a counting semaphore with $num_tellers resources to implement a queue. If the inner call is successful, $command executes when a resource is obtained. After $command completes, the resources for the two semaphores are released.
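
To make the nesting concrete, the following sketch builds a crude counting semaphore from atomic mkdir calls and uses it in the same outer/inner pattern. It is an illustration only: it is not the semaphore command used by qjob, it polls instead of keeping a true FIFO queue, and the names SEM_DIR, sem_try, sem_release, and run_limited are all hypothetical.

```shell
# Hypothetical counting semaphore built on atomic mkdir; this is an
# illustration only and is not the semaphore command used by qjob.
SEM_DIR=${SEM_DIR:-/tmp/sem_sketch}

# Try once to take one of $2 resources of semaphore $1.
# Prints the slot number taken, or returns 1 if all are busy.
sem_try() {
    mkdir -p "$SEM_DIR"
    slot=0
    while [ "$slot" -lt "$2" ]
    do
        if mkdir "$SEM_DIR/$1.$slot" 2>/dev/null
        then
            echo "$slot"
            return 0
        fi
        slot=$((slot + 1))
    done
    return 1
}

# Give back slot $2 of semaphore $1.
sem_release() {
    rmdir "$SEM_DIR/$1.$2"
}

# Nested use: refuse at once if too many clones exist, otherwise poll
# until a teller is free, run the command, then release both resources.
run_limited() {   # usage: run_limited job queue tellers clones command...
    job=$1 queue=$2 tellers=$3 clones=$4
    shift 4
    c=$(sem_try "${job}_already_queued" "$clones") || return 1
    until t=$(sem_try "$queue" "$tellers")
    do
        sleep 1
    done
    "$@"
    rc=$?
    sem_release "$queue" "$t"
    sem_release "${job}_already_queued" "$c"
    return $rc
}
```

For example, run_limited report demo_que 1 1 echo hello takes one clone slot and one teller slot, prints hello, and releases both slots.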

Examples

In the following examples, we use two shell scripts, withdraw.ss and deposit.ss, to demonstrate how qjob works. Each script creates a marker file in /tmp and keeps running until that file is removed. Here is the source code for the two scripts:

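#!/bin/ksh

# withdraw.ss
# Create the marker file /tmp/w<arg> and run until it is removed.
filename=/tmp/w"$1"
touch "$filename"
while [[ -f $filename ]]
do
    sleep 5
done
exit

#!/bin/ksh

# deposit.ss
# Create the marker file /tmp/d<arg> and run until it is removed.
filename=/tmp/d"$1"
touch "$filename"
while [[ -f $filename ]]
do
    sleep 5
done
exit
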
Example 1

Execute qjob with the debug flag set and assign withdraw.ss with argument 11 to the demo_que queue:

/usr/local/bin/qjob -q demo_que -d withdraw.ss 11 &

Now serving job withdraw.ss from queue demo_que.

Now, assign deposit.ss with argument 22 to the demo_que queue:

/usr/local/bin/qjob -q demo_que -d deposit.ss 22 &

Executing the semaphore -l command lists the jobs that hold the queue's resources and the jobs that are waiting for them:

semaphore -l demo_que
returns:

Semaphore demo_que has 1 resource.
  Resource 0 is taken by PID 28879.
  /bin/ksh /usr/local/bin/qjob -q demo_que -d withdraw.ss 11
Queue for semaphore demo_que:
  1. PID 29102 has been waiting since 2005-05-12 08:06:01.
     /bin/ksh /usr/local/bin/qjob -q demo_que -d deposit.ss 22

Because the queue has only one resource (the default), deposit.ss waits in the queue until withdraw.ss finishes.

Because the number of clones defaults to 1 and one instance of deposit.ss is already waiting, an attempt to add another instance of deposit.ss to demo_que fails:

/usr/local/bin/qjob -q demo_que -d deposit.ss 33 &

Can't put job deposit.ss in queue demo_que.
1 instance of job deposit.ss is waiting in queue demo_que.

Removing the /tmp/w11 file terminates the withdraw.ss script, and because the debug flag is set, the following message is displayed:

Now serving job deposit.ss from queue demo_que.

Executing semaphore -l demo_que again confirms that deposit.ss is now running:

Semaphore demo_que has 1 resource.
  Resource 0 is taken by PID 13583.
  /bin/ksh /usr/local/bin/qjob -q demo_que -d deposit.ss 22
  
Example 2

Let's demo a bank with two tellers by executing the following jobs with the -n option set to 2:

  /usr/local/bin/qjob -q demo_que -n 2 -d withdraw.ss 11 &

  /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 22 &

Executing semaphore -l demo_que produces:

Semaphore demo_que has 2 resources.
  Resource 0 is taken by PID 29272.
  /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d withdraw.ss 11
  Resource 1 is taken by PID 29437.
  /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 22
Place another deposit in the queue:

/usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 33 &

Then verify with semaphore -l demo_que that it is waiting:

Semaphore demo_que has 2 resources.
  Resource 0 is taken by PID 29272.
  /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d withdraw.ss 11
  Resource 1 is taken by PID 29437.
  /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 22
Queue for semaphore demo_que:
  1. PID 281 has been waiting since 2005-05-15 11:30:31.
     /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 33

Attempting to queue yet another deposit.ss fails because, with the default -c value of 1, only one instance of a job may wait in the queue.

Terminating the withdraw.ss job by deleting /tmp/w11 frees a teller, and the queued deposit job starts running:

Semaphore demo_que has 2 resources.
  Resource 0 is taken by PID 281.
  /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 33
  Resource 1 is taken by PID 29437.
  /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 22
  
Example 3

The final example keeps the same two tellers but sets the number of clones to 2. Execute the following five withdraw.ss jobs:

/usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 11 &
/usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 22 &
/usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 33 &
/usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 44 &
/usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 55 &
Executing semaphore -l demo_que displays the following output:

Semaphore demo_que has 2 resources.
  Resource 0 is taken by PID 8087.
  /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 11
  Resource 1 is taken by PID 8206.
  /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 22
Queue for semaphore demo_que:
  1. PID 8358 has been waiting since 2005-05-15 11:51:12.
     /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 33
  2. PID 8505 has been waiting since 2005-05-15 11:51:43.
     /bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 44

Because demo_que now has two resources and allows two clones of the same job, two jobs run, two wait in the queue, and only the last withdraw.ss job is discarded.
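
The outcome of all three examples can be reproduced with a small model. The function below is a hypothetical sketch inferred from the sample output above, not from Listing 1: it assumes that jobs arrive in order, that none finish in the meantime, and that the clone limit is checked only against instances already waiting in line. The name simulate_queue is made up for this example.

```shell
# Hypothetical model of the behaviour seen in the three examples; it is
# inferred from the sample output, not taken from Listing 1.
# usage: simulate_queue num_tellers num_clones job...
simulate_queue() {
    tellers=$1 clones=$2
    shift 2
    running=0
    waiting=""            # space-separated names of jobs waiting in line
    for job in "$@"
    do
        if [ "$running" -lt "$tellers" ]
        then
            running=$((running + 1))
            echo "$job: running"
            continue
        fi
        # Count how many instances of this job are already waiting.
        count=0
        for w in $waiting
        do
            if [ "$w" = "$job" ]
            then
                count=$((count + 1))
            fi
        done
        if [ "$count" -lt "$clones" ]
        then
            waiting="$waiting $job"
            echo "$job: waiting"
        else
            echo "$job: rejected"
        fi
    done
}
```

Calling simulate_queue 2 2 with five withdraw.ss arguments reports two running, two waiting, and one rejected, which matches Example 3.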

What's in the Tarball

In addition to the qjob source (Listing 1), we are including an updated copy of semaphore. The original semaphore contained a bug in its queue option: it did not properly limit the number of jobs that could be placed in the queue.

References

Schaefer, Ed, and John Spurgeon. 2004. "Implementing Semaphores in the Shell", Sys Admin 13(8):41-47. http://www.samag.com/documents/s=9238/sam0408f/0408f.htm

John Spurgeon is a software developer and systems administrator for Intel's Factory Integrated Information Systems, FIIS, in Aloha, Oregon. He is currently training for the Furnace Creek 508 and still enjoys turfgrass management, triathlons, and spending time with his family.

Ed Schaefer is a frequent contributor to Sys Admin. He is a software developer and DBA for Intel's Factory Integrated Information Systems, FIIS, in Aloha, Oregon. Ed also edits the monthly Shell Corner column on UnixReview.com. He can be reached at: shellcorner@comcast.net.