Queuing Jobs with qjob
Ed Schaefer and John Spurgeon
Most systems administrators are familiar with using the cron facility
or the at command to schedule jobs. Sometimes jobs must be scheduled
so that they don't conflict with one another.
For example, it may be necessary to limit the number of resource-intensive
jobs running at the same time to avoid overloading the system. Or
you may need to prevent jobs from simultaneously accessing a shared
resource. This can be challenging, especially if commands must be
run frequently and the time they take to complete is significant
and variable.
One solution is to create a wrapper script that executes commands
in series. This can solve the problem if there is a way to guarantee
that only one instance of the wrapper script executes at a time.
However, the situation becomes more complicated if some jobs need
to be run more often than others.
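To make the wrapper approach concrete, here is a minimal sketch of such a script. It is an illustration only and not part of qjob: the function name run_serial and the lock directory argument are hypothetical, and the sketch assumes that mkdir is atomic so that only one instance can hold the lock at a time.

```shell
#!/bin/sh
# run_serial LOCKDIR COMMAND...
# Hypothetical wrapper: runs each COMMAND in series, but only if no
# other instance currently holds LOCKDIR. mkdir either creates the
# directory or fails, so two instances cannot both acquire the lock.
run_serial() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null
    then
        echo "run_serial: another instance is running" >&2
        return 1
    fi
    status=0
    for cmd in "$@"
    do
        # Run the commands one at a time; remember the last failure.
        sh -c "$cmd" || status=$?
    done
    rmdir "$lockdir"
    return $status
}

# Example call (paths and commands are placeholders):
#   run_serial /tmp/nightly.lock "job_a" "job_b"
```

Note that a second invocation gives up instead of waiting, and every invocation runs the same list of commands, so this sketch offers no way to run one job more often than another.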
We've developed a shell script called qjob (Listing 1), which
places jobs in a FIFO queue and executes them when they are removed
from the queue. This simplifies the scheduling problem and gives
you more flexibility than a wrapper script. With qjob, you can configure
how many jobs are allowed to run at once for a given queue. If only
one job can run at a time, then the queue is analogous to a checkout
lane at a supermarket where all the customers wait while a cashier
services the person at the head of the line. If more than one job
can run at once, then the queue behaves like a single line at a
bank with multiple tellers servicing customers simultaneously.
Options
The following options are supported by qjob:
-j job_name -- job_name is an alias that identifies
the command to be executed. If job_name is not specified,
it defaults to the name of the command.
-c num_clones -- num_clones is the number of times
the same job can be waiting in the queue or running. The default
value is 1. In other words, if a job is currently running or waiting
in the queue, then by default another instance of the same job will
not be allowed in the queue. This can prevent the size of the queue
from growing out of control in case jobs are entering the queue
more rapidly than they are leaving it.
-q queue_name -- qjob can manage multiple queues. queue_name
identifies a particular queue. If queue_name is not specified,
the queue name defaults to "default_queue".
-n num_tellers -- num_tellers is the number of jobs that can
run simultaneously for a given queue. The default value is 1.
-d -- Turns on debugging, which causes qjob to print
informational messages to standard output.
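As a rough illustration of how these options and their defaults fit together, the following sketch parses them with getopts. It is not taken from Listing 1: the function name parse_qjob_args and the debug variable are assumptions, while the other variable names follow the ones used in the article.

```shell
#!/bin/sh
# parse_qjob_args [options] command [args...]
# Hypothetical sketch: sets job_name, queue_name, num_tellers,
# num_clones, debug, and command from a qjob-style command line.
parse_qjob_args() {
    job_name=
    queue_name=default_queue   # default when -q is not given
    num_tellers=1              # default when -n is not given
    num_clones=1               # default when -c is not given
    debug=0
    OPTIND=1
    while getopts j:c:q:n:d opt
    do
        case $opt in
            j) job_name=$OPTARG ;;
            c) num_clones=$OPTARG ;;
            q) queue_name=$OPTARG ;;
            n) num_tellers=$OPTARG ;;
            d) debug=1 ;;
            *) return 2 ;;
        esac
    done
    shift $((OPTIND - 1))
    # A command to run is required after the options.
    [ $# -gt 0 ] || return 2
    # job_name defaults to the name of the command.
    job_name=${job_name:-$1}
    command=$*
}
```

For instance, parse_qjob_args -q demo_que -d withdraw.ss 11 leaves num_tellers and num_clones at 1 and sets job_name to withdraw.ss.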
Implementation
The main function in the qjob script is called queue_job. After
qjob processes the command-line arguments, the queue_job function
is called as follows:
queue_job $job_name $queue_name $num_tellers $num_clones "$command"
Within the queue_job function, semaphores are used to implement
queues. (See "Implementing Semaphores in the Shell," Sys Admin,
August 2004.)
The heart of the queue_job function is two nested semaphores:
if semaphore -t 0 -I $num_clones -P ${job_name}_already_queued
then
# ... then $job_name takes a number and waits to be serviced.
# When a teller is free ...
if semaphore -q -s 10 -I $num_tellers -P $queue_name
then
.
.
The outer call uses a semaphore to control the number of clones allowed.
The semaphore name is set by appending the string "_already_queued"
to the job_name variable.
If the outer call is successful, the inner call uses a counting
semaphore with $num_tellers resources to implement a queue. If the
inner call is successful, $command executes when a resource is obtained.
After $command completes, the resources for the two semaphores are
released.
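The nesting can be sketched without the semaphore script. The code below is an assumption-laden illustration, not the authors' implementation: acquire_slot, release_slot, queue_job_sketch, and SEM_DIR are hypothetical names, each resource is modeled as a directory created with mkdir, and a waiting job simply polls, so FIFO order is not guaranteed as it is with the real semaphore queue.

```shell
#!/bin/sh
# Directory that holds the slot directories (hypothetical location).
SEM_DIR=${SEM_DIR:-/tmp/sem_sketch}

# acquire_slot NAME COUNT
# Try to take one of COUNT resources of semaphore NAME. Prints the
# slot path on success; returns 1 if every resource is taken.
acquire_slot() {
    name=$1
    count=$2
    i=0
    mkdir -p "$SEM_DIR"
    while [ "$i" -lt "$count" ]
    do
        if mkdir "$SEM_DIR/$name.$i" 2>/dev/null
        then
            echo "$SEM_DIR/$name.$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# release_slot PATH: give back a resource obtained from acquire_slot.
release_slot() {
    rmdir "$1"
}

# queue_job_sketch JOB QUEUE TELLERS CLONES COMMAND
queue_job_sketch() {
    job=$1 queue=$2 tellers=$3 clones=$4 cmd=$5
    # Outer semaphore: fail immediately if too many clones exist.
    clone_slot=$(acquire_slot "${job}_already_queued" "$clones") || {
        echo "Can't put job $job in queue $queue." >&2
        return 1
    }
    # Inner semaphore: wait (by polling) until a teller is free.
    until teller_slot=$(acquire_slot "$queue" "$tellers")
    do
        sleep 1
    done
    sh -c "$cmd"
    status=$?
    # Release both resources once the command completes.
    release_slot "$teller_slot"
    release_slot "$clone_slot"
    return $status
}
```

The outer call fails at once, which mirrors the -t 0 timeout in the fragment above, while the inner call blocks until a resource is available.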
Examples
In the following examples, we use two shell scripts, withdraw.ss
and deposit.ss, to demonstrate how qjob works. Here is the source
code for the two scripts:
#!/bin/ksh
# withdraw.ss
filename=/tmp/w"$1"
touch $filename
while [[ -f $filename ]]
do
sleep 5
done
exit
#!/bin/ksh
# deposit.ss
filename=/tmp/d"$1"
touch $filename
while [[ -f $filename ]]
do
sleep 5
done
exit
Example 1
Execute qjob with the debug flag set and assign withdraw.ss with
argument 11 to the demo_que queue:
/usr/local/bin/qjob -q demo_que -d withdraw.ss 11 &
Now serving job withdraw.ss from queue demo_que.
Now, assign deposit.ss with argument 22 to the demo_que queue:
/usr/local/bin/qjob -q demo_que -d deposit.ss 22 &
Executing the semaphore -l command lists the objects assigned
to the queue:
semaphore -l demo_que
returns:
Semaphore demo_que has 1 resource.
Resource 0 is taken by PID 28879.
/bin/ksh /usr/local/bin/qjob -q demo_que -d withdraw.ss 11
Queue for semaphore demo_que:
1. PID 29102 has been waiting since 2005-05-12 08:06:01.
/bin/ksh /usr/local/bin/qjob -q demo_que -d deposit.ss 22
Because the queue has only the default 1 resource, deposit.ss waits
in the queue for withdraw.ss to finish.
Because the number of clones for this job is the default
1, assigning another instance of deposit.ss to demo_que fails:
/usr/local/bin/qjob -q demo_que -d deposit.ss 33 &
Can't put job deposit.ss in queue demo_que.
1 instance of job deposit.ss is waiting in queue demo_que.
Removing the /tmp/w11 file terminates the withdraw.ss script, and because
the debug flag is set, the following displays:
Now serving job deposit.ss from queue demo_que.
Again, executing semaphore -l demo_que confirms that deposit.ss
is now running:
Semaphore demo_que has 1 resource.
Resource 0 is taken by PID 13583.
/bin/ksh /usr/local/bin/qjob -q demo_que -d deposit.ss 22
Example 2
Let's demo a bank with two tellers by executing the following
jobs with the -n option set to 2:
/usr/local/bin/qjob -q demo_que -n 2 -d withdraw.ss 11 &
/usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 22 &
Executing semaphore -l produces:
Semaphore demo_que has 2 resources.
Resource 0 is taken by PID 29272.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d withdraw.ss 11
Resource 1 is taken by PID 29437.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 22
Place another deposit in the queue:
/usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 33 &
and verify with semaphore -l demo_que that it is waiting:
Semaphore demo_que has 2 resources.
Resource 0 is taken by PID 29272.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d withdraw.ss 11
Resource 1 is taken by PID 29437.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 22
Queue for semaphore demo_que:
1. PID 281 has been waiting since 2005-05-15 11:30:31.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 33
Attempting to queue another deposit.ss fails because only one instance
of a job may wait in the queue.
Terminating the withdraw.ss job by deleting /tmp/w11 executes
the queued deposit job:
Semaphore demo_que has 2 resources.
Resource 0 is taken by PID 281.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 33
Resource 1 is taken by PID 29437.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -d deposit.ss 22
Example 3
The final example involves allowing the same two tellers, but setting
the number of clones to 2. Execute the following five withdraw.ss
jobs:
/usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 11 &
/usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 22 &
/usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 33 &
/usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 44 &
/usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 55 &
Executing semaphore -l demo_que displays the following output:
Semaphore demo_que has 2 resources.
Resource 0 is taken by PID 8087.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 11
Resource 1 is taken by PID 8206.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 22
Queue for semaphore demo_que:
1. PID 8358 has been waiting since 2005-05-15 11:51:12.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 33
2. PID 8505 has been waiting since 2005-05-15 11:51:43.
/bin/ksh /usr/local/bin/qjob -q demo_que -n 2 -c 2 -d withdraw.ss 44
Because demo_que now allows two resources and two clones of the same
job, only the last withdraw.ss job is discarded.
What's in the Tarball
In addition to the qjob source (Listing 1), we are including an updated
copy of semaphore. The original semaphore contained a bug in
the queue option: it didn't properly limit the number
of jobs that could be placed in the queue.
References
Schaefer, Ed, and John Spurgeon. 2004. "Implementing Semaphores
in the Shell", Sys Admin 13(8):41-47. -- http://www.samag.com/documents/s=9238/sam0408f/0408f.htm
John Spurgeon is a software developer and systems administrator
for Intel's Factory Integrated Information Systems, FIIS, in Aloha,
Oregon. He is currently training for the Furnace Creek 508 and still
enjoys turfgrass management, triathlons, and spending time with
his family.
Ed Schaefer is a frequent contributor to Sys Admin.
He is a software developer and DBA for Intel's Factory Integrated
Information Systems, FIIS, in Aloha, Oregon. Ed also edits the monthly
Shell Corner column on UnixReview.com. He can be reached at: shellcorner@comcast.net.