Computing Environment Crisis Management
Debby Hungerford
Power out? Servers not responding? Phone ringing off the hook?
Monitoring systems sending lots of email? If any of these or other
obvious signs of trouble are beginning to wreak havoc at your place
of work, it's time to reach for that Big Red "Crisis Cookbook" and
follow the procedures.
What?
You don't have a big red book? The procedures aren't documented?
The systems administration team hasn't been trained on crisis management?
Read on...
I am part of an R&D systems administration group, and our
crisis management has evolved into a streamlined and well-organized
procedure. In this article, I will share some tips about how to
be prepared for a crisis.
The best time to prepare for a crisis is before you've had one.
But the best way to hone your crisis procedures is by experiencing
a crisis. So, if you don't have enough crises to hone your procedures,
you can set up some drills.
The main success factors for great crisis management are (not in
any particular order):
1. Solid and evolving procedures
2. Good and simple documentation discipline that is well maintained
3. Central points of entry for work orders (a ticketing system),
phone calls (a hotline), and urgent needs (on-call system)
4. "Battle tactics" leadership
5. Timely and succinct communication within the systems administration
team and to the right external people
6. Good contacts and relationships with departments you depend
on, such as facilities and that large Information Systems group
that provides network infrastructure and support
7. Checklists with document pointers
8. Responsive and trained people who really care about the customers
I'm going to touch on most of these points and really focus on
a couple of them.
In the Engineering Computer Services group at Apple Computer,
we implemented two major tools to help us manage in the event of
a crisis:
1. A crisis cookbook
2. Outages and post-downtime checklists
Big Red Crisis Cookbook
We have identical, red crisis cookbooks in each of three key locations:
1. The office area
2. The lab
3. The main server room
The crisis cookbook has everything from soup to nuts. Examples
of its contents include:
1. Contact information for the systems administration group, other
key people, and certain vendors (e.g., the vendor with whom we store
off-site backup tapes)
2. Downtime information: checklists, notification procedures,
bring-up order
3. Important how-to documents: file server quick reference guide,
our own failover and creation procedures for key servers and services,
restore and snapshot procedures, backups schedule, etc.
4. Hosts files (sorted by hostname, hostfile format, and location)
5. Diagrams of our server room cabinets and contents
6. Terminal server configuration (sorted by port and by hostname)
7. Building maps of areas in which we have users
8. A CD of our documentation tree
9. A pad and pen
A successful crisis cookbook rests on sound documentation procedures.
All documents should contain a header and footer. Our headers give
a brief description of the document's purpose. Our footers follow a
standard format, listing the level of confidentiality, the pathname
of the document, the author's email address, the date created, and
the date(s) modified.
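For illustration, a footer in that standard format might look
something like the template below. The field labels and layout here
are my own paraphrase of the fields just listed, not our exact
template:

Confidentiality: <level of confidentiality>
Document path:   <full pathname of this document>
Author:          <author's email address>
Created:         <date created>
Modified:        <date(s) modified>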
All documentation is reviewed by the systems administration group
as it is put into production. Most documents are flat-text files.
The author emails the group with a pointer to the document (new
or updated), and the manager ensures that it is reviewed by at least
one person. Documentation that impacts the entire team is reviewed
in a team meeting.
Updates and Tests
How do we keep the cookbook updated, and how does the team stay
fresh about the crisis cookbook's contents?
One person in the group is designated as responsible for maintaining
the crisis cookbook. A cron job submits a work order every quarter,
reminding us that the cookbook needs to be reviewed and possibly updated.
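As a rough sketch, such a reminder can be as little as a single
crontab entry; the script name and path below are hypothetical
stand-ins for whatever opens a ticket in your work order system:

# 9:00 a.m. on the first day of each quarter: open a work order
# reminding the team to review the crisis cookbook.
0 9 1 1,4,7,10 * /usr/local/adm/bin/open-cookbook-review-ticket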
We have a cookbook test every 3 to 6 months. Each team member
writes a list of the documents they remember are in the crisis cookbook.
They also list documents that they think should be in it but perhaps
aren't. Team members get a point for each right answer and extra
points if they've identified a document that isn't in the book but
should be. The winner gets a prize (like something from the company
store that the manager buys), and the loser gets a booby prize.
One Halloween, the booby prize was wearing a silly light-up spooky
tie. It's important to make the process both fun and worthwhile.
We've seen marked improvements in document identification since
implementing regular tests.
Outages and Post-Downtime Checklists
Probably the most important aspect of the crisis cookbook, and of
crisis management in general, is having good checklists.
Our checklist serves several purposes. The outages portion of
the checklist gets us started with the following steps:
1. Notifying our management
2. Opening a ticket with Information Systems or Facilities if
the outage is network or power related
3. Escalating notification in those groups as appropriate
4. Notifying other key people via voicemail if email isn't working
5. Doing walk-arounds
The event leader ensures someone is always there to answer the
hotline phone, while one or two designated people work on and track
the problem, and then the rest of the team usually does the walk-arounds.
Walk-arounds are really important. If the network is down or the power
is out, the best way to get information across is human contact.
In a walk-around, the systems administrators each take a couple
of floors in our two primary buildings and spread the word to as
many people as they find. The event leader keeps the roving systems
administrators updated by cell phone as to status, and calls them
back when everything is supposedly working. At that point, it's
time to go into the next phase, which is verifying that the environment
is back up and running.
This verification phase is covered by the post-downtime checklist.
Here are some of the key components of this process:
1. A checklist coordinator is identified to lead the team through
the event.
2. We communicate the "all clear" to our management and notify
them that we're beginning the verification process.
3. Again, we ensure that someone is manning the phone.
4. We test our work order system.
5. We check all points of entry for work orders or calls from
users and process them.
6. We check/clear NFS, printers and plotters, servers, the job
management system, CAD tools, data management, backups, licenses,
databases, email, shares, cron, MRTG, the Web, etc. (a sketch of
automating a few of these checks follows this list).
7. We communicate the "all clear" to users.
8. We line up a recovery crew for the evening or next morning
if appropriate.
9. We conduct a post mortem, or lessons-learned session. All post
mortems are documented, and action items that will take more than
a day to complete are put into the ticketing system.
10. The environment bring-up order is also documented on the checklist.
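A few of the routine checks in step 6 lend themselves to a small
helper script. The sketch below is purely illustrative, assuming
hypothetical hostnames, TCP ports, and NFS mount points rather than
anything from our environment; it simply confirms that a handful of
expected services answer on their ports and that expected file systems
appear in the mount table.

#!/usr/bin/env python3
"""Post-downtime spot checks -- an illustrative sketch, not a production script.

The hostnames, ports, and mount points below are placeholders; substitute
values for your own environment.
"""
import socket
import subprocess
import sys

# Hypothetical (hostname, TCP port) pairs we expect to answer after a bring-up.
SERVICES = [
    ("fileserver.example.com", 2049),   # NFS
    ("printserver.example.com", 515),   # lpd
    ("mailhub.example.com", 25),        # SMTP
    ("intranet.example.com", 80),       # internal web server
]

# Hypothetical NFS mount points we expect to find in the mount table.
EXPECTED_MOUNTS = ["/proj", "/tools", "/home"]


def port_open(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def mounted(mount_point):
    """Return True if mount_point appears in the output of mount(8)."""
    output = subprocess.run(["mount"], capture_output=True, text=True).stdout
    for line in output.splitlines():
        fields = line.split()
        # Typical format: "<device> on <mount point> ..."
        if len(fields) >= 3 and fields[1] == "on" and fields[2] == mount_point:
            return True
    return False


def main():
    failures = 0
    for host, port in SERVICES:
        ok = port_open(host, port)
        print(f"{'OK  ' if ok else 'FAIL'} {host}:{port}")
        failures += not ok
    for mount_point in EXPECTED_MOUNTS:
        ok = mounted(mount_point)
        print(f"{'OK  ' if ok else 'FAIL'} mount {mount_point}")
        failures += not ok
    sys.exit(1 if failures else 0)


if __name__ == "__main__":
    main()

Anything the script flags still gets a human look; the point is to
shorten the walk through the checklist, not to replace it.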
I've been in R&D Unix systems administration and management
for more than 20 years, at VLSI Technology, MIPS/SGI, Octel/Lucent,
and now at Apple Computer. I'm here to tell you that bad things
are still going to happen, but good crisis management will help
you deal with them much more successfully.
Debby Hungerford came up in the trenches as a lone systems
administrator at VLSI Technology in the 1980s. She helped develop
some of the early standardization and consistency philosophies for
R&D Unix environments used in Silicon Valley. She gained further
senior systems administration and management experience by rebuilding
the environments and R&D Unix system administration groups at
MIPS Computer Systems and Octel. Debby now works at Apple Computer
in the same line of work. She can be contacted at hungerford@apple.com.