Computing Environment Crisis Management
Debby Hungerford
Power out? Servers not responding? Phone ringing off the hook?
Monitoring systems sending lots of email? If any of these or other
obvious signs of trouble are beginning to wreak havoc at your place
of work, it's time to reach for that Big Red "Crisis Cookbook" and
follow the procedures.
What?
You don't have a big red book? The procedures aren't documented?
The systems administration team hasn't been trained on crisis management?
Read on...
I am part of an R&D systems administration group, and our
crisis management has evolved into a streamlined and well-organized
procedure. In this article, I will share some tips about how to
be prepared for a crisis.
The best time to prepare for a crisis is before you've had one.
But the best way to hone your crisis procedures is by experiencing
a crisis. So, if you don't have enough crises to hone your procedures,
you can set up some drills.
The main success factors for great crisis management are (not in
any particular order):
1. Solid and evolving procedures
2. Good and simple documentation discipline that is well maintained
3. Central points of entry for work orders (a ticketing system),
phone calls (a hotline), and urgent needs (on-call system)
4. "Battle tactics" leadership
5. Timely and succinct communication within the systems administration
team and to the right external people
6. Good contacts and relationships with departments you depend
on, such as facilities and that large Information Systems group
that provides network infrastructure and support
7. Checklists with document pointers
8. Responsive and trained people who really care about the customers
I'm going to touch on most of these points and really focus on
a couple of them.
In the Engineering Computer Services group at Apple Computer,
we implemented two major tools to help us manage in the event of
a crisis:
1. A crisis cookbook
2. Outages and post-downtime checklists
Big Red Crisis Cookbook
We have identical, red crisis cookbooks in each of three key locations:
1. The office area
2. The lab
3. The main server room
The crisis cookbook has everything from soup to nuts. Examples
of its contents include:
1. Contact information for the systems administration group, other
key people, and certain vendors (e.g., the vendor with whom we store
off-site backup tapes)
2. Downtime information: checklists, notification procedures,
bring-up order
3. Important how-to documents: file server quick reference guide,
our own failover and creation procedures for key servers and services,
restore and snapshot procedures, backups schedule, etc.
4. Hosts files (sorted by hostname, hostfile format, and location)
5. Diagrams of our server room cabinets and contents
6. Terminal server configuration (sorted by port and by hostname)
7. Building maps of areas in which we have users
8. A CD of our documentation tree
9. A pad and pen
A successful crisis cookbook rests on sound documentation procedures.
All documents should contain a header and footer. Our headers give
a brief description of the document's purpose. Our footers follow a
standard format, listing the level of confidentiality, the pathname
of the document, the author's email address, the date created, and
the date(s) modified.
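For illustration, a footer in that standard format might look
something like the template below. The field labels and layout here
are my own paraphrase of the fields just listed, not our exact
template:

Confidentiality: <level of confidentiality>
Document path:   <full pathname of this document>
Author:          <author's email address>
Created:         <date created>
Modified:        <date(s) modified>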
All documentation is reviewed by the systems administration group
as it is put into production. Most documents are flat-text files.
The author emails the group with a pointer to the document (new
or updated), and the manager ensures that it is reviewed by at least
one person. Documentation that impacts the entire team is reviewed
in a team meeting.
Updates and Tests
How do we keep the cookbook updated, and how does the team stay
fresh about the crisis cookbook's contents?
One person in the group is designated as responsible for maintaining
the crisis cookbook. A cron job submits a work order every quarter,
reminding us that the cookbook needs to be reviewed and possibly updated.
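As a rough sketch, such a reminder can be as little as a single
crontab entry; the script name and path below are hypothetical
stand-ins for whatever opens a ticket in your work order system:

# 9:00 a.m. on the first day of each quarter: open a work order
# reminding the team to review the crisis cookbook.
0 9 1 1,4,7,10 * /usr/local/adm/bin/open-cookbook-review-ticket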
We have a cookbook test every 3 to 6 months. Each team member
writes a list of the documents they remember are in the crisis cookbook.
They also list documents that they think should be in it but perhaps
aren't. Team members get a point for each right answer and extra
points if they've identified a document that isn't in the book but
should be. The winner gets a prize (like something from the company
store that the manager buys), and the loser gets a booby prize.
One Halloween, the booby prize was wearing a silly light-up spooky
tie. It's important to make the process both fun and worthwhile.
We've seen marked improvements in document identification since
implementing regular tests.
Outages and Post-Downtime Checklists
Probably the most important aspect of the crisis cookbook, and of
crisis management in general, is having good checklists.
Our checklist serves several purposes. The outages portion of
the checklist gets us started with the following steps:
1. Notifying our management
2. Opening a ticket with Information Systems or Facilities if
the outage is network or power related
3. Escalating notification in those groups as appropriate
4. Notifying other key people via voicemail if email isn't working
5. Doing walk-arounds
The event leader ensures someone is always there to answer the
hotline phone, while one or two designated people work on and track
the problem, and then the rest of the team usually does the walk-arounds.
Walk-arounds are really important. If the network is down or the power
is out, the best way to get information across is human contact.
In a walk-around, the systems administrators each take a couple
of floors in our two primary buildings and spread the word to as
many people as they find. The event leader keeps the roving systems
administrators updated by cell phone as to status, and calls them
back when everything is supposedly working. At that point, it's
time to go into the next phase, which is verifying that the environment
is back up and running.
This verification phase is covered by the post-downtime checklist.
Here are some of the key components of this process:
1. A checklist coordinator is identified to lead the team through
the event.
2. We communicate the "all clear" to our management and notify
them that we're beginning the verification process.
3. Again, we ensure that someone is manning the phone.
4. We test our work order system.
5. We check all points of entry for work orders or calls from
users and process them.
6. We check/clear NFS, printers and plotters, servers, the job
management system, CAD tools, data management, backups, licenses,
databases, email, shares, cron, MRTG, the Web, etc. (a sketch of
automating a few of these checks follows this list).
7. We communicate the "all clear" to users.
8. We line up a recovery crew for the evening or next morning
if appropriate.
9. We conduct a post mortem, or lessons-learned session. All post
mortems are documented, and action items that will take more than
a day to complete are put into the ticketing system.
10. The environment bring-up order is also documented on the checklist.
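A few of the routine checks in step 6 lend themselves to a small
helper script. The sketch below is purely illustrative, assuming
hypothetical hostnames, TCP ports, and NFS mount points rather than
anything from our environment; it simply confirms that a handful of
expected services answer on their ports and that expected file systems
appear in the mount table.

#!/usr/bin/env python3
"""Post-downtime spot checks -- an illustrative sketch, not a production script.

The hostnames, ports, and mount points below are placeholders; substitute
values for your own environment.
"""
import socket
import subprocess
import sys

# Hypothetical (hostname, TCP port) pairs we expect to answer after a bring-up.
SERVICES = [
    ("fileserver.example.com", 2049),   # NFS
    ("printserver.example.com", 515),   # lpd
    ("mailhub.example.com", 25),        # SMTP
    ("intranet.example.com", 80),       # internal web server
]

# Hypothetical NFS mount points we expect to find in the mount table.
EXPECTED_MOUNTS = ["/proj", "/tools", "/home"]


def port_open(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def mounted(mount_point):
    """Return True if mount_point appears in the output of mount(8)."""
    output = subprocess.run(["mount"], capture_output=True, text=True).stdout
    for line in output.splitlines():
        fields = line.split()
        # Typical format: "<device> on <mount point> ..."
        if len(fields) >= 3 and fields[1] == "on" and fields[2] == mount_point:
            return True
    return False


def main():
    failures = 0
    for host, port in SERVICES:
        ok = port_open(host, port)
        print(f"{'OK  ' if ok else 'FAIL'} {host}:{port}")
        failures += not ok
    for mount_point in EXPECTED_MOUNTS:
        ok = mounted(mount_point)
        print(f"{'OK  ' if ok else 'FAIL'} mount {mount_point}")
        failures += not ok
    sys.exit(1 if failures else 0)


if __name__ == "__main__":
    main()

Anything the script flags still gets a human look; the point is to
shorten the walk through the checklist, not to replace it.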
I've been in R&D Unix systems administration and management
for more than 20 years, at VLSI Technology, MIPS/SGI, Octel/Lucent,
and now at Apple Computer. I'm here to tell you that bad things
are still going to happen, but good crisis management will help
you deal with them much more successfully.
Debby Hungerford came up in the trenches as a lone systems
administrator at VLSI Technology in the 1980s. She helped develop
some of the early standardization and consistency philosophies for
R&D Unix environments used in Silicon Valley. She gained further
senior systems administration and management experience by rebuilding
the environments and R&D Unix system administration groups at
MIPS Computer Systems and Octel. Debby now works at Apple Computer
in the same line of work. She can be contacted at hungerford@apple.com.