Cover V14, i08

Standardize Your Backups with a Decision Point System

Ian Mahaney

Backup and restore policies can be difficult to define and understand. As business priorities shift, new products are introduced, and requirements change, administrators are faced with managing an effective backup solution that spans the entire enterprise. Today's dynamic environments often create uncertainty about the properties of a backup policy and leave IT managers asking "why?" To simplify the creation of backup policies and answer that question, administrators can use a Decision Point System when determining backup requirements.

The DPS (Decision Point System) is based on a set of tools that have been developed to standardize and simplify the creation of policy definitions. These tools offer an easy approach to determining your backup requirements and the fixed costs associated with those requirements. They also provide consistency when determining which resources are allocated per policy and allow administrators to be proactive in predicting resource utilization over time. Their use creates a level of accountability that can be referenced when IT managers ask "why?"

The fundamental tools that form the basis of the DPS are Policy Decision Points, Decision Point Values, and Decision Point Policy Tables. Each tool has a specific role within the DPS. These roles are discussed in detail and presented in a real-world context to make their implementation easier to understand.

Before beginning a DPS implementation, administrators must understand the properties of a backup policy. All backup policies have an associated window, run length, frequency, and retention period. These items, in addition to backup size, are the key factors in determining how many resources are used, what type of resources they are, and for how long they are occupied. They are collectively referred to as policy details. The value of each policy detail is determined through the use of policy decision points.

Policy decision points usually consist of both business- and technology-based rule sets. They are the foundation of the Decision Point System and are always created first in a DPS implementation. Within the rule sets, each decision point relates to a policy detail. The decision points are therefore divided into two categories: Frequency and Retention, and Run Length and Window.

Each decision point should also be assigned a percentage value that will remain constant throughout the DPS implementation. The sum of the percentage values within a category must equal 100. Each percentage is typically based on an assumed overall value, with consideration for company objectives and product importance. The percentages are normally agreed through discussions between business owners, such as product managers, and administrators. This is often the most difficult task in a DPS implementation, so an example of a decision point set and associated percentage values is provided below.

Decision Points and Their Percentages

Category 1: Frequency and Retention

  • 35% -- Revenue associated with the data being backed up
  • 30% -- Loss projections in the event of a failure
  • 20% -- Timeframe from which data may need to be restored
  • 12% -- Modification and access of data
  • 3% -- Type of data

Category 2: Run Length and Window

  • 40% -- Restore times in the event of a failure
  • 40% -- Business cycle downtime impact
  • 20% -- Performance impact of backup jobs
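
Because each category's percentages must total 100, it can help to check the weights mechanically before they are used in any calculation. The following is a minimal Python sketch using the example weights listed above; the dictionary keys and the `validate_weights` name are illustrative labels chosen for this sketch, not part of the DPS itself.

```python
# Example decision points and percentages from the two categories above.
# The key names are illustrative shorthand, not official DPS terms.
DECISION_POINTS = {
    "frequency_retention": {
        "revenue": 35,
        "loss_projection": 30,
        "restore_timeframe": 20,
        "modification_access": 12,
        "data_type": 3,
    },
    "run_length_window": {
        "restore_time": 40,
        "downtime_impact": 40,
        "performance_impact": 20,
    },
}


def validate_weights(categories):
    """Raise ValueError if any category's percentages do not total 100."""
    for name, weights in categories.items():
        total = sum(weights.values())
        if total != 100:
            raise ValueError(f"{name}: weights sum to {total}, expected 100")


validate_weights(DECISION_POINTS)  # passes silently for the example weights
```

Keeping the percentages as whole numbers avoids floating-point rounding when the totals are checked.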

The policy decision points and their percentages are then used in the calculation of decision point values. They can be applied to existing or future policy definitions. Decision point values ultimately determine the details of a policy definition. They are easily defined by reviewing each decision point and how it relates to the data set to be backed up. This relation is then translated into a numeric value between 1 and 10 and assigned to the decision point for that policy.

A score of 10 indicates that the decision point is critical and should receive the full percentage associated with it. A score of 6 indicates that the decision point has a decreased value for this backup policy and should receive only 60% of the total percentage possible. The numeric value assigned to each decision point within a category is then multiplied by the associated percentage. We then sum the total of all decision point values and multiply by 10 to receive a point total per category. These point totals are then used in referencing the decision point policy tables to determine an appropriate window, run length, frequency, and retention period.
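
The calculation described above can be sketched in a few lines of Python. The sketch assumes that each percentage is applied as a decimal fraction (40% as 0.40), so a category in which every decision point scores 10 yields a point total of 100. The scores in the example are hypothetical and are not taken from any policy detail form.

```python
def category_point_total(weights, scores):
    """Return the point total for one category.

    weights: decision point -> percentage (whole numbers that total 100)
    scores:  decision point -> score from 1 to 10 for this policy
    """
    if set(weights) != set(scores):
        raise ValueError("every decision point needs exactly one score")
    for point, score in scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"{point}: score {score} is outside 1-10")
    # score * (percent / 100), summed and multiplied by 10, reduces to
    # sum(score * percent) / 10.
    return sum(scores[p] * weights[p] for p in weights) / 10


# Run Length and Window weights from the example category above.
weights = {"restore_time": 40, "downtime_impact": 40, "performance_impact": 20}

# Hypothetical scores for one data set.
scores = {"restore_time": 10, "downtime_impact": 6, "performance_impact": 5}

total = category_point_total(weights, scores)
print(total)  # 74.0
```

Here the score of 6 contributes 60% of its 40 available points (24), which matches the rule described in the text.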

Decision Point Policy Tables

To follow the DPS standard, policy definitions are created using the decision point policy tables. The tables provide a simple alignment of policy details and are arranged in order of category point totals. They are typically based on business logic translated into a technical representation through a risk assessment. To understand how they are defined, we must look at each one individually.

The details of the frequency-retention table are based on the available values within your backup solution and the overall risk assessment of all the data to be backed up within the enterprise. For example, a risk assessment may determine that the most frequent backup interval is hourly and the least frequent is weekly.

In this scenario, the hourly value would be associated with the greatest point total, while the weekly value would be associated with the lowest point total. This logic also holds true for the retention period; however, the retention periods are matched against the frequency values. If the risk assessment has determined that data should be stored for a minimum of 2 weeks and a maximum of 6 months, then these values must be systematically defined in the table while matched against a frequency value (see Table 1). Our risk assessment has determined that our minimum and maximum retention periods are 2 weeks and 6 months, respectively. Because two identical frequency values are defined within the table, we systematically match our retention periods to them as 2 weeks and 4 weeks or more.
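
One way to keep a frequency-retention table consistent with the risk assessment is to hold it as data and check it automatically. The sketch below is only an illustration: the rows are hypothetical and do not reproduce Table 1, the point thresholds are invented, and 6 months is approximated as 182 days.

```python
from typing import NamedTuple


class FrequencyRetentionRow(NamedTuple):
    min_points: int      # lowest category point total that selects this row
    interval_hours: int  # hours between backups (1 = hourly, 168 = weekly)
    retention_days: int  # how long each backup is kept


# Hypothetical rows for illustration only; these are not the values in Table 1.
EXAMPLE_TABLE = [
    FrequencyRetentionRow(0, 168, 14),   # weekly, 2 weeks
    FrequencyRetentionRow(25, 24, 14),   # daily, 2 weeks
    FrequencyRetentionRow(50, 24, 28),   # daily, 4 weeks
    FrequencyRetentionRow(75, 1, 182),   # hourly, about 6 months
]


def check_table(rows, min_interval=1, max_interval=168, min_days=14, max_days=182):
    """Return a list of problems; an empty list means the table is consistent."""
    problems = []
    for row in rows:
        if not min_interval <= row.interval_hours <= max_interval:
            problems.append(f"{row.min_points}: interval outside assessed range")
        if not min_days <= row.retention_days <= max_days:
            problems.append(f"{row.min_points}: retention outside assessed range")
    for lower, higher in zip(rows, rows[1:]):
        if higher.min_points <= lower.min_points:
            problems.append(f"{higher.min_points}: point totals must increase")
        if higher.interval_hours > lower.interval_hours:
            problems.append(f"{higher.min_points}: backups become less frequent")
        if higher.retention_days < lower.retention_days:
            problems.append(f"{higher.min_points}: retention becomes shorter")
    return problems


print(check_table(EXAMPLE_TABLE))  # []
```

The two daily rows show the case described above, where identical frequency values are paired with different retention periods.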

The run-length and window table is defined similarly to the frequency-retention table. However, this table must take into consideration a company's business cycle. The business cycle allows administrators to determine utilization trends over a specific period of time and can usually be depicted as a wave or set of waves with peaks and valleys. Using the risk assessment and business cycle, administrators can define their window and run-length values within the table. This can be seen in Table 2. The business cycle has determined that our down periods are between 8 p.m. and 8 a.m., while peak periods are between 8 a.m. and 8 p.m. To follow good practice, critical backups should be performed during non-peak hours. Since critical backups typically receive high category point totals, we assign our non-peak hours to the greatest point totals within the table and our peak hours to the lowest. These values should then be spread over the point total rows defined within the table. As with the previous table, we are then able to match run-length values against defined window values.

In this example, it has been determined that the minimum run length of any job during non-peak and peak hours should be 1 hour. This value is then assigned to the greatest category point total during non-peak hours and the greatest category point total during peak hours. It has also been determined that the maximum run length should be no longer than 8 hours. Using the range of 1 to 8 hours, we then match the run-length values to the window values as we did in the frequency-retention table. When all values have been determined, the tables are ready for use. To use the tables, an administrator simply locates a data set's category point total within the table and implements the policy detail associated with that point total in the backup policy. The examples provided in Tables 1 and 2 may be used as guidelines when determining your decision point policy tables.
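
Locating a category point total within a table amounts to finding the row whose point range contains it. A minimal sketch of that lookup follows, using Python's standard `bisect` module. The rows are hypothetical and do not reproduce Table 2; only the peak and non-peak hours and the 1 to 8 hour run-length range come from the text, and the point thresholds are invented for illustration.

```python
from bisect import bisect_right

# (lowest point total for the row, backup window, run length in hours)
# Hypothetical rows for illustration; these are not the values in Table 2.
RUN_LENGTH_WINDOW_TABLE = [
    (0, "peak 08:00-20:00", 8),
    (15, "peak 08:00-20:00", 4),
    (30, "peak 08:00-20:00", 1),
    (50, "non-peak 20:00-08:00", 8),
    (70, "non-peak 20:00-08:00", 4),
    (90, "non-peak 20:00-08:00", 1),
]


def lookup(table, point_total):
    """Return the row whose threshold is the largest one <= point_total."""
    thresholds = [row[0] for row in table]
    index = bisect_right(thresholds, point_total) - 1
    if index < 0:
        raise ValueError(f"point total {point_total} is below the first row")
    return table[index]


print(lookup(RUN_LENGTH_WINDOW_TABLE, 74))
# (70, 'non-peak 20:00-08:00', 4)
```

The table must be sorted by threshold for the binary search to work; the same function can serve a frequency-retention table stored in the same shape.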

Calculating decision point values and using the decision point policy tables is essential in standardizing and providing a successful DPS implementation. However, one of the key advantages of standardization and the DPS is the ability to gain a level of accountability for backup policies. Accountability is derived from the use of the policy decision points, decision point values, and decision point policy tables but will be lost without the ability to record this information for historical record, future trending, and analysis.

To record this data and simplify your calculations, a policy detail form is included with the DPS. The policy detail form should include, but is not limited to, policy name, data type, size, decision point values, cumulative category scores, resource equation values, window, current run length, allocated run length, frequency, retention, date modified, and backup administrator.
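
For administrators who prefer to keep the form in a machine-readable record rather than on paper or in a spreadsheet, the fields listed above map naturally onto a small data structure. The following is a sketch only: the field names, types, and sample values are assumptions made for illustration, and the actual policy detail form may be laid out differently.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class PolicyDetailForm:
    policy_name: str
    data_type: str
    size_mb: int
    window: str
    allocated_run_length_hours: float
    frequency: str
    retention: str
    date_modified: date
    backup_administrator: str
    current_run_length_hours: Optional[float] = None  # unknown for a new policy
    decision_point_values: Dict[str, int] = field(default_factory=dict)
    category_scores: Dict[str, float] = field(default_factory=dict)
    resource_equation_values: Dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the form so it can be stored for history and trending."""
        record = asdict(self)
        record["date_modified"] = self.date_modified.isoformat()
        return json.dumps(record, sort_keys=True)


# Placeholder values for illustration only.
form = PolicyDetailForm(
    policy_name="example-policy",
    data_type="database",
    size_mb=300 * 1024,
    window="20:00-08:00",
    allocated_run_length_hours=2.0,
    frequency="daily",
    retention="4 weeks",
    date_modified=date(2005, 8, 1),
    backup_administrator="admin",
)
record = form.to_json()
```

Storing one record per revision, keyed by policy name and modification date, preserves the history needed for the trending and analysis described above.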

Using the policy detail form along with the examples provided, we can easily construct real-world scenarios to gain a better understanding of how different decision point values affect the outcome of policy details within the DPS. In the following scenarios, we'll review two data sets and apply the DPS to both. The two cases use identical data types but differ in access patterns and business importance.

Scenario 1: Production Database

In the first scenario, we'll review a new production database environment. The environment will maintain a set of databases that are to be used for Web form information, application processing, and reporting features. A large number of products will depend on the availability of these databases and the data they maintain. Any downtime within this environment would contribute to significant revenue loss. The total size of the databases is estimated to be approximately 300 GB.

The characteristics mentioned previously provide enough information for administrators to begin a DPS-based policy definition. To do so, we must first review our policy decision points and determine how relevant each point is to the data that must be backed up. Because each decision point is outlined in the policy detail form, administrators can simply enter the values they believe to be valid for each decision point. The form shown in Figure 1 depicts the decision point values associated with each point in the DPS implementation. These values have also been used with the decision point policy tables to determine frequency, retention, run length, and window.

Now that the policy detail form has been completed, we can use these values to determine the resource requirements necessary for the policy. To do this, we use a set of resource equations that are provided below:

Equation 1: Storage Utilization

Storage Utilization = (size)(growth% + 1)(frequency)(retention)

  • Size: Expressed in MB.
  • Growth%: Expected growth over time, expressed as a decimal.
  • Frequency: Number of jobs completed over a one-week period.
  • Retention: Time period for which jobs will remain on tape, expressed in days.

Equation 2: Throughput Requirement

Throughput = size / (runLength * 3600)(1 - growth%)

  • Size: Expressed in MB.
  • runLength: Expected job completion time, expressed in hours.
  • Growth%: Expected growth over time, expressed as a decimal.
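
The two resource equations translate directly into code. The sketch below makes two assumptions that should be checked against your own use of the DPS. First, it takes frequency and retention exactly as supplied and performs no unit conversion, so their product should represent the number of backup copies held at one time. Second, it reads Equation 2 as size divided by the product of the run length in seconds and (1 - growth%); that grouping is an interpretation of the expression as written. The function names are illustrative.

```python
def storage_utilization_mb(size_mb, growth, frequency, retention):
    """Equation 1: (size)(growth% + 1)(frequency)(retention), in MB."""
    if min(size_mb, growth, frequency, retention) < 0:
        raise ValueError("inputs must not be negative")
    return size_mb * (1 + growth) * frequency * retention


def throughput_mb_per_s(size_mb, run_length_hours, growth):
    """Equation 2, read as size / ((runLength * 3600)(1 - growth%)), in MB/s."""
    if run_length_hours <= 0:
        raise ValueError("run length must be positive")
    if not 0 <= growth < 1:
        raise ValueError("growth must be a decimal in the range [0, 1)")
    return size_mb / ((run_length_hours * 3600) * (1 - growth))


# Small round numbers chosen only to make the arithmetic easy to follow.
print(storage_utilization_mb(100, 0.0, 2, 3))  # 600.0
print(throughput_mb_per_s(7200, 1, 0.0))       # 2.0
print(throughput_mb_per_s(7200, 1, 0.5))       # 4.0
```

Under this reading, expected growth raises the throughput requirement, which leaves headroom for the data set to grow within the same run length.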

Using the resource equations and the values provided from the policy detail form, we can conclude that the policy requires a total storage capacity of 8,601,600 MB, or roughly 8 TB. We can also determine that an average throughput of 42.67 MBps must be achieved to back up the entire policy within the allocated run length. Therefore, all data paths from the client to the storage devices must support an aggregate throughput greater than 42.67 MBps.
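
Without reading any inputs from the policy detail form, the reported figures can be checked for internal consistency by working backward from them. The sketch below assumes 1 GB = 1,024 MB and zero expected growth; under those assumptions it derives the combined multiplier in Equation 1 and the run length implied by Equation 2. The derived values are inferences from the reported results, not values quoted from the form.

```python
SIZE_MB = 300 * 1024            # 300 GB expressed in MB (assumes 1 GB = 1,024 MB)
REPORTED_STORAGE_MB = 8_601_600
REPORTED_THROUGHPUT_MBPS = 42.67

# Equation 1 rearranged: (growth% + 1)(frequency)(retention) = storage / size
combined_multiplier = REPORTED_STORAGE_MB / SIZE_MB

# Equation 2 rearranged with zero growth: runLength = size / (throughput * 3600)
implied_run_length_hours = SIZE_MB / (REPORTED_THROUGHPUT_MBPS * 3600)

storage_tb = REPORTED_STORAGE_MB / 1024 / 1024

print(combined_multiplier)                  # 28.0
print(round(implied_run_length_hours, 2))   # 2.0
print(round(storage_tb, 2))                 # 8.2
```

In other words, the reported storage figure corresponds to 28 full-size copies of the data set, and the reported throughput corresponds to a 2-hour allocated run length.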

The resource equations also allow us to determine the number of tapes or storage devices required to support the backup policy. If the capacity of each storage device is 400 GB compressed, the administrator can easily calculate that a total of 21 tapes will be required for this backup policy. These values can then be translated into fixed costs associated with the hardware required to support this policy.
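
The tape count is a ceiling division of the required storage by the capacity of one tape. A minimal sketch, assuming 1 GB = 1,024 MB and whole-number inputs; the function name is illustrative:

```python
def tapes_required(storage_mb, tape_capacity_gb):
    """Return the number of tapes needed, rounding any partial tape up."""
    if storage_mb < 0 or tape_capacity_gb <= 0:
        raise ValueError("storage must be >= 0 and capacity must be > 0")
    capacity_mb = tape_capacity_gb * 1024
    # Ceiling division using integers only, which avoids float rounding.
    return -(-storage_mb // capacity_mb)


print(tapes_required(8_601_600, 400))  # 21
```

The 8,601,600 MB requirement fills exactly 21 tapes at 400 GB each, so any growth beyond the estimate would require a 22nd tape.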

Scenario 2: Test/QA Database

The second scenario focuses on a configuration similar to Scenario 1; however, it will be implemented in a test or QA environment. The databases will be snapshots of production and are therefore identical in size. They will also be used for Web form information, application processing, and reporting features. However, there will be no revenue directly attributable to these databases. By moving the system to a different stage of the product lifecycle, the decision point values are altered drastically. Through the policy detail form and subsequent resource equation values, we will show how system properties affect decision point values and ultimately change the policy details of a backup policy.

Based on the values provided in the policy detail form (Figure 2), the category totals are dramatically lower than those of the production environment. Simply moving the databases from a production to a test environment significantly decreases the value of the data. This causes a reduction in the decision point values and ultimately the category totals. The decrease results in a distinct set of policy details for the data set reviewed. Looking further at the resource equations, we can determine that this scenario requires a storage capacity of only 300 GB and a throughput of 21.33 MBps. These values translate into fewer resources and a reduced cost to support the backup policy.
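
The difference between the two scenarios can be summarized from the reported figures alone. The sketch below assumes 1 GB = 1,024 MB and applies the 400 GB tape capacity from Scenario 1 to both policies; the dictionary keys are illustrative names.

```python
GB = 1024  # MB per GB (assumed)

# Storage and throughput as reported for each scenario.
scenarios = {
    "production": {"storage_mb": 8_601_600, "throughput_mbps": 42.67},
    "test_qa": {"storage_mb": 300 * GB, "throughput_mbps": 21.33},
}

TAPE_CAPACITY_MB = 400 * GB


def tapes(storage_mb):
    """Ceiling division: a partially filled tape still counts as one tape."""
    return -(-storage_mb // TAPE_CAPACITY_MB)


prod, test = scenarios["production"], scenarios["test_qa"]
storage_ratio = prod["storage_mb"] / test["storage_mb"]
throughput_ratio = prod["throughput_mbps"] / test["throughput_mbps"]

print(storage_ratio)                                          # 28.0
print(round(throughput_ratio, 1))                             # 2.0
print(tapes(prod["storage_mb"]), tapes(test["storage_mb"]))   # 21 1
```

For the same 300 GB of data, the production policy needs 28 times the storage and about twice the throughput of the test policy.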

The examples provided illustrate how unique environments and varying requirements can use a standard set of tools to define policies within a backup implementation. These can be applied to existing and future policies within your environment and offer a number of advantages for administrators and IT managers.

Conclusion

By creating a simple and systematic approach to backup design and implementation, we answer a number of questions that administrators often face. The DPS creates a level of accountability that may be referenced for historical purposes, trend analysis, scalability concerns, and resource utilization. It gives IT managers the ability to understand their backup environment without being required to know the technical intricacies of the system. It is a standards-based, visible representation of backup design and implementation.

Ian Mahaney began his professional career in 1996 with the Department of Defense. In 2000, he joined Advertising.com as a senior systems engineer, where he served until 2005. Recently, he accepted the role of Director of Technical Operations for a software development company in Stamford, Connecticut. Ian attended Western State College of Colorado and Towson University and is also a member of Phi Theta Kappa and Golden Key National Honor Society. For questions regarding this article, he may be contacted at: imahaney@bidbrain.com.