Standardize Your Backups with a Decision Point System
Ian Mahaney
Backup and restore policies can be difficult to define and understand.
As business priorities shift, new products are introduced, and
requirements change, administrators are faced with the task of managing
an effective backup solution that spans the entire enterprise. Today's
dynamic environments often create uncertainty about the properties
of a backup policy and leave IT managers asking why. To simplify
the creation of backup policies and answer that question, administrators
can utilize a Decision Point System when determining backup requirements.
The Decision Point System (DPS) is based on a set of tools developed
to standardize and simplify the creation of policy definitions.
These tools offer a straightforward approach to determining
your backup requirements and the fixed costs associated with those
requirements. They also provide consistency when determining which
resources are allocated per policy and allow administrators to be
proactive in predicting resource utilization over time. Their use
creates a level of accountability that can be referenced when IT
managers ask that question.
The fundamental tools that are the basis of the DPS include Policy
Decision Points, Decision Point Values, and Decision Point Policy
Tables. Each tool plays a specific role within the DPS. These
roles are discussed in detail and presented in a real-world
context so that their implementation within the DPS is easy to
understand.
Before beginning a DPS implementation, administrators must understand
the properties of a backup policy. All backup policies have an associated
window, run length, frequency, and retention period. These items,
in addition to backup size, are the key factors in determining how many
resources are required, what type they are, and how long they are occupied.
They are collectively referred to as policy details. Determining
the value of policy details is accomplished through the use of policy
decision points.
Policy decision points usually consist of both business- and technology-based
rule sets. They are the foundation of the Decision Point System
and are always created first in a DPS implementation. Within the
rule sets, each decision point relates to a policy detail. The decision
points are therefore divided into two categories: Frequency and Retention,
and Run Length and Window.
Each decision point should also be assigned a percentage value
that will remain constant throughout the DPS implementation. The
sum of the percentage values within a category must equal 100. The
percentages are typically based on an assumed overall value, with consideration
for company objectives and product importance. They are normally
created through discussions between business owners, such as product
managers, and administrators. This is often the most difficult task
in a DPS implementation. As such, an example decision point set and its associated
percentage values is provided below.
Decision Points and Their Percentages
Category 1: Frequency and Retention
- 35% -- Revenue associated with the data being backed up
- 30% -- Loss projections in the event of a failure
- 20% -- Timeframe from which data may need to be restored
- 12% -- Modification and access of data
- 03% -- Type of data
Category 2: Run Length and Window
- 40% -- Restore times in the event of a failure
- 40% -- Business cycle downtime impact
- 20% -- Performance impact of backup jobs
The policy decision points and their percentages are then used
in the calculation of decision point values. They can be applied
to existing or future policy definitions. Decision point values
ultimately determine the details of a policy definition. They are
easily defined by reviewing each decision point and how it relates
to the data set to be backed up. This relation is then translated
into a numeric value between 1 and 10 and assigned to the decision
point for that policy.
A score of 10 indicates that the decision point is critical and
should receive the full percentage associated with it. A score of
6 indicates that the decision point has a decreased value for this
backup policy and should receive only 60% of the total percentage
possible. The numeric value assigned to each decision point within
a category is then multiplied by the associated percentage. We then
sum the resulting products and multiply the total by 10 to
obtain a point total for each category. These point totals are then
used in referencing the decision point policy tables to determine
an appropriate window, run length, frequency, and retention period.
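
As a minimal sketch of this arithmetic, the Category 1 percentages
from the example above can be applied to a set of hypothetical
decision point values:

# Minimal sketch of the category point-total calculation. The weights
# come from the Category 1 example above; the 1-10 scores are hypothetical.

CATEGORY_1_WEIGHTS = {
    "Revenue associated with the data being backed up": 0.35,
    "Loss projections in the event of a failure": 0.30,
    "Timeframe from which data may need to be restored": 0.20,
    "Modification and access of data": 0.12,
    "Type of data": 0.03,
}

def category_point_total(scores, weights):
    # Multiply each 1-10 score by its percentage (as a decimal), sum the
    # products, and multiply by 10 to obtain a point total from 10 to 100.
    return 10 * sum(scores[name] * weight for name, weight in weights.items())

scores = {
    "Revenue associated with the data being backed up": 10,
    "Loss projections in the event of a failure": 8,
    "Timeframe from which data may need to be restored": 6,
    "Modification and access of data": 5,
    "Type of data": 3,
}

print(round(category_point_total(scores, CATEGORY_1_WEIGHTS), 2))   # 77.9
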
Decision Point Policy Tables
To follow the DPS standard, policy definitions are created using
the decision point policy tables. The tables provide a simple alignment
of policy details and are arranged in order of category point totals.
They are typically based on business logic translated into a technical
representation through the use of a risk assessment. To further
understand how they are defined, we must look at each one individually.
The details of the frequency-retention table are based on the available
values within your backup solution and the overall risk assessment
of all the data to be backed up within the enterprise. For example,
a risk assessment may determine that backups should be performed
no more frequently than hourly and no less frequently than weekly.
In this scenario, the hourly value would be associated with the
greatest point total, while the weekly value would be associated
with the lowest point total. This logic also holds true for the retention
period; however, the retention periods are matched against the frequency
values. If the risk assessment has determined that data should be
stored for a minimum of 2 weeks and a maximum of 6 months, then
these values must be systematically defined in the table while matched
against a frequency value (see Table 1). Our risk assessment has
determined that our minimum and maximum retention periods are 2
weeks and 6 months, respectively. Because two identical frequency
values are defined within the table, we systematically match our
retention periods within the table as 2 weeks and 4 weeks or more.
The run-length and window table is defined similarly to the frequency-retention
table. However, this table must take into consideration a company's
business cycle. The business cycle allows administrators to determine
utilization trends over a specific period of time and can usually
be depicted as a wave or set of waves with peaks and valleys. Using
the risk assessment and business cycle, administrators can define
their windows and run-length values within the table. This can be
seen in Table 2. The business cycle shows that our
down periods are between 8 p.m. and 8 a.m., while peak periods are
between 8 a.m. and 8 p.m. To follow good practice, critical backups
should be performed during non-peak hours. Since critical backups
typically produce high category point totals, we assign
our non-peak hours to the greatest values within the table and our
peak hours to our lowest values. These values should then be spread
over the point total rows defined within the table. As with the
previous table, we are then able to match run-length values against
defined window values.
In this example, it has been determined that the minimum run length
of any job during non-peak and peak hours should be 1 hour. This
value is then assigned to the greatest category point total during
non-peak hours and the greatest category point total during peak
hours. It has also been determined that the maximum run-length period
should be no longer than 8 hours. Using the range of 1 to 8 hours,
we then match the run-length values to the window values as we did
with the frequency-retention table. When all values have been determined,
the tables are then ready for use. To utilize the tables, an administrator
simply locates a data set's category point total within the table
and implements the policy detail associated with that category point
total in their backup policy. The examples provided in Tables 1
and 2 may be used as guidelines when determining your decision point
policy tables.
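
As a hedged illustration of how such tables might be represented and
consulted, the sketch below uses hypothetical point-total thresholds
and policy details patterned on the ranges discussed above; actual
values should come from your own risk assessment and business cycle.

# Hypothetical sketch of decision point policy table lookups. The
# thresholds and policy details are illustrative placeholders, not the
# contents of Tables 1 and 2.

FREQUENCY_RETENTION_TABLE = [
    # (minimum category point total, frequency, retention)
    (90, "hourly", "2 weeks"),
    (70, "daily",  "4 weeks"),
    (40, "weekly", "3 months"),
    (10, "weekly", "6 months"),
]

RUN_LENGTH_WINDOW_TABLE = [
    # (minimum category point total, run length, window)
    (90, "1 hour",  "8 p.m. - 8 a.m."),
    (70, "4 hours", "8 p.m. - 8 a.m."),
    (40, "4 hours", "8 a.m. - 8 p.m."),
    (10, "8 hours", "8 a.m. - 8 p.m."),
]

def lookup(table, point_total):
    # Return the policy details of the first row whose minimum point
    # total the category point total meets or exceeds.
    for threshold, *details in table:
        if point_total >= threshold:
            return details
    return list(table[-1][1:])

print(lookup(FREQUENCY_RETENTION_TABLE, 77.9))   # ['daily', '4 weeks']
print(lookup(RUN_LENGTH_WINDOW_TABLE, 77.9))     # ['4 hours', '8 p.m. - 8 a.m.']
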
Calculating decision point values and using the decision point
policy tables is essential to standardizing the process and achieving
a successful DPS implementation. However, one of the key advantages of standardization
and the DPS is the ability to gain a level of accountability for
backup policies. Accountability is derived from the use of the policy
decision points, decision point values, and decision point policy
tables, but it will be lost without a way to record this information
for historical reference, future trending, and analysis.
To record this data and simplify your calculations, a policy detail
form is included with the DPS. The policy detail form should include,
but is not limited to, policy name, data type, size, decision point
values, cumulative category scores, resource equation values, window,
current run length, allocated run length, frequency, retention,
date modified, and backup administrator.
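
A minimal sketch of such a form as a simple record is shown below;
the field names follow the list above, while the exact structure is an
assumption rather than a prescribed DPS artifact.

# Sketch of a policy detail form as a simple record. Field names follow
# the list above; the exact structure is an assumption.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class PolicyDetailForm:
    policy_name: str
    data_type: str
    size_mb: int
    decision_point_values: dict       # decision point -> 1-10 score
    cumulative_category_scores: dict  # category -> point total
    resource_equation_values: dict    # storage, throughput, tape count
    window: str
    current_run_length: str
    allocated_run_length: str
    frequency: str
    retention: str
    backup_administrator: str
    date_modified: date = field(default_factory=date.today)
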
Using the policy detail form along with the examples provided,
we can easily construct real-world scenarios to gain a better understanding
of how different decision point values affect the outcome of policy
details within the DPS. In the following scenarios, we'll review
two data sets and apply the DPS to both. Each case uses identical
data types but differs in access patterns and business importance.
Scenario 1: Production Database
In the first scenario, we'll review a new production database
environment. The environment will maintain a set of databases that
are to be used for Web form information, application processing,
and reporting features. A large number of products will depend on
the availability of these databases and the data they maintain.
Any downtime within this environment would contribute to significant
revenue loss. The total size of the databases is estimated to be
approximately 300 GB.
The characteristics mentioned previously provide enough information
for administrators to begin a DPS-based policy definition. To do
so, we must first review our policy decision points and determine
how relevant each point is to the data that must be backed
up. Because each decision point is outlined in the policy detail
form, it is easy for administrators to simply input the values they
believe to be valid for a specific decision point. The form shown
in Figure 1 depicts the decision point values associated with each
point in the DPS implementation. These values have also been used
with the decision point policy tables to determine frequency, retention,
run length, and window.
Now that the policy detail form has been completed, we can use
these values to determine the resource requirements necessary for
the policy. To do this, we use a set of resource equations that
are provided below:
Equation 1: Storage Utilization
Storage Utilization = (size)(growth% + 1)(frequency)(retention)
Size: Expressed in MB.
Growth%: Expected growth percentage over time, expressed in decimal notation.
Frequency: Number of jobs completed over a one-week period.
Retention: Time period for which jobs will remain on tape, expressed in days.
Equation 2: Throughput Requirement
Throughput = size / (runLength * 3600)(1 - growth%)
Size: Expressed in MB.
runLength: Expected job completion time, expressed in hours.
Growth%: Expected growth percentage over time, expressed in decimal notation.
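
Expressed as code, the two equations translate directly into the
following sketch; note that the grouping of the (1 - growth%) term in
Equation 2 is interpreted here as part of the denominator, which is an
assumption.

# The two resource equations expressed as functions, using the units
# defined above: size in MB, growth as a decimal, frequency in jobs per
# week, retention in days, and run length in hours.

def storage_utilization(size_mb, growth, frequency, retention_days):
    # Equation 1: (size)(growth% + 1)(frequency)(retention)
    return size_mb * (growth + 1) * frequency * retention_days

def throughput_requirement(size_mb, run_length_hours, growth):
    # Equation 2, with (1 - growth%) taken as part of the denominator
    # (an assumption); the result is in MB per second.
    return size_mb / ((run_length_hours * 3600) * (1 - growth))
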
Using the resource equations and the values provided from the
policy detail form, we can conclude that the policy requires a total
storage capacity of 8,601,600 MB, or roughly 8 TB. We can also determine
that an average throughput of 42.67 MBps must be achieved to back
up the entire policy within the allocated run length. Therefore,
all data paths from the client to the storage devices must support
an aggregate throughput greater than 42.67 MBps.
The resource equations also allow us to determine the number of
tapes or storage devices required to support the backup policy.
If the capacity of each storage device is 400 GB compressed, the
administrator can easily calculate that a total of 21 tapes will
be required for this backup policy. These values can then be translated
into fixed costs associated with the hardware required to support
this policy.
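
As a worked check of these figures, the sketch below plugs in inputs
that are consistent with the stated totals; the frequency, retention,
run length, and growth values are assumptions for illustration, since
the actual figures come from the policy detail form in Figure 1.

# Worked check of the scenario 1 figures. The frequency, retention, run
# length, and growth inputs are assumptions chosen to be consistent with
# the stated totals; the actual values come from Figure 1.
import math

size_mb = 300 * 1024              # 300 GB of databases, in MB
growth = 0.0                      # assumed: no growth
frequency = 2                     # assumed: jobs per week
retention_days = 14               # assumed: 2-week retention
run_length_hours = 2              # assumed: allocated run length

storage_mb = size_mb * (growth + 1) * frequency * retention_days
throughput_mbps = size_mb / ((run_length_hours * 3600) * (1 - growth))
tapes = math.ceil(storage_mb / (400 * 1024))   # 400 GB compressed per tape

print(int(storage_mb))             # 8601600 MB, roughly 8 TB
print(round(throughput_mbps, 2))   # 42.67 MBps
print(tapes)                       # 21
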
Scenario 2: Test/QA Database
The second scenario will focus on a similar configuration as scenario
1; however, it will be implemented in a test or QA environment.
The databases will be snapshots of production and, therefore, they
are identical in size. They will also be used for Web form information,
application processing, and reporting features. However, there will
be no revenue directly attributable to these databases. Moving the
system to a different stage of the product lifecycle drastically
alters the decision point values. Through the policy detail
form and subsequent resource equation values, we will show how system
properties affect decision point values and ultimately change the
policy details of a backup policy.
Based on the values provided in the policy detail form (Figure
2), the category totals are dramatically lower than those of the production
environment. Simply moving the databases from a production to a test
environment significantly decreases the value of the data.
This causes a reduction in the decision point values and ultimately
the category totals. The decrease results in a distinct set of policy
details for the data set reviewed. Looking further at the resource
equations, we can determine that this scenario only requires a storage
capacity of 300 GB and a throughput capacity of 21.33 MBps. These
values translate into fewer resources and a reduced cost to support
the backup policy.
The examples provided illustrate how unique environments and varying
requirements can use a standard set of tools to define policies
within a backup implementation. These tools can be applied to existing
and future policies within your environment and offer a number of
advantages for administrators and IT managers.
Conclusion
By creating a simple and systematic approach to backup design
and implementation, we answer a number of questions that administrators
often face. The DPS creates a level of accountability that may be
referenced for historical purposes, trend analysis, scalability
concerns, and resource utilization. It gives IT managers the ability
to understand their backup environment without being required to
know the technical intricacies of the system. It is a standards-based,
visible representation of backup design and implementation.
Ian Mahaney began his professional career in 1996 with the
Department of Defense. In 2000, he joined Advertising.com as a senior
systems engineer, where he served until 2005. Recently, he accepted
the role of Director of Technical Operations for a software development
company in Stamford, Connecticut. Ian attended Western State College
of Colorado and Towson University and is also a member of Phi Theta
Kappa and Golden Key National Honor Society. For questions regarding
this article, he may be contacted at: imahaney@bidbrain.com.