Preparing for Disaster: Everything You Need To Know About DR Tests

June 28, 2013 By David

Grazed from Xtium.  Author: Shawn Fichter.

Stress, hassle, time. These are all words typically associated with disaster recovery tests. In the 15 years I’ve spent working in the business continuity space, I’ve seen DR tests become a burden. However, if done correctly and properly documented, they don’t have to be. So what do you need to do to ensure you’re prepared for the inevitable? Here are the common questions I’ve heard from customers about DR tests over the years.

What is a disaster recovery test?

A DR test is a process to evaluate the procedures that an organization has in place for ensuring business continuity in the event of a major service interruption or catastrophic event. Tests allow service providers and their customers to ensure the integrity of protected data as well as execute and refine processes for making that data available in the event of a declared disaster…


Why is disaster recovery testing important to my organization?

Disaster recovery testing is important because it lets you assess and correct your procedures before an actual disaster forces your organization to use them. When performed on a regular basis, DR tests are a critical component of a sound business continuity strategy.

How often should I do a disaster recovery test?

Typically, DR tests are performed once per year, but some organizations require multiple tests each year. Xtium sees this in highly regulated industries such as financial services, and in other businesses where the frequency of testing is mandated by regulatory compliance guidelines.

I recommend testing after any major event throughout the year, such as changing office locations, opening a new office, major architectural or infrastructure upgrades, or mergers and acquisitions. 

When do you do disaster recovery tests?

DR tests are performed around your organization’s schedule. This is to ensure that the appropriate personnel are available to participate in the test and assist in creating a DR run book.  

I typically see a spike in testing activity in Q3 of each year as companies prepare for end-of-year audits, a DR test being one of those audit requirements. I would consider annual testing a starting point for your organization. As I mentioned above, keeping your DR plan up to date as your organization changes is the only way to truly know you are ready to handle unforeseen disruptions.

Why is a run book critical to a DR test?

A disaster recovery run book is a working document unique to every organization that outlines the steps the company needs to take to recover from a disaster. Run books, or updates to run books, are the outputs of every DR test. 

A run book includes a prioritized sequence of events for the successful recovery of business services, applications and infrastructure components. Without this prioritization, recovery efforts can actually stall the resumption of business services to your customer base. I typically see the reestablishment of corporate communications as the first step in recovery efforts. Email leads the way most often, allowing employees and customers to communicate effectively throughout the remaining recovery effort. Once communications have been established, mission-critical services (often ranked by the revenue they generate) are next in line for recovery.

Run books also contain a list of critical contacts that can be reached in the event of a disaster. These include the IT department heads, leads for specific applications, administrators of particular systems, and the managed service provider, among others. Knowing who to call for what, especially during a disaster event, is critical to your organization reestablishing itself. 
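To make the idea concrete, here is a minimal sketch of how a run book's prioritized recovery sequence and contact list could be represented. The class names, the numeric priority scale, and the sample services and contacts are all assumptions made for illustration; a real run book is a working document specific to your organization.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryItem:
    name: str       # business service, application, or infrastructure component
    priority: int   # hypothetical scale: 1 is recovered first
    contact: str    # who to call for this item during a disaster


@dataclass
class RunBook:
    items: list[RecoveryItem] = field(default_factory=list)

    def recovery_order(self) -> list[RecoveryItem]:
        # Lower priority numbers are restored first. sorted() is stable,
        # so items with equal priority keep the order they were listed in.
        return sorted(self.items, key=lambda item: item.priority)


# Illustrative entries only: communications first, then revenue-generating services.
run_book = RunBook([
    RecoveryItem("Order processing", priority=2, contact="Application lead"),
    RecoveryItem("File archive", priority=3, contact="Systems administrator"),
    RecoveryItem("Email", priority=1, contact="IT department head"),
])

for step, item in enumerate(run_book.recovery_order(), start=1):
    print(f"{step}. {item.name} (call: {item.contact})")
```

Keeping the contact next to each item means the person running the recovery does not have to look up who owns a system while the clock is running.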

What kind of downtime should I expect during a disaster recovery test?

None. Tests are performed in an isolated environment that runs parallel to production. The two systems don’t cross-communicate, so production systems won’t be affected during a test. I strongly advise against actively failing over production systems for testing purposes due to the risk of compromising data while in a test state.

How should I prepare for a disaster recovery test?

First things first, establish your DR team. The team should include the head of the technology department (e.g. CIO, CTO or Director of IT), leads and/or power users of specific applications (e.g. email and business apps), administrators of particular systems, and any third-party IT providers you may use.

Identify which infrastructure components, applications and business services this test will include. And don’t forget to set success and failure criteria. Remember, a test is meant to show you where the plan has holes, so don’t be afraid to have the test fail!

Walk me through the steps of a typical disaster recovery test.

Here are some high-level steps that should happen in any testing scenario. Each test will be a little different, so we’ll keep this at a “framework” level. Make sure that each of your tests contains these steps, and that the supporting activities are reviewed once the test is completed.

Steps include:

1.     Declare a mock “disaster.”

2.     Execute a communication plan.

3.     Identify order of restoration of failed systems.

4.     Execute run book restoration procedures.

5.     Validate recovery and data integrity.

6.     Communicate results.

7.     Conclude test and return to normal operating procedures.
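The framework above can be sketched as an ordered checklist that records the outcome and duration of every step, so the supporting activities can be reviewed once the test is completed. This is only an illustration: the `run_test` function and the `perform` callback are hypothetical names, and the callback here is a stand-in that pretends the validation step fails.

```python
import time

# The seven framework steps from the list above, in order.
TEST_STEPS = [
    "Declare a mock disaster",
    "Execute a communication plan",
    "Identify order of restoration of failed systems",
    "Execute run book restoration procedures",
    "Validate recovery and data integrity",
    "Communicate results",
    "Conclude test and return to normal operating procedures",
]


def run_test(steps, perform):
    """Run every step in order and log its outcome and duration.

    `perform` is a hypothetical callback that carries out one step and
    returns True on success. Failed steps are recorded rather than
    hidden, because a failure shows where the plan has holes.
    """
    log = []
    for number, step in enumerate(steps, start=1):
        started = time.monotonic()
        ok = perform(step)
        log.append({
            "step": number,
            "name": step,
            "ok": ok,
            "seconds": time.monotonic() - started,
        })
    return log


# Stand-in callback: every step succeeds except validation.
log = run_test(TEST_STEPS, lambda step: not step.startswith("Validate"))
for entry in log:
    print(entry["step"], entry["name"], "OK" if entry["ok"] else "FAILED")
```

The per-step timings are also useful input when you later compare the test against your recovery time targets.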

Can I cheat on my DR test?

While it would certainly be possible to cheat on a test, why would you want to? Going through your stated recovery procedures on a regular basis benefits both your company and your provider. Neither party has an incentive to cheat on a DR test because, in the event of a true disaster, you’ll want your tests to have vetted and documented all of the correct steps and procedures for recovery.

What are RPO and RTO and why do they matter?

RPO (Recovery Point Objective) is the point in time to which your data can be restored; in other words, the maximum amount of recent data, measured in time, that you can afford to lose. I see typical RPOs range from 1 hour to 24 hours, depending on the sensitivity of the data and the business service that data supports. The closer to real time the service, the lower the RPO, and vice versa.

RTO (Recovery Time Objective) is the target amount of time for recovering your systems in the event of a disaster, that is, the amount of downtime that is acceptable before there is significant impact to business operations. The same principle that applies to RPO also applies to RTO targets: more critical systems have lower RTOs, while less critical systems have higher ones.

Both RTO and RPO are affected by the amount of data that is protected or replicated and by the change rate of that data. When change rates are lower, RPO can be a lot shorter. The greater the rate of change, the longer each backup takes to run, and the longer the RPO you can realistically achieve.
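As a worked example, the RPO and RTO a test actually achieved can be computed from three timestamps and compared with the targets. All of the values below are hypothetical, as are the variable names; the sketch assumes the clock for downtime starts when the disaster is declared.

```python
from datetime import datetime, timedelta

# Hypothetical targets for a single system.
rpo_target = timedelta(hours=4)
rto_target = timedelta(hours=8)

# Hypothetical timestamps recorded during a test.
last_recovery_point = datetime(2013, 6, 28, 6, 0)   # last good backup or replica
disaster_declared = datetime(2013, 6, 28, 9, 0)
service_restored = datetime(2013, 6, 28, 15, 30)

# Data written after the last recovery point is lost,
# so this gap is the RPO that was actually achieved.
achieved_rpo = disaster_declared - last_recovery_point

# Downtime runs from the declaration until the service is usable again.
achieved_rto = service_restored - disaster_declared

print("Achieved RPO:", achieved_rpo, "| meets target:", achieved_rpo <= rpo_target)
print("Achieved RTO:", achieved_rto, "| meets target:", achieved_rto <= rto_target)
```

With these sample values the achieved RPO is 3 hours and the achieved RTO is 6 hours 30 minutes, so both targets are met.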

How do you know if a disaster recovery test was a success?

While each organization can and should set its own success criteria, there are a few constants that I’ve seen across all testing scenarios. The success of a DR test is contingent on three things:

1.     Meeting your assigned RTOs and RPOs for each system, application, or component.

2.     Successfully restoring your data and system connectivity.

3.     Processing business transactions in a DR / test environment.
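The three criteria above can be expressed as a simple per-system check. This is a sketch under stated assumptions: the `SystemResult` fields, the `failed_criteria` function, and the sample systems and figures are invented for illustration, and your own success criteria may include more than these three.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SystemResult:
    name: str
    rto_target: timedelta
    rto_achieved: timedelta
    rpo_target: timedelta
    rpo_achieved: timedelta
    data_restored: bool            # data and system connectivity came back
    transactions_processed: bool   # a business transaction ran in the DR / test environment


def failed_criteria(result: SystemResult) -> list[str]:
    """Return the criteria this system missed; an empty list means success."""
    failures = []
    if result.rto_achieved > result.rto_target:
        failures.append("RTO missed")
    if result.rpo_achieved > result.rpo_target:
        failures.append("RPO missed")
    if not result.data_restored:
        failures.append("data or connectivity not restored")
    if not result.transactions_processed:
        failures.append("transactions not processed")
    return failures


# Hypothetical results for two systems.
results = [
    SystemResult("Email", timedelta(hours=4), timedelta(hours=2),
                 timedelta(hours=1), timedelta(minutes=30), True, True),
    SystemResult("ERP", timedelta(hours=8), timedelta(hours=10),
                 timedelta(hours=24), timedelta(hours=12), True, False),
]

for result in results:
    failures = failed_criteria(result)
    print(result.name, "PASS" if not failures else f"FAIL: {', '.join(failures)}")

# The test as a whole succeeds only if every system meets every criterion.
test_passed = all(not failed_criteria(r) for r in results)
```

Listing the specific criteria that were missed, rather than a single pass or fail flag, gives you the known issues to record after the test.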

What do you do with the results of a disaster recovery test?

The results of the test go into the organization’s DR run book, along with a list of any known issues or workarounds to be applied to the systems upon restoration. The run books are the go-to documents in the event of a major service interruption or catastrophic event, and regular DR tests keep your organization well prepared.

Conclusion:

Creating a disaster recovery plan is important, but you cannot stop there. Testing those DR plans is the critical success factor for your organization; DR plans cannot be “shelf ware.” Don’t be afraid to let your plans fail. A regularly executed and updated DR plan will help you rest assured that your data, applications, and business systems are truly ready for anything.

##########
About the Author:

Shawn Fichter is a data center veteran who has been working in the DR and business continuity space for over 15 years. He has developed, managed and executed DR tests in both a customer and provider role. Over the course of his career, Shawn has been involved in managing infrastructures through the transition from client-server to utility computing to cloud-based service models. He now leads the development of next-generation cloud-based products and services for Xtium.