Disaster Recovery Testing: An MSP’s Most Valuable Offering


You can plan, look at diagrams, and listen to experts, but without testing you still won’t know whether your clients’ recovery capabilities will actually work. Why? Because recovery is complicated. So many things can go wrong: network configurations do not get replicated properly, application dependencies fall out of sync across sites, disaster recovery (DR) resources become insufficient over time, and so on. And the worst time to identify an issue is during an actual emergency.

The cost of failure is high. Not only will you lose customers, but failing to recover or deflect attacks also results in client application downtime. While the cost of application downtime varies dramatically across industries, the average for all organizations is about $9,000 per minute in decreased productivity, lost revenue, and other negative business impacts, according to the Ponemon Institute.

While providing 99.9% uptime is great, improving that to 99.99% saves the average organization about $350,000 per month. This level of uptime is possible only if data protection and recovery solutions are tested regularly so that issues that may slow recovery are identified. But how often should you test your customers’ DR, and what tools should you use?
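The savings figure can be checked with simple arithmetic. The sketch below assumes a 30-day month and applies the $9,000-per-minute average cited above; the function and variable names are illustrative only.

```python
# Worked arithmetic for the uptime comparison.
# Assumptions: a 30-day month and the cited average of $9,000 per minute.
COST_PER_MINUTE = 9_000
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def monthly_downtime_cost(uptime_percent: float) -> float:
    """Return the expected monthly downtime cost for a given uptime percentage."""
    downtime_minutes = MINUTES_PER_MONTH * (100 - uptime_percent) / 100
    return downtime_minutes * COST_PER_MINUTE


three_nines = monthly_downtime_cost(99.9)   # roughly 43 minutes of downtime
four_nines = monthly_downtime_cost(99.99)   # roughly 4 minutes of downtime
savings = three_nines - four_nines

print(f"99.9% uptime:  ${three_nines:,.0f} per month")
print(f"99.99% uptime: ${four_nines:,.0f} per month")
print(f"Difference:    ${savings:,.0f} per month")
```

Under these assumptions the difference comes to just under $350,000 per month, which matches the figure quoted above.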

Average Cost of Downtime Per Month

Current State of Testing

In Unitrends’ annual survey of backup and recovery practices, we asked over 800 IT professionals how often they test their recovery capabilities. These responses are from organizations not using an MSP and demonstrate how your services can fill their voids.

The majority of responding organizations reported testing their DR plans only once per year, less often, or not at all. Many changes occur in IT infrastructure over the course of a year. In addition, most industry governing bodies require companies to have business continuity plans, to test them, and to document the results. With the pressure to keep businesses up and running, why do end users test so infrequently?

Why Test?

Let’s step back a second and make sure everyone agrees that regularly testing recovery is the right thing to do. What does testing provide?

Client retention

IT downtime is one of the leading causes of customer dissatisfaction that can cause them to look for a new MSP. Other issues with an MSP can be resolved, but losing revenue from preventable downtime is hard to forgive.


Documenting and maintaining a recovery plan is now an essential task for IT service providers. You need to prove to your clients that you have a verifiable process in place to recover their business services. You need to give clients confidence that their business can run in an emergency and documentation that proves it.

The only way you know you're not wasting time and money on an expensive recovery plan is to test it regularly and see the results. Period.

Identification of recovery issues

A good testing process will identify issues that impact recovery. These can then be addressed prior to a real emergency.

Compliance reporting

Most companies have industry mandates that require protection against loss of data and functionality. Health care organizations, for example, have HIPAA mandates that require not only DR protection, but also that recovery technologies are tested regularly so auditors can see the results. You need to enable your clients to meet their compliance requirements.

Job security

Failing to test the recovery capability of your clients leaves your business exposed to disasters that can cause them to leave your services and put your business at risk.

IT professionals aren’t stupid. If testing is so important, why don’t MSPs test more often? The reason is that testing can be costly and difficult, and it risks interrupting critical business processes.

Excuses Organizations Use for Not Testing More Frequently

When polled, most organizations cite several reasons why they don’t test more often:

Testing takes time

Technicians are already overwhelmed with the day-to-day tasks of managing their clients’ complex and extensive IT infrastructures.

Testing can disrupt production systems

Many test procedures require that you bring down production to validate the test. If traditional restore processes are used, this downtime can be very disruptive, so much so that MSPs feel the need to cut corners in the process, which leads to false positives, high risk, and wasted time. In addition, customer infrastructures can vary greatly, making many of the testing procedures “one-offs.”

Testing can cost money

Outside providers are playing an increasing role in providing disaster recovery services. MSP engineering resources would be better used dealing with paying tickets rather than manual testing.

Undocumented DR plan

MSPs sell DR services, but it is up to the client to document the plan well enough to be useful in a downtime event. While most end-user organizations will claim to have a documented DR plan, most plans are not regularly updated, depend on a few key individuals, live in a spreadsheet or Word document, and are difficult to locate. Testing will more than likely create the need to update the plan, requiring more time be spent on something everyone hopes will never be used.

Levels of Testing

There are many different ways MSPs can test data backup and disaster recovery capabilities. Some are very basic and check only that data has been replicated to the right location. Few test functionality to ensure clients can actually perform their jobs. Vendors of backup and recovery technology may support only lower levels of testing, yet claim they have full testing capabilities, so research is required to determine exactly what they support. Examples of the different levels (basic to advanced) are:

Data verification

This test just checks that blocks or files are good after they have been backed up. Needless to say, this level of testing does nothing to ensure the applications can be functionally recovered.

Database mounting

Verifies a database has basic functionality within backups.

Single machine boot verification

Verifies that a single server can be rebooted after a downtime event.

Single machine boot with screenshot verification

Goes a little past boot verification by sending an image of the operating system splash screen to administrators as proof the system can be recovered.

DR runbook testing

Multiple machines are spun up for testing. This is especially important for multiple servers that deliver a business service together, such as an ERP system or clustered databases.

Recovery Assurance

This is the highest level of testing, as it includes multiple machines, deep application testing, SLA assessment, and analytics on the reason any recovery failed. Anything other than Recovery Assurance still leaves questions about a full recovery. For example, booting to the splash screen may indicate that the operating system can boot, but it proves nothing about its functionality and responsiveness to the business.

Your clients believe they have outsourced disaster recovery to your business. MSPs need to include testing as part of their offering, with the best MSPs providing full Recovery Assurance testing.

Disaster Recovery-as-a-Service providers embed testing as part of their offering, with the best performing full Recovery Assurance testing.

Those Old Excuses for Not Testing are No Longer Valid

Fortunately, there are cost-effective, intelligent technologies available that can automate, orchestrate and analyze application recoverability to ensure entire workloads are functional and, if not, report what is broken. Additionally, you and your clients can get an easy-to-read formal report certifying the final results of a DR test that can be shared with auditors and the client’s senior management. These tools automate testing so you know exactly how fast and to what point your client’s data and applications are protected without requiring much manual work or extra expense on your part. The excuses for not testing are no longer valid:

Testing takes time

No more: testing is fully automated. The results are emailed to you and/or your clients for a simple review to see that everything was successful.

Testing can disrupt production systems

No more: testing can be conducted within isolated labs that shield production applications from network conflicts. Tests can be run in alternate locations, such as cloud infrastructure. Additionally, they often do not even require additional storage, given that a test can be run against your backups.

Testing can cost money

Testing capabilities are now included on advanced backup and recovery appliances and cloud services. Having a standardized, repeatable, easily replicated process you can perform across your client list will save large amounts of money and time.

Undocumented DR plan

DR specialists can help develop your DR plan, and new web tools can keep it readily available to everyone on the team. In addition, Recovery Assurance can document its recovery steps as part of its runbook setup.

Also, all of these past excuses are why your clients have chosen to use an MSP. They want you to do all recovery testing and take on the business risk of slow recoveries. This can be a significant revenue stream — if you have the tools to test efficiently.

Next-Generation Technology

These next-generation technologies can automate testing and reporting to give you and your clients 100% confidence that recovery can and will take place as required.

Recovery Assurance

Recovery Assurance delivers fully automated recovery testing. Running either locally on the backup appliance or in the cloud, Recovery Assurance will automatically test and certify full business service recovery. Using backups, the entire infrastructure is recreated and booted up to ensure that all data and application dependencies are correct.

Recovery Assurance can be directed to use as many of 50+ built-in tests that are appropriate for your environment. These can include:

  • Running and verifying a database query
  • Mailbox transport submissions on Exchange
  • Validating service availability
  • Building and running your own custom scripts to test unique aspects of your workloads
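To make the kinds of checks listed above more concrete, here is a minimal sketch of what a custom test script might look like. It is not Unitrends code and does not use any Recovery Assurance API; the function names are hypothetical, and an in-memory SQLite database stands in for a recovered application database.

```python
import socket
import sqlite3


def database_query_ok(conn: sqlite3.Connection, query: str, min_rows: int = 1) -> bool:
    """Pass if the query runs and returns at least min_rows rows."""
    try:
        rows = conn.execute(query).fetchall()
    except sqlite3.Error:
        return False
    return len(rows) >= min_rows


def service_available(host: str, port: int, timeout: float = 3.0) -> bool:
    """Pass if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo: an in-memory SQLite database stands in for a recovered database.
demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
demo_db.execute("INSERT INTO orders (total) VALUES (19.99)")

results = {
    "orders table readable": database_query_ok(demo_db, "SELECT id FROM orders"),
    "missing table detected": not database_query_ok(demo_db, "SELECT id FROM invoices"),
}
for check, passed in results.items():
    print(f"{check}: {'PASS' if passed else 'FAIL'}")
```

In a real test lab, `service_available` would be pointed at the recovered machine's address and service port, for example a web server on port 443. A successful TCP connection shows only that something is listening, so it complements a functional check such as the query test and does not replace it.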

Unitrends MSP Recovery Assurance verifies the success of each point in time that an application is protected. It includes built-in analytics that assess the impact of an outage in terms of its projected downtime and data loss. It also reports the results of recovery testing to business stakeholders (no setup required) in the form of actual RTO and RPO achieved and flags warnings against their goals.
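As an illustration of how achieved RTO and RPO can be compared against goals, the sketch below derives both values from three timestamps recorded during a test. The data structure, field names, and sample goals are assumptions made for this example and do not describe how Recovery Assurance calculates its reports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTestResult:
    last_backup_at: datetime       # recovery point used for the test
    outage_at: datetime            # simulated moment of failure
    service_restored_at: datetime  # when the application passed its checks


def assess(result: RecoveryTestResult, rto_goal: timedelta, rpo_goal: timedelta) -> dict:
    """Compute achieved RTO/RPO and list a warning for each missed goal."""
    actual_rto = result.service_restored_at - result.outage_at  # time to recover
    actual_rpo = result.outage_at - result.last_backup_at       # window of lost data
    warnings = []
    if actual_rto > rto_goal:
        warnings.append(f"RTO missed: {actual_rto} achieved vs. goal of {rto_goal}")
    if actual_rpo > rpo_goal:
        warnings.append(f"RPO missed: {actual_rpo} achieved vs. goal of {rpo_goal}")
    return {"actual_rto": actual_rto, "actual_rpo": actual_rpo, "warnings": warnings}


# Hypothetical test run: backup at 02:00, outage at 09:30, service back at 10:15.
sample = RecoveryTestResult(
    last_backup_at=datetime(2024, 1, 15, 2, 0),
    outage_at=datetime(2024, 1, 15, 9, 30),
    service_restored_at=datetime(2024, 1, 15, 10, 15),
)
report = assess(sample, rto_goal=timedelta(hours=1), rpo_goal=timedelta(hours=4))
print("Actual RTO:", report["actual_rto"])
print("Actual RPO:", report["actual_rpo"])
for warning in report["warnings"]:
    print("WARNING:", warning)
```

In this sample the recovery time meets its one-hour goal, but the recovery point is older than the four-hour goal, so a single warning is raised.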

Copy Data Management

Backups are not just for recovery anymore. There are many corporate benefits that can be derived from backup files. Copy Data Management (CDM) is a concept that, on command, spins up test/dev environments that are identical to the production servers because they use the latest backups. The technology makes your clients’ latest data, applications, and lab environments available instantly for testing purposes. MSPs are responsible if a software patch causes issues with production applications, potentially costing many hours of engineering time and customer satisfaction. MSPs can use these sandboxes to identify issues with new software by testing it prior to deployment on production servers. Other uses include compliance testing, reporting, and any other purpose that requires fast, temporary access to cloned environments without over-utilizing production resources. Once all testing is finished, the entire test environment can easily be torn down.

BCDR (Business Continuity & Disaster Recovery) Link

A well-documented DR plan is critical to the testing process. MSPs have typically created DR plans in Excel or Word that are then filed away and rarely dusted off. These can be hard to manage and store, and they offer limited access to those who need them. BCDR Link (https://bcdrlink.com) is a free online tool that helps you build and customize a DR plan for your customers. The template follows the most up-to-date guidelines of International Organization for Standardization (ISO) standard 22301, which specifies security requirements for DR preparedness and business continuity management systems (BCMS), and includes all steps necessary for a comprehensive recovery plan. The advantage of online access is that everyone knows where the plan is stored, and the author can control who has access.

MSPs can outsource to DR professionals

MSPs can contract with Disaster Recovery as a Service (DRaaS) professionals, thus passing the benefits of their amped-up capabilities back to their clients. DRaaS has greatly evolved from its first iterations. World-class DRaaS providers now offer “White Glove” services that can free MSPs from having to learn, manage and deploy recoveries. DRaaS White Glove providers will do complete DR planning, including setting up entire runbook sequences so business-critical applications are the first to recover.

Recovery is initiated by a simple phone call with the service provider doing all the work. The best part is that DRaaS White Glove providers offer both 1-hour and 24-hour service-level agreements (SLAs) for application recovery with financial recourse for any delays. This high-touch version of DRaaS can be managed and deployed from any location and protect remote sites around the world.

How Often Should You Test Client Recoveries?

If you have the latest technology, how often should you test recovery? A quick Google search for “How often should you test your DR” shows most vendors and analysts won’t give detailed advice. They mostly agree, however, that organizations of all stripes don’t test enough, but even that advice assumes that there is a definition of “enough.” Gartner, in its report “Modify Your Backup / Recovery Plan to Improve Data Management and Reduce Cost” (February 2017), advises: “Perform data recovery testing at least once a year on a subset of data to ensure the backup strategy effectively meets the stated SLA projections.” This advice reflects the old world of testing and not the new reality.

The real answer is ‘It depends’

The real answer is the amount of testing should match the critical nature of the system. Prioritize client testing based on the criticality of business services and work your way back to the infrastructure that supports them. If clients have machines that require high responsiveness, validate this more often in your testing. Don’t assume that performance is sufficient. Leverage technologies such as replicas vs. backups for highly transactional machines, and include performance scripts to validate responsiveness.

Testing is far easier and more cost-effective than ever. It takes some resources for spin-up, but in many cases those are already available or can be supplemented by using backup storage with modern technologies. You should test your clients’ ability to recover as frequently as you can based on your available resources and their RPOs. Remember, every untested recovery is a risk to their RPOs and can increase the amount of data lost in an outage because you weren’t able to verify that a recovery point was successful.

MSPs using next-generation technology can frequently test their entire client infrastructure to greatly reduce the risk of recovery issues. If this is the case, you should be testing automatically at least monthly and on-demand after changes are made to their infrastructure! Remember, client IT infrastructures are not static environments. Adding new applications, virtualizing new servers, upgrading software, and moving assets to the cloud can break elements of client DR plans, so retest after infrastructure changes are made. If changes are made frequently, you want frequent testing. If changes are not made frequently and data change rates are not high, you can potentially provide less-frequent testing.
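The guidance above can be expressed as a simple scheduling rule: retest whenever infrastructure has changed since the last test, and otherwise test on an interval tied to criticality. The sketch below is one possible reading. Only the "at least monthly" ceiling and the retest-after-change rule come from this article; the tier names and the shorter intervals are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical intervals per criticality tier. Only the 30-day ceiling
# ("at least monthly") is taken from the article; the rest are assumptions.
TEST_INTERVAL_DAYS = {"critical": 7, "important": 14, "standard": 30}


@dataclass
class ClientSystem:
    name: str
    criticality: str    # one of the keys in TEST_INTERVAL_DAYS
    last_tested: date
    last_changed: date  # last infrastructure change (patch, migration, new app)


def is_test_due(system: ClientSystem, today: date) -> bool:
    """Due if changed since the last test, or if the tier's interval has elapsed."""
    if system.last_changed > system.last_tested:
        return True  # on-demand retest after an infrastructure change
    interval = timedelta(days=TEST_INTERVAL_DAYS[system.criticality])
    return today - system.last_tested >= interval


today = date(2024, 6, 30)
systems = [
    ClientSystem("ERP", "critical", date(2024, 6, 25), date(2024, 6, 1)),
    ClientSystem("File server", "standard", date(2024, 5, 20), date(2024, 4, 1)),
    ClientSystem("Mail", "important", date(2024, 6, 28), date(2024, 6, 29)),
]
due = [s.name for s in systems if is_test_due(s, today)]
print("Recovery tests due:", due)
```

Here the file server is due because more than 30 days have passed since its last test, and the mail system is due because it changed after its last test, even though that test was recent.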

Testing is so important that you should include this metric in any DR solution offerings to prospects. To ensure faster and easier recovery testing, MSPs should choose a solution with automated, free Recovery Assurance. Or better yet, outsource your clients’ DR programs to certified experts that will guarantee your performance with SLAs.