The Search for Zero

No one likes dealing with an outage. Outages hurt nearly every aspect of a business, causing low employee morale, delayed productivity, weakened competitiveness, loss of profitability, and damaged reputation. It’s a common problem, and there are lots of claims in the market from many different vendors. You’ll hear everything from "Recover in minutes" and "Avoid downtime" to "Zero data loss."

Figuring out which claims are bogus and which ones make sense for YOUR operation is daunting. Each organization has unique requirements and, since budgets are limited, IT managers must do more than simply implement gold-plated protection for every application and data repository. 

What exactly is an outage?

Graphic of an outage timeline

Before you can decide what the optimum recovery time objectives (RTOs) and recovery point objectives (RPOs) are for your business, as well as select the right disaster recovery (DR) and business continuity (BC) services, you need to understand what constitutes an outage. It’s not as obvious as it sounds.

To understand the concept clearly, I break it down for you in four steps:

  1. Awareness
  2. Resolution
  3. Failover
  4. Recovery

An outage includes all four steps, and the amount of time it takes to deal with an outage includes the time required to get through these steps. Recovery is usually the shortest part of any outage, and, as a result, you’ll often hear vendors focus almost exclusively on their short recovery times. Recovery times are fine, but that is not the full story.

Even worse, failing to take the other three steps (Awareness, Resolution, Failover) into account can be very detrimental to your business. When you’re making commitments to the business about how short any potential outages will be, it pays to be realistic. Let's take a look at the first step.

Step 1: Awareness

The first stage is usually the longest phase in the whole process: figuring out that you actually have an outage. IT Support often finds out about a problem when users start calling to complain they can’t work because they can't access the system. At this point, the outage has been underway for some time and is already having a negative impact on the business.

Understanding what is really going on – is it an outage or user error, for example – can often take an hour or more. Once you’ve confirmed that you’re dealing with an outage, you can move immediately to Step 2.

Step 2: Resolution

Now you must triage the system and make some quick decisions. Is it something you can fix quickly, or do you need to fail over to backup systems?

Sometimes what looks like a system outage is highly localized. For example, a virus-ridden laptop might be causing problems for someone, or even a group of people, while seeming like a problem at the server level. This type of local problem is still an outage as far as your people are concerned; however, you can usually deal with it quickly and without affecting the rest of the organization.

It takes at least another half hour to confirm you’re looking at a system-level issue that requires a failover. Now you move on to Step 3.

Step 3: Failover

After you’ve confirmed there really is an outage and determined there is no quick fix, you have to fail over to your backup systems. Depending on how well your backup systems are equipped and configured, this can take a few seconds or over an hour.

When you fail over, you make some key assumptions in the hope of a successful recovery later. It’s worth taking the time beforehand to turn those assumptions into certainties. You must take the actions necessary to ensure that all the elements – people, equipment, communications, network connectivity, and so on – are in place and working so an unplanned failover will always result in a successful recovery.

With unplanned outages, you must ensure your data is complete and not corrupted, and that the resources are available to complete the failover process successfully. If any one of these assumptions is not true, your ability to recover in Step 4 will be compromised, if recovery is even possible at all.
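One way to turn those assumptions into something you can verify ahead of time is a simple readiness checklist. The sketch below is only an illustration: the `ReadinessCheck` type, the `failover_blockers` function, and every check result are hypothetical, and in practice each result would come from your own tests and drills rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class ReadinessCheck:
    """One assumption that must hold before a failover can succeed (hypothetical type)."""
    name: str
    ok: bool


def failover_blockers(checks):
    """Return the names of all checks that are currently failing."""
    return [check.name for check in checks if not check.ok]


# Illustrative results only; the categories mirror the elements listed above.
checks = [
    ReadinessCheck("people: on-call contacts confirmed", True),
    ReadinessCheck("equipment: backup systems ready", True),
    ReadinessCheck("communications: escalation list current", True),
    ReadinessCheck("network connectivity: backup site reachable", False),
    ReadinessCheck("data: backup copy complete and not corrupted", True),
]

blockers = failover_blockers(checks)
if blockers:
    print("Not ready to fail over:")
    for name in blockers:
        print(" -", name)
else:
    print("All failover assumptions verified.")
```

Running a check like this on a schedule, rather than during an outage, is the point: a failed item found in advance is a task, while the same item found mid-failover is a longer outage.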

Step 4: Recovery

We’re not finished! The clock is still ticking on your outage when the recovery stage starts!

At this point, you have to restore services to your production servers or site and reinstitute your normal BC/DR practices. If you don’t do it now, you’re putting your operation at severe risk. What if another outage occurs before then? This step can easily consume several hours. You want to be absolutely certain that all production systems are working as expected, and that your DR/BC systems are ready to handle the next outage.

Most solutions require at least a little downtime during the failover process to re-synchronize data and to establish user connections to the backup environment. The same process has to occur again, but in reverse, to bring your production systems back online. Essentially, every IT outage produces two potential business outages: two periods of time when employees can’t do their work. Be sure that you’re taking that fact into account in your planning, and when preparing your presentations to management.
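To see how the four steps add up, here is a rough sketch of the arithmetic. The awareness and resolution figures use the lower bounds mentioned earlier ("an hour or more" and "another half hour"); the failover and recovery figures are assumptions picked from within the ranges described, so substitute your own measurements. The function names are hypothetical.

```python
from datetime import timedelta

# Illustrative durations only; replace with figures measured in your own drills.
phases = {
    "awareness": timedelta(hours=1),      # "an hour or more"
    "resolution": timedelta(minutes=30),  # "another half hour" at minimum
    "failover": timedelta(minutes=20),    # assumed; can range from seconds to over an hour
    "recovery": timedelta(hours=3),       # assumed; "several hours"
}


def total_outage(phases):
    """Sum every phase: the clock runs from the start of the outage to full recovery."""
    return sum(phases.values(), timedelta())


def time_until_failover_complete(phases):
    """Time that passes before the backup systems have taken over."""
    return phases["awareness"] + phases["resolution"] + phases["failover"]


print("Total outage:", total_outage(phases))
print("Before backup systems take over:", time_until_failover_complete(phases))
```

With these sample figures the total comes to 4 hours 50 minutes, and 1 hour 50 minutes of that passes before the backup systems are carrying the load. A recovery-time claim measured in minutes covers only a small slice of either number.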

Conclusion

The four steps required for understanding an outage are Awareness, Resolution, Failover and Recovery. Knowing the optimum RTOs and RPOs for your business, and having the right DR/BC services in place, is required for a successful recovery from an outage.

In Part 2 of this series, I'll address the potential financial impact of an outage and the importance of finding a data partner versus a data vendor.
