The recent technical problems with the Delta Air Lines network got me thinking about the value of business continuity planning. We teach an AIM short course dedicated to business continuity and disaster recovery planning and stress the importance of thinking through all potential scenarios. Consider this a friendly reminder to update and test your plan to make sure it is still valid. Has anything changed since your last test, and could that change halt your business? What is the worst-case scenario and how will you deal with it?
Delta
Delta is just the latest example of a sophisticated network of hardware and applications that failed and caused disruption to a business. In the case of Delta, a power control module failed in their technology command center in Atlanta. The uninterruptible power supply (UPS) kicked in, but not before some applications went offline. The real trouble began when the applications came back up, but not in the right sequence. Consider application A, which requires data from a database to process information that it then sends to application B. If application B comes up before application A, it will be looking for input that does not exist and will go into fault mode. In the same vein, if application A comes up before the database is online, it will be looking for data that does not yet exist and will fault.
Any of these scenarios will affect business operations such as ticketing, reservation and flight scheduling processes. Once flights are canceled due to lack of valid information, the crew in San Francisco cannot get to Atlanta to start work, and even more flights are canceled or delayed. In this case, it took four days before flights were fully restored. That is a lot of lost revenue and goodwill just because one power control module failed in a data center.
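The sequencing problem described above is a dependency ordering problem, and a short sketch may make it concrete. The example below is illustrative only: the service names and the dependency map are hypothetical stand-ins for the database, application A and application B in the scenario, and nothing here reflects how Delta's systems are actually configured. It uses Python's standard `graphlib` module (Python 3.9 or later) to compute a start order in which every service comes up after the things it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be running first.
DEPENDENCIES = {
    "database": [],
    "application_a": ["database"],       # A reads from the database
    "application_b": ["application_a"],  # B consumes A's output
}


def startup_order(dependencies):
    """Return a start order in which every service follows its prerequisites.

    A circular dependency raises graphlib.CycleError (a ValueError subclass).
    """
    return list(TopologicalSorter(dependencies).static_order())


print(startup_order(DEPENDENCIES))
# prints ['database', 'application_a', 'application_b']
```

A map like this is only as good as the inventory behind it, which is one reason the mapping step in the next section matters.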
Disaster Recovery Planning
Information systems and networks are complex and getting more so all the time. To develop a plan that covers a potential interruption, consider the following steps:
- Map out your environment. Understand what systems you have, their operating systems, how they are dependent on each other, and how they are connected to each other via the network. Is it critical that all these elements come up in sequence? This map will be crucial in the event you need to rebuild your systems after a disaster.
- Understand risks and create a plan. Understand your risk for each system and application. A small application that only runs once a month may not need attention, whereas a customer order fulfillment application that runs 24/7 should be able to fail over without interruption. Create a plan to keep the environment running or to restore it quickly.
- Test the plan. This may be the most important part of the process. Testing the plan on a regular basis ensures that you have accounted for any changes to the environment and that everyone knows their part in the event of a problem. Periodic testing also keeps the plan active so that it does not become “shelfware.”
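The steps above can be sketched as a small inventory check. This is a minimal illustration under stated assumptions, not a recommended policy: the system names, the two risk tiers, and the test intervals (90 and 365 days) are all hypothetical values chosen for the example. The idea is simply to record each system, classify its risk, and flag any recovery plan that has not been tested recently enough.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class System:
    name: str
    runs_continuously: bool  # e.g. a 24/7 order fulfillment application
    last_tested: date        # when its recovery plan was last exercised


# Assumed policy for illustration: maximum days between tests per risk tier.
MAX_DAYS_BETWEEN_TESTS = {"critical": 90, "routine": 365}


def tier(system):
    """Classify a system: anything that runs around the clock is critical."""
    return "critical" if system.runs_continuously else "routine"


def overdue(systems, today):
    """Return the names of systems whose plan test is past its interval."""
    return [
        s.name
        for s in systems
        if (today - s.last_tested).days > MAX_DAYS_BETWEEN_TESTS[tier(s)]
    ]


# Hypothetical inventory
INVENTORY = [
    System("order_fulfillment", True, date(2016, 1, 15)),
    System("monthly_report", False, date(2016, 1, 15)),
]

print(overdue(INVENTORY, date(2016, 8, 15)))
# prints ['order_fulfillment']
```

Both systems were last tested on the same day, but only the 24/7 application is flagged, because its assumed interval is much shorter.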
Thoughts
Businesses increasingly rely on sophisticated technology to sell products, serve customers and communicate with partners. Any break in that technology can have a real impact on revenue and the long-term viability of the business. Have you tested your business continuity plan lately?
Kelly Brown is an IT professional and assistant professor of practice for the UO Applied Information Management Master’s Degree Program. He writes about IT and business topics that keep him up at night.