As a follow-on from my previous post, "Disaster Recovery is about Business Process Recovery not IT Infrastructure Recovery", I wanted to talk about the sort of DR testing required.
Many companies are very aware of the need for disaster recovery testing and engage with suppliers to perform annual testing, but testing a set of applications without sufficient preparation can lead to an unsatisfactory annual test. Many legacy applications were not designed to be recovered in a disaster and require further adaptation before annual disaster recovery testing is performed.
I found much is written on the different types of testing, but not on how the types of testing integrate or how testing applies to projects delivering new applications. To help guide a disaster recovery testing strategy I created a template testing roadmap as shown below.
Across the top of the roadmap are listed the different forms of testing identified in different articles (although I have seen different definitions for each of the stages):
- Checklist testing is a review of all the individual recovery checklists to ensure:
  - the checklists are current
  - people are aware of their responsibilities
  - changes in staffing are identified
- Structured Walk-Through testing brings everyone together to walk through the plan as a table-top exercise to ensure the overall plan will work together against a specific scenario.
- Simulation testing is similar to a structured walk-through against a scenario, but with some interruption to non-critical business activities as component testing is performed.
- Parallel testing is actual testing of modules or a full application, including relocation of operational personnel in the same way as needed for a full disaster, but the applications are isolated so there is no interruption to the production service. Parallel testing is split into:
  - project testing – done in many cases before annual testing to ensure an application can be recovered successfully before an annual test. It can be performed where the application is new or a major change has taken place.
  - annual testing – performed on a regularly scheduled basis and including multiple business processes, but not necessarily all processes. Mature processes might be tested less frequently; for example, some applications might be tested every other year to reduce the cost of testing.
- Full-Interruption testing is where the real production applications are shut down and moved to an alternative site. This is not performed very often due to the interruption to the business.
The roadmap picture adds additional concepts:
- Component testing involves a part of an application such as directory services or email services.
- Module testing involves a collection of components that enables testing of one or more applications or business processes. As the level of testing increases, a number of modules may be combined to share the overhead of setting up the test; for annual testing that could involve a large part of a business environment.
To decide what testing is required, many factors need to be considered, such as:
- complexity of application – more rigorous testing is likely to be required as the complexity increases
- risk of component failure – if some components are more likely to fail, more rigorous testing may be required
- time critical recovery – a step in the process, such as the time to recover data, might need repeated testing to tune the timing of the process
- business criticality – the application may have a greater impact on the continuity of the business processes if it is not recovered successfully, and may require more testing
- appetite for business risk – the business may be more or less risk averse and willing to accept greater or lower levels of risk
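One way to make these factors comparable across applications is to score each one and rank the applications, so the highest scorers get extra prerequisite testing first. The sketch below is purely illustrative: the 1 to 5 scales, the equal weighting and the names `AppFactors`, `rigour_score` and `rank_for_testing` are my assumptions, not part of the roadmap.

```python
from dataclasses import dataclass


@dataclass
class AppFactors:
    """Hypothetical 1-5 ratings for one application."""
    name: str
    complexity: int            # 1 = simple, 5 = very complex
    failure_risk: int          # 1 = components unlikely to fail, 5 = likely
    time_criticality: int      # 1 = relaxed recovery time, 5 = very tight
    business_criticality: int  # 1 = low business impact, 5 = high impact


def rigour_score(app: AppFactors, risk_appetite: int) -> int:
    """Higher score suggests more rigorous testing (illustrative weighting).

    risk_appetite: 1 = very risk averse, 5 = very risk tolerant.
    """
    base = (app.complexity + app.failure_risk
            + app.time_criticality + app.business_criticality)
    # A risk-averse business (low appetite) pushes the score up.
    return base + (5 - risk_appetite)


def rank_for_testing(apps, risk_appetite):
    """Return applications ordered from most to least testing needed."""
    return sorted(apps, key=lambda a: rigour_score(a, risk_appetite),
                  reverse=True)
```

The score is only a conversation starter for deciding where project testing is needed before the annual test; the real decision still needs judgement from the business.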
The key message is that testing recovery of business processes and applications needs to be thought through to decide what level of testing needs to be performed. There is not one solution for all, and it is unlikely that an annual test alone, without additional prerequisite testing, is sufficient. Annual testing normally takes place over a short time period to demonstrate a short recovery period, so there will be insufficient time to debug each application during this test.
I have been using Jungle Disk for years now to transparently perform backups on multiple home computers. I have been happy because the backups went to Amazon S3, it had its own unique private key for backup, it stored multiple versions of files and I could remotely access data from multiple computers. The problem was that the cost was mounting with S3 charges, I was only able to back up critical data and it did not support mobile devices.
You may ask why not use a cloud drive such as Dropbox, as it makes files available with near real-time updates across devices. It is easy to share folders between people, but the free storage is limited, with shared drives costing 5-10 times what cloud backup costs. So I went on a hunt for a replacement cloud backup.
My criteria were:
- 500GB of storage so I could back up all my photos and music
- Able to support at least 5 computers at home – many cloud backup services are per computer
- File versioning so that if a file was corrupted I could roll back to a previous version
- Continuous backup at least once an hour for different file versions
- Support for mobile devices including iPhone and Android
- Support for a private key for encryption of my data
I spent many hours evaluating different services but either they were for a single computer (Carbonite), very slow (CrashPlan), extremely difficult to select what to back up (Bitcasa, SOS Backup, JustCloud, ZIPCloud, MyPCBackup, Backup Genie, Study Backup), did not support Android (BackBlaze) or just too expensive. Services that offered infinite storage were slow or had limited function.
In the end I selected iDrive as it met my requirements, plus it:
- Has a way of selecting backup policy both on the web and locally – good for managing laptops remotely
- Is VERY fast, achieving backups of 2-8GB per hour with good compression of data. Some, such as CrashPlan, were appallingly slow and would have taken a year to back up.
You can get 5GB for free or, if you are a student, 25GB for free, which is more than sufficient for critical coursework. There is a good review here.
Update: Just wanted to say I managed to back up 300GB of files over 3.5 days using BT Infinity 2 for 3 laptops and 2 Android devices. I tested recovery of data and found it about 50% faster than backup. There are even scripts to back up from Linux and Synology NAS drives.
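As a quick sanity check on those figures, the arithmetic can be sketched as below. It assumes the 3.5 days were continuous and reads "50% faster" as a rate 1.5 times the backup rate; both are my interpretation, and the function name is just for illustration.

```python
def rate_gb_per_hour(total_gb: float, days: float) -> float:
    """Average transfer rate, assuming the transfer ran around the clock."""
    return total_gb / (days * 24)


# 300GB over 3.5 days works out at roughly 3.6GB per hour,
# which sits inside the 2-8GB per hour range quoted above.
backup_rate = rate_gb_per_hour(300, 3.5)

# Assumption: "about 50% faster" means 1.5 times the backup rate.
restore_rate = backup_rate * 1.5

# Estimated hours to restore the full 300GB at that rate (about 56 hours).
restore_hours = 300 / restore_rate
```

So a full restore of everything would still take a little over two days, which is worth knowing before you need it.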
So often the basic principles of Disaster Recovery are forgotten when planning and testing. Too often the focus is put on the recovery of the network and IT infrastructure without taking into account that the underlying reason for recovery is to ensure continued operation of the business processes in the event of a disaster. If only the IT and network infrastructure is considered, the basic infrastructure will be recovered quickly but not the capability to continue operating the business processes.
To help visualise what is involved I have created a simple model showing the layers needing to be considered. It starts with the business processes, supported by an IT application, hosted on a network and IT infrastructure, managed and supported from physical locations.
At each of these layers in the model people are needed to recover and operate the ongoing service. To make this effective, strong governance with clear leadership needs to be in place to ensure rapid recovery.
The IT Architecture of the Network and IT Infrastructure can only guide how the Applications above can recover. The people and systems interfaces driving the Business Processes define the data flows and order of recovery for the infrastructure below. Only by planning and testing the business processes top-down can effective and efficient recovery be ensured in the event of a disaster.
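To illustrate the top-down idea, here is a minimal sketch that starts from a business process, walks down through what it depends on, and derives an order of recovery. The dependency map and every name in it are hypothetical examples, and using a topological sort is my choice of technique rather than anything prescribed by the model.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical map: each item lists what must be recovered before it.
DEPENDS_ON = {
    "payroll process": {"payroll application"},
    "payroll application": {"database server", "directory services"},
    "database server": {"core network"},
    "directory services": {"core network"},
    "core network": {"recovery site"},
    "recovery site": set(),
    "test lab": {"core network"},  # not needed by the payroll process
}


def recovery_order(process: str, depends_on: dict) -> list:
    """Return everything the process needs, with dependencies first."""
    # Walk down from the business process to collect what it relies on.
    needed, stack = set(), [process]
    while stack:
        item = stack.pop()
        if item not in needed:
            needed.add(item)
            stack.extend(depends_on.get(item, ()))
    # Order only the needed items so that each comes after its dependencies.
    subgraph = {item: depends_on.get(item, set()) for item in needed}
    return list(TopologicalSorter(subgraph).static_order())
```

Calling `recovery_order("payroll process", DEPENDS_ON)` puts the recovery site first and the payroll process last, and leaves out the test lab because no business process in scope needs it.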
I have spent a good few weeks over the past few months examining the flow of data for a data storage solution. The purpose was to find the bottlenecks in the data being transported between locations that are several hundred miles apart. The analysis looked at the overall storage architecture in both sites including the time it takes the data to travel based on the bandwidth and latency of the fibre optic cables and the intermediate components.
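A rough feel for how bandwidth and latency combine over that sort of distance can be had from a simple model like the one below. It is a sketch under stated assumptions: light travels at roughly 200,000 km per second in optical fibre, delays in intermediate components and protocol overhead are ignored, and the 500 km distance and 64 KiB window in the usage note are made-up figures, not numbers from my analysis.

```python
FIBRE_KM_PER_S = 200_000  # approximate speed of light in optical fibre


def propagation_delay_s(distance_km: float) -> float:
    """One-way delay from distance alone (no switches, routers or arrays)."""
    return distance_km / FIBRE_KM_PER_S


def transfer_time_s(size_bytes: float, bandwidth_bps: float,
                    distance_km: float) -> float:
    """Time to push the data onto the link plus one-way propagation delay."""
    return size_bytes * 8 / bandwidth_bps + propagation_delay_s(distance_km)


def window_limited_throughput_bps(window_bytes: float,
                                  distance_km: float) -> float:
    """Upper bound when a sender waits a round trip per window of data."""
    round_trip_s = 2 * propagation_delay_s(distance_km)
    return window_bytes * 8 / round_trip_s
```

For example, `window_limited_throughput_bps(64 * 1024, 500)` gives about 105 Mbit/s, so over a long link a small window can be the bottleneck long before the raw bandwidth of the fibre is.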
It occurred to me that the architectural and engineering patterns I use as an IT Architect are equally applicable to other fields of engineering and I wondered whether systems architecture skills could be transferable if supported by engineering specialists in specific disciplines.
This may be important because there is anecdotal evidence that in complex systems the cost of developing IT infrastructure and applications is becoming the biggest proportion of investment. This may increase the need for IT Architects to have a greater understanding of how to integrate IT systems with other engineering disciplines.
I thought it would be interesting to compare my data storage solution with a chemical engineering solution to see what common architecture and engineering patterns there are. To do that I took the problem I had of analysing the transfer of data and applied it to the transfer of fluids. Similar architectures exist for the transfer of fluids, such as the transfer of oil products between production, processing and storage/distribution sites.
Some years ago there was a worm outbreak at a client we had just started managing, and we urgently needed to identify every infected server and work on a plan to remove the worm using our own custom automated scripts (as the anti-virus vendors did not have a solution). We thought we were in a good position, as the anti-virus team told me they had 99% of servers covered and the automation team said they had similar coverage.
If only it was that simple! It was only when we compared the two lists of servers managed by the two teams that we discovered only 80% of the devices were the same. A large number of devices were inactive in both tools, and we discovered the state of servers in some countries was unknown as the owners of the servers were undefined. As we went on we found devices that had not been fully built and were not recorded anywhere.
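The comparison itself is simple set arithmetic, as the sketch below shows. The function name and the choice to measure overlap against the combined list from both tools are my assumptions for illustration; the actual comparison was done against the tools' own exports.

```python
def compare_inventories(antivirus, automation):
    """Compare two lists of server names and report where they differ."""
    av, auto = set(antivirus), set(automation)
    both = av & auto        # servers known to both tools
    combined = av | auto    # every server known to at least one tool
    return {
        "in_both": both,
        "antivirus_only": av - auto,
        "automation_only": auto - av,
        # Assumption: overlap is measured against the combined list.
        "overlap_pct": 100 * len(both) / len(combined) if combined else 100.0,
    }
```

Each team can honestly report near-complete coverage of its own list while the `antivirus_only` and `automation_only` sets show how many servers the other tool has never heard of.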
This story demonstrates the challenge all businesses face of how to continuously maintain a complete and validated server inventory so that the security of the IT environment can be maintained. Without a trusted server inventory the required security controls cannot be effectively maintained, whether that is protecting the device from malicious code or controlling access to critical business data.
The solution to this problem is a closed loop process that maintains a trusted server inventory supported by network asset discovery. The process is all about automatically detecting when a device is added to the network by scanning regularly and then detecting when the device has become inactive. Critical to the process being effective is the use of the same tool to provide all systems and security management so there is only one list of servers being maintained. It means reporting on coverage for anti-virus tooling can be trusted and, if problems occur, the remediation of the protection can be automated.
If you want more detail on the process, read my previous post on "Security without a Trusted Baseline?". As an Architect, the solution meets my top principle of keeping a solution simple! There is no need to introduce manual processes to reconcile different tools from many different vendors.
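The core of the loop can be sketched in a few lines: each scan adds newly seen devices to the single inventory and flags devices that have not been seen for a while. This is a minimal sketch, not a real discovery tool; the `reconcile` function, the inventory layout and the seven day inactivity threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def reconcile(inventory: dict, seen_now: set, now: datetime,
              inactive_after: timedelta = timedelta(days=7)):
    """Update the single inventory from one scan.

    inventory maps server name -> time it was last seen on the network.
    Returns (added, inactive): servers new to the inventory, and servers
    not seen for at least `inactive_after` (a hypothetical threshold).
    """
    added = seen_now - set(inventory)
    for host in seen_now:
        inventory[host] = now  # record when the device was last seen
    inactive = {host for host, last_seen in inventory.items()
                if now - last_seen >= inactive_after}
    return added, inactive
```

Run after every scheduled scan, the `added` set drives onboarding of new servers into the security tooling and the `inactive` set drives follow-up with the owner, which is what closes the loop.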