Lessons Learned – Disaster Recovery Plans

[Image: nuclear mushroom cloud]

We have been working with a number of clients reviewing their Disaster Recovery Plans (DRPs). The move to remote working due to Covid-19 has triggered the Business Continuity Plan (BCP), and that has turned the spotlight on how the business reacts when faced with adversity. When reviewing the DRPs we regularly found that they were not fit for purpose and untested. Common mistakes we have seen are:

  • The Recovery Time Objective (RTO) is not achievable; it is in effect a number presented to the board that makes everyone happy. One client had a lot of legacy systems, including made-to-order industrial protocol interface cards; they held no spares, and the lead time exceeded the RTO by some 30 days!
  • The Recovery Point Objective (RPO) is not achievable; again, it is just a number plucked from thin air that is deemed acceptable. One client ran a facility across two sites with 24/7/365 operations and relied on the transfer of backup tapes between sites at 16:00 every day. That would be fine, except the transfer did not happen at weekends or on public holidays due to reduced manning levels, yet the stated RPO was 24 hours (a worked example of this gap follows the list below).
  • No consideration has been given to the possibility that the system expert may not be available to recover the plant, so less experienced operators may be following the plan. In our experience there is always that one person on a site who is the fountain of knowledge, the one who knows exactly where to kick the turbine to get it to start. That person is a single point of failure. DRPs and Incident Response Plans (IRPs) that rely on knowledge held only in someone's head are worse than no plan: the stress placed on a person who has diligently followed the plan, only to find it still does not work because they do not know the trick, is unacceptable.
  • No consideration has been given to the sequence required to recover from a true disaster, such as multiple servers bricked by ransomware. When many servers are taken offline there is often a very critical recovery sequence, yet every one of them may carry the same recovery priority in the DRP with no flow chart documenting the order. The worst case is a chicken-and-egg dependency between systems, which is normally only discovered during real DR testing (a minimal dependency-ordering sketch follows the list below).
  • Failing to manage change. One client had not shut down one of their server rooms for a number of years, and it was not until a planned maintenance window initiated a managed shutdown that they discovered that, because everything was set to boot when power was re-applied, the inrush current was so great it kept tripping the main breakers. Do your power consumption calculations as part of the management of change (MoC) process, and do the cooling calculations too: we have seen server rooms where the air conditioning can keep the room cool enough in normal running, but is not sized to bring the temperature back down after a heat-soak event (a back-of-the-envelope power check is sketched after the list below).
  • No consideration that things may get worse before they get better. Our experience shows that including a flowchart-based process within the DRP, with space to record target times and achieved times, lets people realise that the recovery is behind schedule before it is too late. Having prepared the flowchart with the luxury of time to consider actions, it is easy to initiate additional contingency plans, as opposed to having management standing there shouting “We are behind schedule, what are you going to do?” (a sketch of this target-versus-achieved tracking follows the list below).
  • No testing of the plan. There is a military motto: train hard, fight easy. If people know what to do, recovery can be achieved with ease; if the first time someone opens the DRP is when the server room is silent, it is going to be a long night. By testing the plan, improvements can be made and efficiencies achieved. One client had shelves of spare PLC cards, which was a shame because several of the electrolytic capacitors had dried out and the cards were useless. We worked with the client to build a facility to keep the hardware warm, which had the added benefit that, as part of their MoC, they flashed the hardware with the backups, resulting in a faster response time in the event of failure.
  • No hard copies. If I had a pound for every time I was told the plans were on SharePoint I would be a very rich man; if the network is down, so are the plans.
  • No involvement of the OEMs. When it really goes wrong, every organisation will need the assistance of the OEMs, be it Microsoft, VMware, Siemens, ABB etc. If the SLA with the OEM is 8 hours but the call only goes out with 8 hours left in the RTO window, there is going to be trouble. Again, this is where our experience with flowchart-based response plans and the use of timings allows escalation to be managed before it is too late (see the escalation-deadline sketch after the list below).
  • Not using a standards-based approach. There are several established standards that we recommend reading as part of developing a fit-for-purpose Disaster Recovery and Incident Response planning process, for example:
    • ISO/IEC 27031:2011 Information technology — Security techniques — Guidelines for information and communication technology readiness for business continuity
    • ISO 22301:2019 Security and resilience — Business continuity management systems — Requirements
    • NIST SP 800-34 Rev. 1 Contingency Planning Guide for Federal Information Systems
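
To make a few of the points above concrete, the sketches below are minimal Python examples; every name, schedule and number in them is invented for illustration rather than taken from a client system. The first sketch works through the RPO example: it computes the worst-case gap between tape transfers on a schedule that skips weekends and public holidays, and compares it with a stated 24-hour RPO.

```python
from datetime import datetime, timedelta

# Assumed schedule: tapes move between sites at 16:00 on working days only.
TRANSFER_HOUR = 16
HOLIDAYS = {datetime(2020, 12, 25).date(), datetime(2020, 12, 28).date()}  # example public holidays

def is_transfer_day(day):
    """Transfers happen Monday to Friday, excluding public holidays."""
    return day.weekday() < 5 and day not in HOLIDAYS

def worst_case_gap(start, days=60):
    """Largest interval between consecutive transfers over a sample window."""
    transfers = [datetime.combine(start + timedelta(days=i), datetime.min.time())
                 + timedelta(hours=TRANSFER_HOUR)
                 for i in range(days)
                 if is_transfer_day(start + timedelta(days=i))]
    return max(later - earlier for earlier, later in zip(transfers, transfers[1:]))

gap = worst_case_gap(datetime(2020, 12, 1).date())
stated_rpo = timedelta(hours=24)
print(f"Worst-case data loss window: {gap}; stated RPO: {stated_rpo}")
if gap > stated_rpo:
    print("The RPO is not achievable with this transfer schedule.")
```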
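
The next sketch covers the recovery-sequence point. Given a hypothetical map of which servers depend on which, Python's standard graphlib derives a recovery order, and a chicken-and-egg dependency is reported as a cycle instead of being discovered mid-recovery.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# Hypothetical dependency map: each server lists the servers that must be up first.
dependencies = {
    "domain_controller": [],
    "storage_array":     [],
    "hypervisor":        ["storage_array"],
    "database":          ["hypervisor", "domain_controller"],
    "historian":         ["database"],
    "scada_server":      ["historian", "domain_controller"],
}

try:
    order = list(TopologicalSorter(dependencies).static_order())
    print("Recovery sequence:", " -> ".join(order))
except CycleError as err:
    # Two systems each needing the other first is the chicken-and-egg scenario;
    # resolve it in the plan, not during the outage.
    print("Chicken-and-egg dependency found:", err.args[1])
```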
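
For the management-of-change point, a back-of-the-envelope check like the one below (the rack loads and breaker rating are made-up figures) is often enough to show whether a simultaneous cold start will trip the supply; staggering the boot order is the usual fix.

```python
# Hypothetical per-rack loads in amps: steady-state draw and cold-start inrush.
racks = {
    "rack_a": {"steady_a": 16, "inrush_a": 48},
    "rack_b": {"steady_a": 20, "inrush_a": 60},
    "rack_c": {"steady_a": 12, "inrush_a": 36},
}
MAIN_BREAKER_A = 100  # assumed main breaker rating in amps

steady = sum(rack["steady_a"] for rack in racks.values())
inrush = sum(rack["inrush_a"] for rack in racks.values())

print(f"Steady-state load: {steady} A of {MAIN_BREAKER_A} A available")
print(f"Simultaneous cold-start inrush: {inrush} A of {MAIN_BREAKER_A} A available")
if inrush > MAIN_BREAKER_A:
    print("Everything booting on power-up will trip the main breaker; stagger the start-up.")
```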
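
The flowchart-with-timings idea can be reduced to something as simple as the following; the step names and durations are placeholders. The point is that every step carries a target time, so slippage becomes visible as soon as it happens rather than at the end.

```python
from datetime import timedelta

# Placeholder recovery steps: (step name, target duration, achieved duration or None).
steps = [
    ("Restore domain controller", timedelta(hours=1), timedelta(minutes=50)),
    ("Restore storage array",     timedelta(hours=2), timedelta(hours=3)),
    ("Restore hypervisor",        timedelta(hours=2), None),  # not started yet
]

target_elapsed = timedelta()
achieved_elapsed = timedelta()
for name, target, achieved in steps:
    if achieved is None:
        print(f"{name}: not started (target {target})")
        continue
    target_elapsed += target
    achieved_elapsed += achieved
    slip = achieved_elapsed - target_elapsed
    status = "on schedule" if slip <= timedelta() else f"behind schedule by {slip}"
    print(f"{name}: target {target}, achieved {achieved} ({status})")
```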
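
Finally, the OEM escalation point is really just arithmetic: the latest safe time to raise the call is the RTO deadline minus the OEM's SLA response time, minus whatever recovery work remains after they respond. The times below are assumed.

```python
from datetime import datetime, timedelta

incident_start = datetime(2021, 3, 6, 2, 15)  # assumed time the disaster was declared
rto            = timedelta(hours=24)          # assumed RTO
oem_sla        = timedelta(hours=8)           # assumed OEM response SLA
remaining_work = timedelta(hours=6)           # assumed recovery effort once the OEM responds

rto_deadline    = incident_start + rto
latest_oem_call = rto_deadline - oem_sla - remaining_work

print(f"RTO deadline:          {rto_deadline}")
print(f"Latest OEM escalation: {latest_oem_call}")
if datetime(2021, 3, 6, 18, 0) > latest_oem_call:  # e.g. the current time during the incident
    print("Escalate to the OEM now - waiting any longer makes the RTO unachievable.")
```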

These are just a few examples of areas where we have seen room for improvement. If you think your BCP or DRP could benefit from a review, please get in touch.