Friday 5 October 2018

Disaster Recovery: A Dynamic Redundancy Approach

The problem with disaster planning is that it is not rehearsed. When you need to retrieve a file from backup is when you discover that your backup has been broken for three months.

Modern cloud systems, based upon software defined infrastructure and redundant, auto-scaling fleets of micro-services, come with disaster recovery built in. They are designed to be resilient against DDoS attacks, unexpected peaks in usage and continent wide unavailability.

Some systems have yet to migrate to outsourced infrastructure, some never will migrate. For these systems we need a Disaster Recovery Strategy which can be implemented within reasonable costs and ideally does not suffer from the fails when needed feature of many backup systems. One answer is to do regular fire drills. No one would dispute the importance of fire drills in saving lives and ensuring that people know what to do in the case of a real fire, however we all know there is a big difference between a rehearsal and the real thing.

The key insight in the modern cloud architectures is that every version of a system is the same (at a particular time).

We can reduce this to a minimal redundant system: a pair of identical systems with one designated Primary and the other Secondary, with a standard data mirroring link from Primary to Secondary.

To ensure that both elements of the pair really can function as the Primary you could rehearse a cutover one weekend.

But if the two systems really are identical then there is no reason to reverse the cutover at the end of the rehearsal. The old Secondary is the new Primary, the old Primary is the new Secondary. The Primary can be swapped at a periodicity the business is comfortable with, say twice a year.

This Dynamic Redundancy strategy ensures that your Disaster Recovery works when you need it to and can be adjusted according to the business' appetite for risk.