ITIL 4 and Cloud: The Service Continuity Management Practice

The ITIL 4 service continuity management practice

Here’s how the ITIL 4 Foundation Edition describes the purpose of the service continuity management practice: “…to ensure that the availability and performance of a service are maintained at sufficient levels in case of a disaster. The practice provides a framework for building organizational resilience with the capability of producing an effective response that safeguards the interests of key stakeholders and the organization’s reputation, brand, and value-creating activities.”

ITIL 4 continues to say that “Service continuity management supports an overall business continuity management (BCM) and planning capability by ensuring that IT and services can be resumed within required and agreed business timescales following a disaster or crisis. It is triggered when a service disruption or organizational risk occurs on a scale that is greater than the organization’s ability to handle it with normal response and recovery practices such as incident and major incident management. An organizational event of this magnitude is typically referred to as a disaster.”

For example, a cyber-attack such as a Distributed Denial of Service (DDoS) attack can disrupt a business’s processing of transactions, causing significant customer unhappiness and brand damage. Or consider an attack in which someone gains unauthorized access to a corporate system and then irretrievably deletes business data – a crippling event for any organization. The speed with which an organization can get back up and running when faced with such attacks is just one measure of effective service continuity management.

How service continuity management works in the cloud

Businesses of all sizes are using cloud to enable faster disaster recovery of their critical IT systems – importantly, without incurring the infrastructure expense of a second physical site. The leading clouds support many disaster recovery architectures, from those built for smaller workloads to enterprise solutions that enable rapid failover at scale. AWS, for example, provides a set of cloud-based disaster recovery services that enable fast recovery of both your IT infrastructure and data.

Leading cloud service providers, such as AWS, also build data durability and system availability into the fabric of their clouds. For example:

  • Data in storage services such as S3 is replicated across multiple devices and availability zones to give eleven nines of durability (protecting against the cloud losing your data). These services also offer features that stop humans from deleting data, such as requiring multi-factor authentication before a delete is allowed, and versioning, where objects are never actually deleted but simply marked as deleted and therefore remain recoverable (a brief configuration sketch follows this list).
  • Each AWS region has multiple geographically separated availability zones (AZs), typically three or more. These are connected via private, high-speed, high-bandwidth fiber with latency so low that a Microsoft SQL Server database can be spread across two AZs and behave as if all of its nodes were in the same data center.
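
As an illustration of those data-protection features, here is a minimal sketch using the AWS SDK for Python (boto3) that turns on S3 versioning and, optionally, MFA delete. The bucket name, account ID, and MFA device details are hypothetical placeholders, not values from this article.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-continuity-bucket"  # hypothetical bucket name

# Enable versioning: a delete only adds a "delete marker", so earlier
# object versions remain recoverable rather than being destroyed.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Optionally require MFA to permanently delete object versions. This call
# must be made with the bucket owner's root credentials and needs the MFA
# device serial number and a current token code (placeholder values below).
s3.put_bucket_versioning(
    Bucket=bucket,
    MFA="arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456",
    VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
)
```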

Cloud features such as AZs tend to be unique to the leaders in the cloud marketplace because of the cost and complexity of providing them. More importantly, few, if any, enterprises can build these themselves – which is why cloud democratizes access to capabilities that “normal enterprises” could not otherwise afford. Another example is protection against DDoS attacks, delivered through the cloud service provider’s massive global network.

Another way the cloud helps with disaster avoidance and recovery is through application architectures, such as Pilot Light, that replicate applications and data across multiple regions. In this pattern, a minimal standby instance of the application runs at low cost in another region, ready to take over should the active region fail. Within minutes of a disaster event in the active region, the “pilot light” can be scaled up into a fully running system carrying the full production load.
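
To make the Pilot Light pattern more concrete, the sketch below shows one hypothetical way to “turn up” the standby region with boto3: promoting a cross-region database read replica to a writable primary and scaling a dormant application tier up to production capacity. The region, database identifier, Auto Scaling group name, and capacity figures are illustrative assumptions, not a prescribed AWS recovery procedure.

```python
import boto3

STANDBY_REGION = "eu-west-1"  # hypothetical standby region

rds = boto3.client("rds", region_name=STANDBY_REGION)
autoscaling = boto3.client("autoscaling", region_name=STANDBY_REGION)

def fail_over_to_standby():
    """Grow the 'pilot light' in the standby region into a full production stack."""
    # Promote the cross-region read replica to a standalone, writable database.
    rds.promote_read_replica(DBInstanceIdentifier="app-db-replica")

    # Scale the dormant application tier from near zero up to production capacity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="app-web-asg",
        MinSize=2,
        DesiredCapacity=4,
        MaxSize=8,
    )
    # Traffic would then be redirected to the standby region, for example
    # via a DNS failover record in Route 53.
```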

The top cloud service providers have, of course, had outages of some services and regions, but have yet to suffer a disaster. There are fears that leaders like AWS are “too big to fail”, but the cost and complexity of replicating an AWS-based system on another cloud are often beyond the reach and resources of enterprises, which typically also judge the risk too low to justify those costs.

Cloud dos and don’ts for service continuity management

The following dos and don’ts are general guidelines for great service continuity management in the cloud:

Do:

  • Understand the cloud services you employ in terms of Recovery Time Objective (RTO) and Recovery Point Objective (RPO). For example, cloud database backups and snapshots may need tuning to meet those objectives (see the sketch after this list).
  • Use all the data features relevant to disaster recovery, from backups, snapshots, and archiving to encryption (at rest and in transit) and multi-factor authentication to prevent data deletion.
  • Back up your data and keep a copy outside the cloud, either in another cloud or on-premises.
  • Restore. Be able to rebuild a system at the press of a button using solutions like Pilot Light and high levels of automation.
  • Test disaster recovery at low cost using the elasticity and agility of the cloud.
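
As a small example of the first “do”, the sketch below tunes an RDS database’s automated backups with boto3 and takes an on-demand snapshot before a risky change. The instance identifier, retention period, and backup window are assumptions chosen to illustrate RPO tuning, not recommended values.

```python
import boto3

rds = boto3.client("rds")

# Tune automated backups to match the agreed RPO: keep 14 days of backups
# and take them in a low-traffic window (times are in UTC).
rds.modify_db_instance(
    DBInstanceIdentifier="orders-db",      # hypothetical instance
    BackupRetentionPeriod=14,
    PreferredBackupWindow="02:00-03:00",
    ApplyImmediately=True,
)

# Take an on-demand snapshot before a risky change, shrinking the
# worst-case data loss for that change to near zero.
rds.create_db_snapshot(
    DBInstanceIdentifier="orders-db",
    DBSnapshotIdentifier="orders-db-pre-release-snapshot",
)
```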

Don’t:

  • Miss out on the built-in AWS features. Too many people fail to enable basic features such as database high availability across two AZs – which requires only a checkbox selection (see the sketch after this list).
  • Lift-and-shift non-cloud practices onto the cloud. For example, don’t build duplicate, redundant, standby systems that burn cash while remaining unused. Instead, leverage the elasticity of the cloud.
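
To show how little effort the “checkbox” features need, a hypothetical boto3 call that enables Multi-AZ high availability for an RDS database might look like the sketch below; the instance identifier is a placeholder.

```python
import boto3

rds = boto3.client("rds")

# The console "checkbox" for database high availability corresponds to a
# single API parameter: enabling Multi-AZ creates a synchronously
# replicated standby instance in a second availability zone.
rds.modify_db_instance(
    DBInstanceIdentifier="orders-db",  # hypothetical instance
    MultiAZ=True,
    ApplyImmediately=True,
)
```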

The cloud brings an effective service continuity management practice within the reach of everyone. By understanding the service continuity features of leading clouds such as AWS, an ITSM practitioner can appreciate how this practice can be improved by using the cloud.