The purpose of the ITIL 4 availability management practice is to ensure that services deliver agreed levels of availability to meet the needs of customers and users.
When business services run on a cloud such as AWS, Azure, or Google Cloud, service availability also depends on the availability of the individual cloud services used to run an application and provide a service.
Availability management activities thus include:
- Negotiating and agreeing on achievable targets for availability
- Designing infrastructure and applications that can deliver required availability levels
- Ensuring that services and components are able to collect the data required to measure availability
- Monitoring, analyzing, and reporting on availability
- Planning improvements to availability.
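When negotiating targets, it can help to translate a percentage into the downtime it actually permits. The following is a minimal sketch, not ITIL guidance: the function name `downtime_budget` and the 30-day reporting period are assumptions chosen for illustration.

```python
from datetime import timedelta


def downtime_budget(target_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` while still meeting the target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    # The share of the period that may be lost is (100 - target) percent.
    return period * (1 - target_percent / 100)


if __name__ == "__main__":
    reporting_period = timedelta(days=30)  # assumed reporting period
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget(target, reporting_period)
        print(f"{target}% over 30 days allows {budget} of downtime")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is a useful sanity check on whether a proposed target is achievable.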
The two key metrics for measuring availability performance are:
- Mean Time Between Failures (MTBF) – bigger is better here; this is the average time between outage events
- Mean Time to Restore Service (MTRS) – smaller is better here; this is the average time from the start of an outage to the restoration of service.
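As a rough illustration of how these two metrics can be derived from outage records, here is a small sketch. It assumes one common convention (MTBF as total uptime divided by the number of outages, and the restore metric as total downtime divided by the number of outages); the function name and the record format are hypothetical.

```python
from datetime import datetime, timedelta


def availability_metrics(outages, period_start, period_end):
    """Compute (MTBF, mean time to restore, availability) for a reporting period.

    `outages` is a list of (start, end) datetime pairs, assumed to be
    non-overlapping and to fall entirely inside the reporting period.
    """
    if not outages:
        raise ValueError("at least one outage is needed to compute the metrics")
    total = period_end - period_start
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = total - downtime
    count = len(outages)
    mtbf = uptime / count                # average uptime per outage
    mean_time_to_restore = downtime / count  # average outage duration
    availability = uptime / total        # timedelta / timedelta gives a float
    return mtbf, mean_time_to_restore, availability


if __name__ == "__main__":
    start = datetime(2024, 1, 1)  # illustrative dates only
    end = start + timedelta(days=30)
    records = [
        (start + timedelta(days=5), start + timedelta(days=5, hours=1)),
        (start + timedelta(days=20), start + timedelta(days=20, hours=3)),
    ]
    mtbf, restore, availability = availability_metrics(records, start, end)
    print(f"MTBF: {mtbf}, mean time to restore: {restore}, availability: {availability:.4%}")
```

With this convention, availability equals MTBF divided by the sum of MTBF and the mean time to restore, so improving either metric improves the headline percentage.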
The great news for IT service managers who are using ITIL 4 is that the cloud is inherently better than non-cloud systems for availability.
How availability works in the cloud
The leading cloud service providers (CSPs) have availability, or resiliency as they sometimes call it, as one of their central tenets. AWS, for example, has reliability as one of the pillars of its very popular Well-Architected Framework.
The leading clouds are designed to be highly available, with billions of dollars invested in redundancy features. Measured against the ITIL 4 availability management practice guidance, as documented in the ITIL 4 Foundation publication, they compare well:
- Availability metrics are publicly stated in terms of uptime and durability, and leading clouds often publish post-mortems on outages.
- Data is replicated many times in multiple locations. Amazon S3, for example, is designed for 99.999999999% (11 nines) durability.
- The leading cloud service providers use multiple availability zones in a region. Your application can be “spread” across these zones to be resilient both to scale events, such as seasonal demand, and to local system failures.
- The leading clouds have a wealth of data to feed into availability metrics.
- Leading clouds have little-to-no planned or unplanned downtime, unlike traditional non-cloud systems.
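To see why spreading an application across availability zones helps, consider the simple redundancy arithmetic below. This is a sketch under strong assumptions: zone failures are treated as independent, the application is assumed able to serve from any single surviving zone, and the 99% per-zone figure is purely illustrative rather than any provider's published number.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Availability of a service that stays up while at least one zone is up.

    Assumes zone failures are independent, which real outages can violate.
    """
    if zones < 1:
        raise ValueError("zones must be at least 1")
    if not 0 <= zone_availability <= 1:
        raise ValueError("zone_availability must be between 0 and 1")
    # The service is down only when every zone is down at the same time.
    return 1 - (1 - zone_availability) ** zones


if __name__ == "__main__":
    per_zone = 0.99  # illustrative figure, not a provider SLA
    for count in (1, 2, 3):
        print(f"{count} zone(s): {combined_availability(per_zone, count):.6f}")
```

Under these assumptions each extra zone multiplies the probability of a full outage by the per-zone failure probability, which is why multi-zone designs feature so heavily in cloud reliability guidance.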
That said, even though the cloud is inherently highly available, there are things you must do and things you must avoid to benefit from the cloud’s availability.
Cloud dos and don’ts for availability management
To exploit the cloud’s inherent reliability characteristics in improving your availability management practices, please consider the following dos and don’ts:
- Architect your application to scale out across availability zones to be resilient to scale and outage events. In AWS, use Availability Zones with Auto Scaling Groups.
- Do use automated recovery procedures linked to monitoring. In AWS, use Auto Scaling Groups linked with CloudWatch to automatically respond to events.
- Do read the Reliability pillar of the AWS Well-Architected Framework even (perhaps, especially) if your job role isn’t “cloud administrator.”
- Do practice recovery, from local failure to full-system disaster recovery. Cloud systems can be spun up and deleted, meaning that there’s no massive investment in standby systems and there are more opportunities to test recovery.
- Don’t use non-cloud capacity management. Instead, avoid resource saturation (and outages) by using the elastic nature of cloud. In AWS, this means using more, smaller scale-out EC2 instance types with Auto Scaling Groups.
- Don’t lift-and-shift non-cloud compute onto the cloud and expect it to instantly leverage all cloud features. Often, migrated non-cloud applications need to be remodeled to exploit the cloud.
- In general, don’t treat cloud like you treat non-cloud – AWS is better than any on-premises platform, but it’s also different. Understanding, and exploiting, the differences is the key to success.
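As a concrete illustration of the Auto Scaling Group advice in the list above, the sketch below builds the request for an Auto Scaling Group that spans several subnets, each assumed to sit in a different Availability Zone. The helper names, group name, launch template name, and subnet IDs are all hypothetical. The parameter keys follow the boto3 `create_auto_scaling_group` call as I understand it, so check them against the current AWS documentation before relying on this.

```python
def multi_az_asg_request(name, launch_template_name, subnet_ids, min_size=2, max_size=6):
    """Build keyword arguments for an Auto Scaling Group spread across subnets.

    Each subnet is assumed to live in a different Availability Zone.
    """
    if len(subnet_ids) < 2:
        raise ValueError("use at least two subnets (zones) for resilience")
    if not 0 < min_size <= max_size:
        raise ValueError("need 0 < min_size <= max_size")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        # Subnet IDs are passed as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # Replace instances that fail EC2 status checks.
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,
    }


def create_group(request):
    """Send the request to AWS. Needs boto3, credentials, and a configured region."""
    import boto3  # third-party dependency, imported lazily so the builder runs without it

    boto3.client("autoscaling").create_auto_scaling_group(**request)


if __name__ == "__main__":
    # Hypothetical names and subnet IDs, shown for illustration only.
    request = multi_az_asg_request(
        "web-tier", "web-template", ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"]
    )
    print(request)
```

Keeping the request builder separate from the AWS call makes the zone and sizing rules easy to review and test before anything is created in a real account.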
Applying the ITIL 4 availability management practice to your organization’s use of the cloud should be a successful experience. By exploiting the cloud’s reliability characteristics, which might mean rearchitecting applications, you should see substantial improvements in both MTBF and mean time to restore service (MTRS).
Also, not all organizations have dedicated staff for availability, so off-loading applications onto the cloud not only improves availability but also reduces the staff burden.
So that’s my view on ITIL 4 Foundation’s guidance on the availability management practice in a cloud context – what would you add? Please let me know in the comments.