The ITIL 4 incident management practice

According to ITIL Foundation: ITIL 4 Edition, “The purpose of the incident management practice is to minimize the negative impact of incidents by restoring normal service operation as quickly as possible.”

ITIL 4 goes on to state: “Incident management can have an enormous impact on customer and user satisfaction, and on how customers and users perceive the service provider.

  • Every incident should be logged and managed to ensure that it is resolved in a time that meets the expectations of the customer and user.
  • Target resolution times are agreed, documented, and communicated to ensure that expectations are realistic.
  • Incidents are prioritized based on an agreed classification to ensure that incidents with the highest business impact are resolved first.”
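
To make those three points concrete, here is a minimal Python sketch of logging incidents, attaching a target resolution time, and ordering the work queue by priority. The impact/urgency matrix, the priority levels, and the target times are placeholder assumptions for illustration only, not values from ITIL 4; your organization would agree its own classification.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical classification: priority is looked up from (impact, urgency).
# 1 is the most urgent. These levels are placeholders, not ITIL 4 values.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("high", "low"): 3,
    ("medium", "high"): 2,
    ("medium", "medium"): 3,
    ("medium", "low"): 4,
    ("low", "high"): 3,
    ("low", "medium"): 4,
    ("low", "low"): 4,
}

# Hypothetical agreed target resolution time per priority level.
TARGET_RESOLUTION = {
    1: timedelta(hours=1),
    2: timedelta(hours=4),
    3: timedelta(hours=24),
    4: timedelta(days=3),
}


@dataclass
class Incident:
    summary: str
    impact: str   # "high", "medium", or "low"
    urgency: str  # "high", "medium", or "low"

    @property
    def priority(self) -> int:
        return PRIORITY_MATRIX[(self.impact, self.urgency)]

    @property
    def target(self) -> timedelta:
        return TARGET_RESOLUTION[self.priority]


def work_queue(incidents):
    """Order logged incidents so the highest business impact comes first."""
    return sorted(incidents, key=lambda incident: incident.priority)
```

For example, `work_queue([Incident("Typo on intranet", "low", "low"), Incident("Checkout down", "high", "high")])` puts the checkout outage first, with a one-hour target under these assumed values.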

The good news for IT service managers who are using ITIL 4 with cloud is twofold. First, incident management is shared with the cloud service provider, which reduces the burden on your own team. Second, by exploiting the availability, elasticity, programmability, and automation of the cloud, incidents should be less frequent, which reduces the burden further.

How incident management works in the cloud

Incident management is a shared responsibility between your organization and your cloud service provider when you run your business and applications on the cloud. AWS, for example, is responsible for avoiding and resolving incidents in the cloud, such as those affecting the physical network infrastructure and the services that run on it. Your organization, as the customer, is responsible for incidents on the cloud, such as issues with your applications.

It’s important to note that the above isn’t negotiable – a public cloud is not like traditional outsourcing. A public cloud has defined, standard, common features that are the same for every customer, whether you spend a dollar or a million dollars per month.

However, the reality is that if your service is running on the cloud, then your organization is still – ultimately – responsible for incidents that impact your entire service, whether they’re caused by the part run by you or the part run by the cloud service provider. Therefore, the “normal” ITIL 4 incident management practices, such as coordination, definitely still apply, but you’ll need to enhance them with agreed ways to coordinate with the (external) cloud service provider’s support desk.

Cloud dos and don’ts for incident management

Some people imagine the cloud as an operating system, others as a large computer spanning the globe. However you view it, the following will always apply: incidents in the cloud are handled by the cloud service provider, and incidents on the cloud are handled by your organization and your incident management practice.
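
As a rough illustration of that rule, the Python sketch below routes an incident to whoever resolves it, while keeping your organization accountable for the service as a whole. The layer names are hypothetical; where the boundary actually sits depends on the services you use and on your provider's shared responsibility model.

```python
from enum import Enum


class Owner(Enum):
    PROVIDER = "cloud service provider"
    CUSTOMER = "your organization"


# Hypothetical layer names, for illustration only.
PROVIDER_LAYERS = {"physical_network", "datacenter", "hypervisor", "managed_service"}
CUSTOMER_LAYERS = {"application", "guest_os", "data", "configuration"}


def resolver_for(layer: str) -> Owner:
    """Incidents in the cloud go to the provider; incidents on the cloud stay with you."""
    if layer in PROVIDER_LAYERS:
        return Owner.PROVIDER
    if layer in CUSTOMER_LAYERS:
        return Owner.CUSTOMER
    raise ValueError(f"Unclassified layer: {layer!r}")


def route(layer: str) -> tuple[Owner, Owner]:
    """Return (who resolves the incident, who stays accountable for the service)."""
    # Your organization remains accountable to its own users either way.
    return resolver_for(layer), Owner.CUSTOMER
```

Under these assumptions, `route("managed_service")` hands resolution to the provider but still names your organization as accountable, which is why your incident practice needs an agreed way of working with the provider's support desk.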

Should an incident occur on the cloud, your priority in incident management is to resolve it as soon as possible. That’s no different from on-premises scenarios, but the cloud does offer a number of ways to speed up resolution.

Do:

  • Use automation to resolve incidents such as server failure immediately, leading to a much shorter Mean Time to Resolution (MTTR). An example is Amazon EC2 Auto Scaling, which can replace failed servers automatically.
  • Offload more work to the cloud by using the managed services available. This makes incidents inside those services the responsibility of the cloud service provider. And, assuming that you’ve chosen your cloud service provider well, they’re experts in delivering these services with very good uptime and few incidents.
  • Use higher-order managed services like Amazon Relational Database Service (Amazon RDS), because they handle database clustering (high availability) and build in backups, snapshots, and restores for you.
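
The automation point can be pictured as a reconciliation loop: compare the fleet you have with the fleet you want, and replace anything unhealthy. The Python sketch below is a toy model of that idea only. It is not the Amazon EC2 Auto Scaling API, and `Server`, `launch_server`, and `reconcile` are hypothetical names invented for this example.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True


_ids = itertools.count(1)


def launch_server() -> Server:
    """Stand-in for launching a fresh server from a known-good image."""
    return Server(name=f"web-{next(_ids)}")


def reconcile(fleet: list[Server], desired: int) -> list[Server]:
    """Drop unhealthy servers and launch replacements up to the desired capacity."""
    survivors = [server for server in fleet if server.healthy]
    missing = max(0, desired - len(survivors))
    return survivors + [launch_server() for _ in range(missing)]
```

Run on a schedule or on a health-check event, a loop like this restores capacity without anyone opening a ticket, which is where the shorter MTTR comes from.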

Don’t:

  • Expect the cloud service provider to be responsible for everything you run on the cloud. Remember the shared responsibility model (AWS, for example, publishes its own).
  • Treat cloud resources as pets; treat them like cattle instead. It’s usually faster (a better Mean Time to Restore) to rebuild than to fix. If a server dies, just redeploy a copy rather than prolonging the outage by hand-fixing a brittle server.
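
If you want to check the rebuild-versus-fix claim against your own incident records, MTTR is simply the average time from detection to restoration. The Python sketch below shows the arithmetic; the function name and the timings in the usage note are made up for illustration and are not measurements.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean of (restored - detected) over a list of (detected, restored) pairs."""
    if not incidents:
        raise ValueError("Need at least one incident to compute MTTR")
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)
```

With two hypothetical hand-fixed outages of 90 and 150 minutes, `mttr` returns two hours; with two hypothetical rebuilds of 5 and 7 minutes, it returns six minutes. Tracking both figures separately is one way to see whether the cattle approach is paying off for you.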

Applying your organization’s ITIL 4 incident management practice to its use of the cloud should be a very successful experience – but it requires a shift in thinking and application. For instance, cloud platforms are designed for high availability, so if you offload more of your service management to the cloud service provider, you should be able to scale back your own incident management efforts accordingly.

So, that’s my view on ITIL 4 Foundation’s guidance on the incident management practice in a cloud context – what would you add? Please let me know in the comments.