The ITIL 4 monitoring and event management practice

According to ITIL 4 Foundation Edition, “The purpose of the monitoring and event management practice is to systematically observe services and service components, and record and report selected changes of state identified as events. This practice identifies and prioritizes infrastructure, services, business processes, and information security events, and establishes the appropriate response to those events, including responding to conditions that could lead to potential faults or incidents.”

Two key parts of the monitoring and event management practice are:

  • Monitoring configuration items (CIs) and observing their state as it changes. Not every observed change is necessarily an “event,” and with cloud the CIs are no longer “bits of tin in a data center” but cloud services.
  • Dealing with events – often classified as information, warning, or error (or using a more complex scheme).
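The split between monitoring and events can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `classify` function, the `orders-db` CI, the `cpu_percent` metric, and the thresholds are all hypothetical, and real tools use richer rules. The point it shows is that every reading is monitored, but only a change of state is recorded as an event.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    INFORMATION = "information"
    WARNING = "warning"
    ERROR = "error"


@dataclass
class Event:
    ci: str            # the configuration item whose state changed
    metric: str
    value: float
    severity: Severity


def classify(ci: str, metric: str, value: float,
             warn_at: float, error_at: float,
             previous: Optional[Severity] = None) -> Optional[Event]:
    """Turn one monitored reading into an event only if the state changed."""
    if value >= error_at:
        severity = Severity.ERROR
    elif value >= warn_at:
        severity = Severity.WARNING
    else:
        severity = Severity.INFORMATION
    if severity == previous:
        return None  # observed, but same state as before: not an event
    return Event(ci, metric, value, severity)


# Hypothetical readings for a hypothetical CI: five observations, three events.
state: Optional[Severity] = None
for reading in [42.0, 45.0, 83.0, 97.0, 96.0]:
    event = classify("orders-db", "cpu_percent", reading,
                     warn_at=80, error_at=95, previous=state)
    if event is not None:
        state = event.severity
        print(event.ci, event.severity.value, event.value)
```

Only the readings that cross a threshold produce an event here; the rest are monitored and then discarded, which is one simple way to limit noise.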

This is a difficult practice to get right in the real world. Finding the right balance between too little visibility and being swamped by too much information is commonly a mixture of high technology, art, gut feel, experimentation, learning, and luck.

How monitoring and event management work in the cloud

All leading public clouds have monitoring and event management capabilities built in, at varying levels of maturity. These cloud-native tools help you to monitor cloud services and manage their events out of the box.

The good news is that the practice burden should be reduced, because all of the stuff “in the cloud” is hidden from you and taken care of by the cloud service provider. All you need to worry about is what’s “on the cloud” – the cloud services and their consoles, dashboards, and feeds that are visible to you.

Here are some examples:

  • If you’re using lower-order cloud services (sometimes referred to as “infrastructure-as-a-service” or IaaS) like virtual machines, you’re effectively treating the cloud as a remote co-location site. Here you can monitor everything from inside the virtual machine upwards to the application, but nothing about the physical server it runs on or anything else physical “below” the virtual machine.
  • If you’re using higher-order services like Amazon Relational Database Service (RDS), then you have even less “plumbing” to worry about because AWS manages more for you. Here you don’t even need to monitor the virtual machines because AWS manages them for you. Instead, AWS exposes more of the database-related things that you care about, like query speeds and index performance.
  • Your cloud service provider will have a monitoring dashboard. These dashboards offer the usual charts, search, and other features, and they are often inexpensive for low use. But fees can add up if you’re capturing a lot of logging and generating a lot of events – it’s pay-for-use after all. The provider’s own tools are also, unsurprisingly, less feature-rich than high-cost specialist tools, but they’re usually good enough.
  • Your cloud provider will provide inexpensive tools to capture events. AWS CloudTrail is a good example of a way to log what happens across your cloud environment, and Amazon Simple Notification Service (SNS) and Amazon CloudWatch are good examples of cloud-based event management tools. Other leading clouds have comparable tools that are easy to find. You simply configure the CI/cloud service (e.g. a relational database) to send its events to SNS, and you tell SNS where to pass them on: email Ops, text the CEO, or notify a third party such as PagerDuty to wake someone up in the middle of the night. You can even get the cloud to run a function in response to an event to do some self-healing, which saves texting the CEO.
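The publish-and-subscribe pattern in the last example can be sketched as follows. This is an in-memory stand-in written for illustration, not the SNS API: the `Topic` class, the handler functions, and the event fields are all hypothetical names, and a real setup would use the provider's own topics, subscriptions, and functions. It shows one event fanning out to both a self-healing action and a human notification.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class Topic:
    """In-memory stand-in for a notification topic (hypothetical, not SNS)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, severity: str, handler: Handler) -> None:
        """Register a handler for events of the given severity."""
        self._subscribers[severity].append(handler)

    def publish(self, event: dict) -> int:
        """Deliver the event to matching handlers; return how many ran."""
        handlers = self._subscribers.get(event["severity"], [])
        for handler in handlers:
            handler(event)
        return len(handlers)


notifications: List[str] = []   # stands in for email, SMS, or a paging service
restarted: List[str] = []       # stands in for a self-healing function


def email_ops(event: dict) -> None:
    notifications.append(f"email ops: {event['source']} {event['detail']}")


def page_on_call(event: dict) -> None:
    notifications.append(f"page on-call: {event['source']} {event['detail']}")


def restart_service(event: dict) -> None:
    restarted.append(event["source"])  # try to self-heal before waking anyone


topic = Topic()
topic.subscribe("warning", email_ops)
topic.subscribe("error", restart_service)
topic.subscribe("error", page_on_call)

topic.publish({"source": "orders-db", "severity": "error",
               "detail": "storage almost full"})
```

Keeping the routing in subscriptions rather than in the service that emits the event means you can change who gets notified, or add automation, without touching the CI itself.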

Cloud dos and don’ts for monitoring and event management

In true cloud style, cloud service providers “democratize” the technology for professional monitoring and event management – with the cloud service provider tools good enough to get the job done in most cases.

Do:

  • Know what operational health means for your applications/business and only collect the data that tells you important things.
  • Accept that this is a learning process, especially if you’re new to cloud.
  • Use higher-order cloud services to reduce your monitoring and event management practice burden – let the cloud service provider do the heavy lifting and operations for you.
  • Continuously improve your signal-to-noise ratio. Focus on the events that you can act upon and prune out data and events that are noise.
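One rough way to work on the signal-to-noise ratio is to measure how often each type of event actually led to action. The sketch below assumes you can export a history of (event type, was it acted on) pairs; the function names, the sample history, and the 10% threshold are all hypothetical and would need tuning to your own environment.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def actionable_ratio(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per event type, the fraction of occurrences someone acted on."""
    total: Counter = Counter()
    acted: Counter = Counter()
    for event_type, was_acted_on in events:
        total[event_type] += 1
        if was_acted_on:
            acted[event_type] += 1
    return {event_type: acted[event_type] / total[event_type]
            for event_type in total}


def prune_candidates(ratios: Dict[str, float],
                     threshold: float = 0.1) -> List[str]:
    """Event types acted on less often than the threshold: likely noise."""
    return sorted(t for t, ratio in ratios.items() if ratio < threshold)


# Hypothetical history: (event type, did anyone act on it?)
history = (
    [("disk_full", True)] * 3
    + [("cpu_spike", False)] * 19
    + [("cpu_spike", True)]
    + [("login_ok", False)] * 50
)

ratios = actionable_ratio(history)
print(prune_candidates(ratios))
```

The output is a list of candidates to review, not an instruction to delete them: a rarely actioned event may still matter for audit or security purposes.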

Don’t:

  • Let your non-cloud monitoring and event management practice drive your cloud strategy. For example, don’t “do virtual machines in the cloud” because you know how to monitor them, and don’t avoid higher-order services because you don’t know enough about them (yet).
  • Immediately invest in expensive third-party monitoring systems. Try the built-in tools first; they are often good enough, and anything else is expensive, time-consuming, and a distraction from building your business.
  • Insist on handling events manually (by having people point eyeballs at screens). Use the inherent automation, queues, and notification systems of the cloud to make your services self-healing where possible.

The Operational Excellence Pillar of the AWS Well-Architected Framework has good advice on how best to approach the monitoring and event management practice. Please check that out too.