Jason Walker is Chief Customer Officer at BigPanda, helping companies improve IT Operations performance and efficiency using AIOps.

Alert quality is one of the most important aspects of the incident management lifecycle, yet it is one of the least acknowledged. It’s vital for organizations to focus on continually improving alert quality, so why does it fall so low on the priority list? Here we’ll dive into what alert quality is, why businesses should care about improving it and its overall role in IT operations. 

What is alert quality?

Alert quality is determined at the front end of your IT operations pipeline. For every alert, from all the different source systems that feed your operations pipeline, there is a minimum set of requirements it needs to meet to be deemed high quality. Those requirements are a combination of actionability, clarity and the presence of contextual information.

If an alert has strong actionability and clarity, people know that action is required and which actions to take. If it carries enough context, diagnosis takes much less time. An alert that ticks all those boxes is high quality and should be allowed into your operations pipeline.
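The three criteria above can be expressed as a simple quality check. This is a minimal sketch, assuming a dict-based alert payload; the field names (`action`, `runbook_url`, `summary`, `service`, `host`) are illustrative assumptions, not a standard schema.

```python
# Illustrative quality criteria for an incoming alert payload.
# Field names here are assumptions for the sketch, not a standard.
QUALITY_CHECKS = {
    # Actionable: the alert tells responders what to do.
    "actionable": lambda a: bool(a.get("action") or a.get("runbook_url")),
    # Clear: the alert has a human-readable summary.
    "clear": lambda a: bool(a.get("summary")),
    # Context: enough attributes to start diagnosis quickly.
    "context": lambda a: bool(a.get("service") and a.get("host")),
}

def assess_alert(alert: dict) -> dict:
    """Return which quality criteria an alert passes or fails."""
    return {name: check(alert) for name, check in QUALITY_CHECKS.items()}

def is_high_quality(alert: dict) -> bool:
    """An alert enters the pipeline only if it passes every criterion."""
    return all(assess_alert(alert).values())
```

A failing result from `assess_alert` also tells the alert's owner exactly which dimension (actionability, clarity or context) needs improvement.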

At the other end of the spectrum is low-quality noise: alerts that aren't actionable or don't carry enough attributes to be processed well by your event-processing systems. That noise needs to be identified and eliminated.

Why should we care about alert quality? 

Every alert that is converted into an incident consumes the IT operations team's time and effort. If they assign it out, someone else also spends time reviewing it and deciding what to do about it. For example, a low-quality alert that someone on the L1 team has reviewed and then escalated to the L2 layer takes still more time and effort to diagnose, becoming more expensive for the business each time it is passed around. These teams owe it to all the engineers in an organization to ensure that everything that comes out of their pipeline is high quality.

Low alert quality dramatically slows down your IT operations and the entire incident management lifecycle, no matter how great your tools and people are. The real danger here is that you’ll spend brainpower and organizational manpower going after low-quality alerts instead of the ones that truly signify something. If this keeps happening over time, your engineers start to lose trust in the IT operations team; trust is critical when you’re waking people up in the middle of the night to put out a fire.

Why do low-quality alerts exist?

There are a few main reasons for low-quality alerts.

1. Low-quality alerts are a partial hangover from ITIL and previous generations, where alerts were difficult to configure and, therefore, lower in volume. Organizations believed that if someone spent the time to create this alert, then it must mean something important. That mindset persists as a reluctance to ever ignore an alert. 

2. Modern monitoring and observability solutions make it incredibly easy to generate large volumes of alerts, and few controls are in place to prevent the accumulation of those over time into an unmanageable mass of mostly noise.  

3. IT feels a huge burden of responsibility to examine every single alert in case it becomes a precursor to a bigger failure. An alert might be ignorable, nonactionable noise 99% of the time, but in the 1% instances where it fires as an indication of a failure, IT operations teams are often blamed in the incident post-mortem for missing that indication.  

Why aren’t IT operations talking about this more?

Most of the time, IT operations is not the team configuring the alerts, so they have little control over the quality. There can also be organizational barriers, such as strained inter-team relationships or a lack of domain-specific engineering knowledge, that prevent the IT operations team from communicating effectively about alert quality.

Awareness of the problem is also an issue: IT operations teams don’t tend to question their pipelines or examine what they are converting into incidents; they lack the perspective required to examine and improve their inputs. On domain-specific engineering teams, few engineers conduct periodic audits of the alerts their services are firing, or are willing to turn off old alerts that someone else created.

How organizations should address alert quality

First, organizations need to put a technical gateway in place. The key is to identify a minimum set of required attributes that all alerts must have: a unique alert ID, an owner, the business service the alert relates to, a severity and some sort of required action, such as an escalation path or a link to a runbook.

These minimum attributes make alerts much easier to process because you can reject anything that doesn't meet them. Since an organization could have thousands of engineers potentially sending alerts, it's critical not only to standardize what comes in but also to relate events to each other based on the agreed-upon common set of attributes.
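Such a gateway can be sketched in a few lines. This is a minimal, hypothetical implementation assuming dict-based alerts; the attribute names mirror the list above but are illustrative, not a prescribed schema.

```python
# Hypothetical gateway enforcing the minimum attribute set described above.
# Attribute names follow the article's list but are assumptions for this sketch.
REQUIRED_ATTRIBUTES = ["alert_id", "owner", "service", "severity", "runbook_url"]

def validate_alert(alert: dict) -> list:
    """Return the list of required attributes the alert is missing or left empty."""
    return [attr for attr in REQUIRED_ATTRIBUTES if not alert.get(attr)]

def admit(alert: dict) -> bool:
    """Admit an alert into the pipeline only if it passes validation."""
    missing = validate_alert(alert)
    if missing:
        # In practice, the rejection would be reported back to the
        # alert's source system so its owner can fix the configuration.
        print(f"Rejected alert: missing {missing}")
        return False
    return True
```

Returning the specific missing attributes, rather than a bare pass/fail, gives alert owners an actionable reason for the rejection.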

IT operations teams also need a process in place that addresses alert quality. Over time, IT operations will handle hundreds or thousands of incidents a week, which yields a rich dataset that can be visualized. Alerts that fire thousands of times per day as pure noise can, and should, be blocked until improved. Identify the worst offenders in terms of alert quality and create a quarantine list of alerts that must be improved by their owners before they are allowed back into the pipeline.
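The worst-offender analysis above can be sketched from historical alert records. This is a minimal example under assumed inputs: a list of alert dicts keyed by `alert_id`, and a noise threshold that is illustrative rather than a recommended value.

```python
# Sketch of identifying the noisiest alert patterns from historical records
# and building a quarantine list. The threshold and field names are
# illustrative assumptions, not recommended values.
from collections import Counter

def worst_offenders(alerts, daily_threshold=1000, days=1):
    """Return alert IDs whose average daily firing rate exceeds the threshold."""
    counts = Counter(a["alert_id"] for a in alerts)
    return [alert_id for alert_id, n in counts.items()
            if n / days > daily_threshold]

def build_quarantine(alerts, resolved_as_noise):
    """Quarantine alerts that fire excessively AND were resolved as noise.

    resolved_as_noise: alert IDs whose incidents were closed without action,
    as recorded in the incident history.
    """
    return set(worst_offenders(alerts)) & set(resolved_as_noise)
```

Intersecting high-volume alerts with those historically resolved as noise avoids quarantining an alert that fires often but is genuinely actionable.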

In the end, the goal is to help IT operations teams become the guardians of alert quality and build a pipeline that uses monitoring effectively in order to identify actual incidents while also minimizing noise.  

Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.
