As time goes on, the number of alerts you have will grow, which can cause problems for your organization if the alerts aren't managed correctly. Your alerts also give you crucial information you can use to improve your system. If you don't take advantage of that information, you won't be using your alerts to their fullest potential.
Follow the process below to learn how to manage the quality of your alerts and prevent problems like alert fatigue, and how to use your alerts to gather data and drive positive impact for your organization.
Optimize your alerts
Reducing unnecessary alerts helps ensure the alerts you receive are the most relevant ones. We've created an Alert Quality Management (AQM) dashboard to make that easier. Essentially, you'll install a dashboard, gather information, then make changes based on the information you've gathered. We've outlined each step in this process to make it easier to get the results you want from your alerts.
To get started optimizing your alerts, you need to do the following:
Install the AQM dashboard
- Go to the Alert Quality Management instant observability page.
- Click on Install now.
- Choose an account to install the dashboard into.
- View your dashboard.
Analyze your KPIs
The dashboard will help you understand how you're doing using four KPIs (key performance indicators):
Incident count: alerts with a high number of incidents
Accumulated incident time: alerts with high cumulative durations
Mean time to close: the average amount of time it takes until incidents are closed
Percent under 5 minutes: the percentage of incidents open for less than 5 minutes
The Alerting Count by Policy pane in the dashboard helps you identify the alert policies that generate the most incidents and determine any relevant patterns.
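The dashboard calculates these KPIs for you, but working through the arithmetic once can make the numbers easier to interpret. The following is a minimal sketch, not the dashboard's actual implementation: the `Incident` record, the `kpis_by_policy` function, and the sample policy names and times are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; not a real export format."""
    policy: str
    opened_at: datetime
    closed_at: datetime


def kpis_by_policy(incidents):
    """Compute the four KPIs described above for each policy."""
    grouped = defaultdict(list)
    for incident in incidents:
        grouped[incident.policy].append(incident)

    result = {}
    for policy, items in grouped.items():
        # Duration of each incident in seconds.
        durations = [
            (i.closed_at - i.opened_at).total_seconds() for i in items
        ]
        total_seconds = sum(durations)
        short_lived = sum(1 for d in durations if d < 5 * 60)
        result[policy] = {
            "incident_count": len(items),
            "accumulated_minutes": total_seconds / 60,
            "mean_minutes_to_close": total_seconds / len(items) / 60,
            "percent_under_5_minutes": 100 * short_lived / len(items),
        }
    return result


# Made-up sample data: three short "High CPU" incidents, one long "Disk full".
start = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident("High CPU", start, start + timedelta(minutes=2)),
    Incident("High CPU", start + timedelta(hours=1),
             start + timedelta(hours=1, minutes=3)),
    Incident("High CPU", start + timedelta(hours=2),
             start + timedelta(hours=2, minutes=10)),
    Incident("Disk full", start, start + timedelta(minutes=60)),
]

kpis = kpis_by_policy(incidents)
for policy, values in kpis.items():
    print(policy, values)
```

In this made-up data, the "High CPU" policy has more incidents, and most of them close in under 5 minutes, which is the kind of pattern that suggests a policy may be noisy rather than actionable.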
Establish your baselines
The AQM dashboard gives you a baseline of KPIs that you can use to begin the improvement process. You (and anyone on your team) can review the most active policies from the previous step to reduce alert noise. Ask yourself questions about what the data is telling you and how you can fix the problems it reveals, such as:
Are the alerts telling us something about a resource that needs to be fixed? If so, then fix the problem and see if the alert volume decreases.
Are the alerts telling us about something that actually requires an immediate response? If not, then adjust or disable the policy.
Are the policy thresholds set properly? If not, then consider adjusting the thresholds.
After establishing your baselines, handle incident alerts using the following guidelines:
- If you look at an alert and decide to take any sort of further investigative action, acknowledge the alert.
- If you typically close an alert without doing anything else, don't acknowledge the alert.
- If the incident alert is always on, don't close or acknowledge it.
Gather your data
It takes some time to accumulate your alert data from the dashboard. You should wait at least two weeks to gather this data, but check regularly to ensure that the incident responders for your alerts are following the guidelines outlined in the previous step.
Check your data against your baselines
After two weeks, you should have enough data to analyze and begin your alert improvement process. To improve your system using the alert data, follow the steps below:
- Analyze the week-over-week trends in your KPIs. Identify the areas that may need fixing so you can begin finding ways to improve them.
- Use the data to map the current quality of your alerts. You can identify areas where improvement has positively impacted the business and areas where problems have resulted in negative outcomes.
- Use the dashboard to identify the noisiest incident policies.
- Review the policies identified in the previous step. For each policy, try to determine whether the alert is relevant, whether it's properly configured, and what it tells you about problems that you may need to address.
- Identify what areas you can work on to improve the policies you reviewed. This should be a technical analysis, and should end with recommendations on how to fix problems in your system that trigger the alert, how to tune policies that need improvement, or how to fix any gaps in your instrumentation.
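The trend analysis and the ranking of noisy policies in the steps above can be sketched as follows. This is an illustration under stated assumptions, not part of the dashboard: the `week_over_week` and `noisiest_policies` helpers, the KPI key names, and all of the numbers are hypothetical.

```python
def week_over_week(previous, current):
    """Return the percent change of each KPI from one week to the next.

    A change is reported as None when the previous value is missing or
    zero, because a percent change is undefined in that case.
    """
    changes = {}
    for name, current_value in current.items():
        previous_value = previous.get(name)
        if not previous_value:
            changes[name] = None
        else:
            changes[name] = (
                100 * (current_value - previous_value) / previous_value
            )
    return changes


def noisiest_policies(incident_counts, top_n=3):
    """Rank policies by incident count, highest first."""
    ranked = sorted(
        incident_counts.items(), key=lambda item: item[1], reverse=True
    )
    return ranked[:top_n]


# Made-up KPI values for two consecutive weeks.
week_1 = {
    "incident_count": 40,
    "accumulated_minutes": 600,
    "mean_minutes_to_close": 15.0,
    "percent_under_5_minutes": 50.0,
}
week_2 = {
    "incident_count": 30,
    "accumulated_minutes": 450,
    "mean_minutes_to_close": 15.0,
    "percent_under_5_minutes": 40.0,
}
changes = week_over_week(week_1, week_2)
print(changes)

# Made-up incident counts per policy for the same period.
counts = {"High CPU": 18, "Disk full": 4, "Slow response": 8}
top_two = noisiest_policies(counts, top_n=2)
print(top_two)
```

A falling incident count alongside a falling percentage of short-lived incidents, as in this sample, would suggest that noise is decreasing. The policies at the top of the ranking are the ones to review first.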
After completing the procedure above, you're well on your way to using your alerts to improve your system and provide a positive impact to your organization. This is only the beginning, though: there are many more possibilities for using alerts than what we've covered here. For more detailed information on alert quality and KPIs, see our Alert quality management docs.