This is a guest article by Dan Holloran from VictorOps – an on-call alerting and incident response tool recently acquired by Splunk. They are experts in incident management.
In software development and IT operations, we tend to focus a lot of our time on the delivery and deployment pipeline. But, what happens after you deploy new services? How are you responding to incidents in production and identifying reliability concerns? Effective monitoring and alerting will help you understand your applications and infrastructure – leading to better software. Then, alongside automation, monitoring and alerting practices can lead to highly-efficient, data-driven workflows.
In this post, we’ll dive into some ways you can leverage automation in IT monitoring and alerting to build more reliable software faster. Automation can be used to improve the efficiency of people, processes and technology across the entire software delivery lifecycle (SDLC). DevOps-oriented teams will use automation alongside effective monitoring and alerting to increase workflow transparency and improve collaboration from concept to deployment.
Let’s take a peek at the high-level goals for automation in DevOps and IT, and how these goals apply to automation in monitoring and alerting.
Automation in DevOps and IT operations
DevOps and IT teams need to focus on deploying new, reliable services to production and ensuring positive customer experiences. In DevOps-centric cultures, automation is essential for streamlining workflows and allowing developers and operations teams to spend more time driving value. Taking manual tasks out of software delivery and incident management will help teams move faster and focus on what really matters.
A common misconception with automation in DevOps is that you’re working to take humans out of the overall process. What you’re really doing is using automation to make humans’ lives easier – and improve the collaboration between people, processes and technology. This way, intelligent developers and operations people can focus on building the future of the product, not responding to problems in production.
Automation throughout software delivery and incident response will lead to more reliable services at deployment – as well as a team that’s more prepared for problems when they occur. Everything from testing to release schedules to configuration management can be automated, creating less variance in the deployment pipeline – leading to more reliable software. Continuous improvement of automation across software development, release management, and monitoring and alerting builds a cohesive system of DevOps – driving faster development without hindering service reliability.
DevOps monitoring and alerting best practices
With or without automation, DevOps and IT teams need a holistic monitoring and alerting strategy. In DevOps, monitoring and alerting can improve visibility into system health while limiting alert fatigue for on-call responders. While DevOps tools can help with improving visibility and collaboration, they’re not the ultimate solution. Over time, you can improve processes to ensure on-call coverage without over-alerting.
Let’s look at the DevOps monitoring and alerting best practices in order to build a process optimized for automation and real-time incident response.
1) Track key internal and external metrics
DevOps teams need to keep a pulse on both internal metrics (e.g. throughput, success, error rates) and external metrics (e.g. latency, saturation, traffic, errors). A combination of internal and external metrics shows the internal health of technical systems as well as the effects caused by outside factors. Over time, you can track the overall availability and uptime of systems – identifying application and infrastructure pain points to be addressed.
2) Classify alert severity and define service importance
How do you know which alerts mean more than others? Do you know which services or features are more important to availability than others? In order to see the full benefits of monitoring and alerting, you need to build a system for prioritizing services and alerts. Which alerts are the most severe? How do you measure what constitutes a severe alert versus a low-priority alert? Your monitoring strategy should help you identify and respond to the biggest problems first.
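One way to make that prioritization concrete is a small classification rule that combines service importance with alert attributes. The tier names, thresholds and severity labels below are assumptions for illustration only:

```python
# Assumed service-importance map; real teams would define their own tiers.
SERVICE_TIER = {"checkout": "critical", "search": "important", "batch-report": "low"}

def classify(service, error_rate, customer_facing):
    """Return an alert severity based on service tier and customer impact."""
    tier = SERVICE_TIER.get(service, "low")
    if tier == "critical" and (customer_facing or error_rate > 0.05):
        return "sev1"   # page on-call immediately
    if tier != "low" and error_rate > 0.01:
        return "sev2"   # notify, but no middle-of-the-night page
    return "sev3"       # log for later review

print(classify("checkout", 0.10, True))       # critical, customer-facing -> sev1
print(classify("batch-report", 0.20, False))  # low-priority service -> sev3
```

The point is that severity comes out of an explicit, reviewable rule rather than each responder’s gut feel.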
3) Create visibility into system health
Monitoring and alerting should be more than simply notifying on-call users when something is wrong. Creating dashboards and charts can offer transparency into service health across multiple teams and business units. At any given time, on-call responders have access to the metrics they need in order to see exactly what’s wrong and take action to remediate an incident. Also, internal teams can share information quickly across business units and inform external stakeholders with the right information.
4) Leverage self-healing
Over time, you can begin to see services that are likely to self-heal and organize your alerts based on this information. For example, if a service experiences ETL lag from time to time for 10 minutes or so, but that lag never affects customer experiences, then you likely don’t need to fire off an immediate alert to on-call responders. If the issue is likely to self-correct, set a threshold in your monitoring stack to notify an on-call user only if the issue persists past 10 minutes. This can help you reduce alert fatigue while ensuring coverage for the service in question.
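The persistence check described above can be sketched as a small stateful monitor: the alert fires only if the condition stays true past a grace period, and the clock resets whenever the service self-heals. The class and the 10-minute window mirror the ETL-lag example; everything else is an assumption.

```python
from datetime import datetime, timedelta

GRACE = timedelta(minutes=10)  # matches the 10-minute example above

class LagMonitor:
    def __init__(self):
        self.lag_started = None

    def observe(self, lagging, now):
        """Return True only when lag has persisted past the grace period."""
        if not lagging:
            self.lag_started = None   # condition self-healed; reset the clock
            return False
        if self.lag_started is None:
            self.lag_started = now    # lag just began; start the clock
        return now - self.lag_started > GRACE

m = LagMonitor()
t0 = datetime(2019, 1, 1, 12, 0)
print(m.observe(True, t0))                          # lag begins: no alert yet
print(m.observe(True, t0 + timedelta(minutes=5)))   # within grace period: no alert
print(m.observe(True, t0 + timedelta(minutes=11)))  # persisted past 10 min: alert
```

Most monitoring stacks offer this natively (e.g. a "for" duration on an alert rule), so you would configure it rather than write it – but this is the behavior you’re configuring.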
5) Identify common problem areas in your application and infrastructure
Don’t just set it and forget it with your monitoring tools. Constantly analyze the way you’re monitoring application and infrastructure health and make adjustments to improve. Over time, you can see which services are sending numerous alerts to on-call teams and you can make changes to the system itself as well as the way you handle incident response. As you start to take action on problem areas in your architecture, you create more robust services and make both customers and employees happier.
6) Don’t skip out on post-incident reviews
In association with identifying problem areas, you need to always analyze incidents after-the-fact. Post-incident reviews should serve as a dedicated time for analyzing the people, processes and technology behind any major incidents. Without a thorough understanding of how both systems and people respond under pressure, you’re only looking at one side of the equation. Post-incident analysis can help you understand exactly how to improve monitoring and alerting in order to proactively drive deeper system reliability.
7) Build out an incident response plan
Then, prepare yourselves for incident response. How can you surface applicable monitoring data quickly when alerting on-call responders? What information is necessary and what information adds noise? Constant tweaking and optimization of monitoring and alerting tools can help you serve the right information to the right people. Then, as long as the people are prepared to respond to incidents and remediate problems, DevOps and IT teams have everything they need to make on-call suck less.
Automation in incident management
On-call rotations and schedules
Associated with monitoring and alerting practices, on-call rotations and schedules should be set up in a single place. You can set up on-call schedules for users and leverage automation to ensure alerts are tied to specific rotations. Also, on-call schedules will rarely remain static. So, you should be able to easily switch on-call shifts with other teammates and automatically ensure all systems are covered – even in case of emergencies or unplanned absences. Keeping up with on-call calendars and schedules in a spreadsheet simply doesn’t cut it in a world of DevOps and CI/CD.
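A weekly rotation with one-off overrides for shift swaps can be modeled in a few lines. The names, start date and override here are placeholders; real tools maintain this state for you, but the logic is the same.

```python
from datetime import date

# Assumed weekly rotation and a one-off shift swap for a planned absence.
ROTATION = ["alice", "bob", "carol"]
ROTATION_START = date(2019, 1, 7)  # a Monday; rotation advances weekly
OVERRIDES = {date(2019, 1, 15): "carol"}

def on_call(day):
    """Return who is on call for a given day, honoring shift overrides."""
    if day in OVERRIDES:
        return OVERRIDES[day]
    weeks = (day - ROTATION_START).days // 7
    return ROTATION[weeks % len(ROTATION)]

print(on_call(date(2019, 1, 8)))   # week 0 -> alice
print(on_call(date(2019, 1, 15))) # override -> carol covers the swap
print(on_call(date(2019, 1, 16))) # week 1 -> bob
```

Keeping overrides in the same system as the rotation is what lets swaps happen without coverage gaps.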
Automation can be built into escalation policies based on specific alert rules and services. If a user doesn’t respond to an incident in a certain timeframe, it can be escalated automatically to the next user. You can use automation in escalation policies to ensure notifications are sent to the right people in a timely manner and to ensure manager visibility into on-call operations. Whether the escalation is time-based or service-based, automation can reduce the amount of time users need to identify who else needs to be involved in incident response.
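A time-based escalation chain can be sketched as a simple lookup: with no acknowledgment, each elapsed timeout advances to the next responder. The chain, role names and 5-minute timeout are assumptions for illustration.

```python
# Assumed escalation chain and ack timeout; real tools make both configurable.
ESCALATION_POLICY = ["primary-on-call", "secondary-on-call", "team-manager"]
ACK_TIMEOUT_MIN = 5

def who_to_notify(minutes_since_alert, acked):
    """Walk the escalation chain based on time elapsed without an ack."""
    if acked:
        return None  # someone has taken ownership; stop escalating
    step = min(minutes_since_alert // ACK_TIMEOUT_MIN, len(ESCALATION_POLICY) - 1)
    return ESCALATION_POLICY[step]

print(who_to_notify(0, acked=False))   # primary-on-call
print(who_to_notify(7, acked=False))   # secondary-on-call
print(who_to_notify(12, acked=False))  # team-manager
```

Note the chain clamps at the last entry: past that point you’re out of automation and into a human conversation.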
Automated alert routing and on-call notifications
Related to escalation policies and on-call rotations, automated alert routing can get notifications to the right people at the right time. Automation allows teams to skip past some sort of “incident navigator” and send alerts directly to the person or team who can fix the problem. Then, if they need to involve others, they can easily reroute or escalate issues via additional alert automation. Based on keywords in alert payloads or specific thresholds, you can determine who should get which alerts – constantly improving the way you notify on-call responders to problems in your systems.
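Keyword-based routing like this boils down to a first-match rule table. The keywords, team names and fallback below are placeholders, not any tool’s built-in configuration:

```python
# Assumed routing rules: first keyword found in the payload wins.
ROUTES = [
    ("database", "db-team"),
    ("payment", "payments-team"),
    ("latency", "platform-team"),
]

def route(alert_payload, default="ops-team"):
    """Send the alert to the first team whose keyword appears in the payload."""
    text = alert_payload.lower()
    for keyword, team in ROUTES:
        if keyword in text:
            return team
    return default  # unmatched alerts still get a human owner

print(route("CRITICAL: payment gateway timeout"))  # payments-team
print(route("disk usage at 91%"))                  # ops-team (fallback)
```

The fallback route matters: an alert that matches no rule should still land somewhere visible, not disappear.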
Appending runbooks and monitoring context to alerts
Automation in incident management is often used to improve alert routing. But, what happens once the alert’s received? By attaching runbooks and other wiki pages with remediation instructions, alongside applicable monitoring data such as logs, metrics, traces and charts, you provide useful context to on-call responders. Not only do they know what’s wrong but they know how they can start responding to the issue.
Context allows DevOps and IT teams to scale on-call operations and help users who’ve never interacted with a service or a system to fix problems without escalation. It reduces reliance on historical knowledge and eases the burden of on-call responsibilities put on longer-tenured engineers. Holistic monitoring will tell a complete story, not just a single chapter.
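The enrichment step described above can be sketched as a function that decorates a raw alert with a runbook link and a dashboard link before it reaches a responder. All the URLs, service names and dictionary keys here are hypothetical:

```python
# Assumed mapping from service name to its remediation runbook.
RUNBOOKS = {"etl-pipeline": "https://wiki.example.com/runbooks/etl-lag"}

def enrich(alert):
    """Attach remediation docs and monitoring context to a raw alert."""
    service = alert["service"]
    alert["runbook"] = RUNBOOKS.get(
        service, "https://wiki.example.com/runbooks/default")  # assumed fallback page
    alert["dashboard"] = f"https://dashboards.example.com/{service}"  # assumed URL scheme
    return alert

page = enrich({"service": "etl-pipeline", "message": "ETL lag > 10m"})
print(page["runbook"])
```

A responder who has never touched the ETL pipeline now gets the what, the where, and the how-to-fix in a single notification.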
Building better software with monitoring and alerting automation
Don’t simply notify engineers when monitoring metrics pass certain thresholds; think about why. Is there a way you can improve this experience to improve the lives of customers and employees? Automation in monitoring and alerting allows you to build better software without sacrificing the wellbeing of your team. Automatically notifying on-call users and escalating issues based on intelligent rules can lead to an on-call experience that doesn’t suck – helping you promote a culture of DevOps focused on shared accountability and collaboration.
Learn about building a humane on-call experience in the free VictorOps webinar, How to Make On-Call Suck Less. And, try out the Raygun and VictorOps integration to make the most of your Raygun metrics with collaborative incident response and alert automation.
About the author Dan Holloran (@DanHolloran) is an Associate Content Marketing Manager at VictorOps – an on-call alerting and incident response tool recently acquired by Splunk. Dan is the managing editor of the VictorOps blog and spends much of his time researching and writing about topics in IT, DevOps, SRE and incident management.