If you were pulled into a meeting right now and asked to give your thoughts on how to achieve better outcomes with monitoring and observability, what would you recommend? Would you default to suggesting that your team improve Mean Time To Detect (MTTD)? Sure, you might make some improvements in that area, but it turns out that most of the opportunities lie in what comes after your system detects an issue. Let’s examine how to measure improvements in monitoring and observability.
Imagine it’s the last Friday afternoon of Q2 — you’ve just finished a review cycle when you realize that your current MTTD is sitting at five minutes. You stay at a Holiday Inn Express for the weekend and miraculously devise a way to reduce it by 20%. In this situation, you did knock one minute off of your MTTD. But if it’s taking your SREs an hour or two to resolve issues after they’re detected, it’s probably time to focus your efforts elsewhere.
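To put that trade-off in plain numbers, here’s a quick back-of-the-envelope calculation. The MTTD figures come from the scenario above; the resolution time is an assumption standing in for the “hour or two” your SREs spend after detection.

```python
# Back-of-the-envelope math for the scenario above. The MTTD figures come
# from the example; the resolution time is an assumed ~90 minutes.
mttd_minutes = 5
mttd_reduction = 0.20          # the 20% improvement devised over the weekend
resolution_minutes = 90        # assumed time from detection to resolution

minutes_saved = mttd_minutes * mttd_reduction
total_detect_to_resolve = mttd_minutes + resolution_minutes

print(f"Minutes saved per incident: {minutes_saved:.0f}")
print(f"Share of the detect-to-resolve timeline: "
      f"{minutes_saved / total_detect_to_resolve:.1%}")
# One minute saved out of ~95 -- about 1% of the timeline. The bigger lever
# is everything that happens after detection.
```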
Most large companies stop at detection and making sure they receive alerts, both of which are undoubtedly necessary. But a constant barrage of alerts can erode their value instead of adding to it. Receiving alerts is essential, but deciding whether they are meaningful or actionable is just as important. Do you understand what the alert is telling you? Does it even need your attention?
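To make “meaningful and actionable” a little more concrete, here’s a minimal sketch of a triage rule. The Alert fields and the criteria are hypothetical stand-ins rather than the schema of any particular alerting tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alert:
    name: str
    severity: str                 # e.g. "info", "warning", "critical"
    runbook_url: Optional[str]    # link to remediation steps, if one exists
    affects_slo: bool             # does this symptom threaten an SLO?

def is_actionable(alert: Alert) -> bool:
    """One possible triage rule: an alert deserves a human's attention only
    if it threatens an SLO and there is a documented way to act on it."""
    return alert.affects_slo and alert.runbook_url is not None

alerts = [
    Alert("disk-usage-80-percent", "warning", None, affects_slo=False),
    Alert("checkout-error-rate", "critical",
          "https://runbooks.example.com/checkout-errors", affects_slo=True),
]

for alert in alerts:
    action = "page someone" if is_actionable(alert) else "log it and review later"
    print(f"{alert.name}: {action}")
```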
DevOps engineers need to get knowledgeable about monitoring and observability and embrace using them as tools that go beyond detection. After you receive an alert and decide it’s both meaningful and actionable, you have to see if there’s anything else going on. A single symptom doesn’t always lead to an effective diagnosis, so getting a view of the bigger picture is essential. Then you have to figure out how to use your tools to dig in and determine the root cause of the problem, so you can get your service back up and running reliably, even if you don’t have a permanent fix at the moment.
One way to get engineers on board with this change is to focus on reducing toil and getting them back to development. Shifting incentives is as much a mindset game as it is a game of finding the right tools, and the process has to be handled as if you were steering a cruise ship: you can’t stop or reverse direction immediately, so small changes implemented periodically are the name of the game. Processes can’t automate themselves, but humans with proper training, the right tools, and the right incentives can.
Tools must be selected carefully because ease of adoption is probably the most critical factor in motivating engineers to use them. Not only that, but once they are baked into your processes, the chances of getting rid of any of these tools are slim, whether they’re widely deployed or not.
Once you decide to shift focus and your team is on board, you’ll have to figure out a way to measure improvements, which can be hard to quantify. The focus is usually on error budgets, SLOs, metrics like MTTD, and the number of incidents, but you can’t forget about the humans in this equation. You want to find out whether the quality of their work has improved. Are they able to determine if alerts are meaningful and actionable? Has their confidence grown in understanding what’s happening within the system, and how well can they investigate and remediate alerts?
Most important, how are they engaging in the learning culture around monitoring and observability? Success here involves engineers taking what they’re learning from these incidents and architecting it into new things they’re building. There aren’t many quantitative metrics to measure here, but once your organization has reached the point where this is more the rule than the exception, you can be confident that you’re on the right track.
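If you do want numbers to sit alongside those qualitative signals, a minimal sketch like the one below is enough to pull MTTD and post-detection resolution time out of your incident history. The incident records and field names here are hypothetical, not tied to any particular incident-management tool.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the issue began, when monitoring
# detected it, and when service was restored.
incidents = [
    {"started": datetime(2024, 6, 3, 14, 0),
     "detected": datetime(2024, 6, 3, 14, 6),
     "resolved": datetime(2024, 6, 3, 15, 40)},
    {"started": datetime(2024, 6, 11, 9, 30),
     "detected": datetime(2024, 6, 11, 9, 34),
     "resolved": datetime(2024, 6, 11, 10, 15)},
]

def minutes(delta):
    return delta.total_seconds() / 60

mttd = mean(minutes(i["detected"] - i["started"]) for i in incidents)
time_to_resolve = mean(minutes(i["resolved"] - i["detected"]) for i in incidents)

print(f"MTTD: {mttd:.1f} min")
print(f"Mean time from detection to resolution: {time_to_resolve:.1f} min")
# Numbers like these only tell half the story -- pair them with the
# questions above about confidence, actionability, and learning.
```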
You’re missing out on a huge opportunity if you aren’t empowering your DevOps teams and SREs to do things beyond monitoring and reducing MTTD. Your engineers will be happier at work if they have the right tools and can spend less time on incident bridges and other things they’d prefer to stay as far away from as possible. Focus on long-term outcomes and the efficiency and effectiveness of your system, and don’t get too distracted by cost reduction.
Cribl Stream can help with all of this and give you control over and insight into the data you collect. Use it to route data to multiple places to break down data silos and increase tool choice and flexibility. You can also stash low-value data in cheap storage and replay it later through your data pipelines to the destination of your choice. It helps with cost control, too: removing unneeded fields and dropping low-value events reduces storage costs. Learn about the other benefits of Stream by exploring any of the courses in our sandbox.
The fastest way to get started with Cribl Stream and Cribl Edge is to try the Free Cloud Sandboxes.
Experience a full version of Cribl Stream and Cribl Edge in the cloud with pre-made sources and destinations.