March 8, 2022
SREs and Devs are used to solving problems even when an awkward or inefficient way is the only way.
In AppScope 1.0, SREs and Devs have a new alternative to standard methods, one that the AppScope team thinks will make that problem-solving a lot more fun.
We on the AppScope team constantly hear firsthand about life in the SRE trenches. For this blog, we “interview” a fictional SRE/Dev whose thoughts and comments are a mash-up of things we’ve heard from real people we know.
If you find that you have something in common with our composite SRE/Dev hero, we invite you to add AppScope to your toolkit, and to see how it improves your problem-solving experience. This is one of a series of blogs in which we introduce AppScope 1.0. The AppScope Origin Story begins the series, which will go on to feature Infosec, DevSecOps, and ITOps stories.
My focus is on the correctness and performance of some applications in an enterprise environment: specifically, the apps that support our HR organization. When I’m building a new application, I want to run it, and I want to get some visibility into what the hell is going on while it’s running in production, in staging, or in any other environment. Taking care of this is a planned, ongoing operation.
In my organization, we are innovators, or at least pretty close to the bleeding edge. Our applications are built within modern microservices. Most or all requests come in via HTTP—not in old-school RPC-based style.
We have someplace where the observability data is going, Datadog or Splunk or Elastic. That’s where the data is landing. We also have Cribl Stream, to provide some control in the middle, to route the data, to obfuscate and to do all that other kind of stuff.
Besides correctness and performance, I sometimes need to troubleshoot, to get to the root of a performance issue or a security concern that I have in the longer term.
First of all, I can’t get application-specific visibility. I’ve got nine damn agents running on these things right here. And I still don’t know what’s slow, or why it’s slow. And I don’t want to add another agent. I end up with a bunch of these things. And they’re all disparate, nothing talks to each other.
But I still don’t have complete data. Like, oh, sh*t, I don’t know details about this application. I know that these apps use AWS services, but I don’t know what kind of volume is expected, what throughput is required, what errors occur in the normal course of operation, and of those errors, which to worry about.
I ask my team: Hey, do you have anything on the box that monitors the CPU utilization of anything that’s running there? They’re like, Nope.
And I don’t have uniform, truly comparable data across multiple applications. I’ve got a part of this app running Node, and part of this service running in Go, and another part of it is running on a web server that was written in C. This is apples and oranges, not uniform data. What I see from one piece is not going to look like what I see from the other piece.
I may not even always have correct data. Sometimes an application might log something like CPU usage inaccurately. That could lead me to think there is a problem when there isn’t, or think the cause of an actual problem is something that’s actually irrelevant. There’s no way to check for this kind of bad information.
And one more thing about agents: configuring and managing them is not the most productive use of time. You ask yourself, Where are the logs for this thing? And there’s always the risk that when you add an agent, you’ve also added a log file that you’re not monitoring, and then there’s another gap in the data available to you.
It is frustrating.
I’m in Splunk for a while looking at this stuff, then I’m in Datadog in a browser, and then I’m switching into, I don’t know what-all else I’ve got, I’ve probably got three or four of these things up. And then management gets all these groups together, and now there are 20 people on a Zoom call. And we’re all trying to talk to each other about what’s really happening and what’s not, and the network guys are looking at something and saying, Yup, I see this and that ain’t right, and the database guys, we’re looking at something else going, Nope, that ain’t right. And we’re all sitting there pointing fingers and arguing …
Overall, I will use AppScope to instrument my applications and get metrics, logs, and even application-specific metrics out of them in a seamless or low-config manner. This also frees me from the worry I’ve had about unmonitored agent log files causing gaps in my data.
That’s the planned, ongoing aspect of where AppScope fits in for me. And then, AppScope is also great for diving in and investigating further as needed. That’s the ad hoc aspect.
If you use containerized workloads like I do, with AppScope you just add a couple of environment variables, then load up libscope.so or the whole thing in the container. Once you configure it that way, data goes off to the Cribl backend, and from there to Stream and on its merry way.
If you are in Kubernetes or even the traditional enterprise-stuff-on-bare-metal thing, configuring AppScope is similarly straightforward. See Editor’s Note below for details.
First and foremost, I can use AppScope to get some visibility into the logs of the application I’m focusing on. The second piece is, I get some visibility into the metrics—some of what’s called the “golden signals” in the SRE world, around, How many requests per second are coming in? How many of those are erroring out? What’s the duration—How long is that taking?
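To make the three golden signals mentioned here concrete, the following is a minimal sketch of how request rate, error rate, and duration could be summarized from a batch of HTTP events. The `HttpEvent` record and its field names are hypothetical and are not AppScope's actual event schema; treating only 5xx responses as errors and using nearest-rank percentiles are likewise assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class HttpEvent:
    """Hypothetical, simplified record of one HTTP request (not AppScope's schema)."""
    timestamp: float    # seconds, when the response completed
    status: int         # HTTP status code
    duration_ms: float  # time from request to response, in milliseconds


def golden_signals(events, window_seconds):
    """Summarize request rate, error rate, and latency for one time window."""
    if not events or window_seconds <= 0:
        return {"requests_per_second": 0.0, "error_rate": 0.0,
                "p50_ms": None, "p95_ms": None}

    durations = sorted(e.duration_ms for e in events)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for e in events if e.status >= 500)

    def percentile(p):
        # Nearest-rank percentile over the sorted durations.
        rank = max(1, math.ceil(p / 100 * len(durations)))
        return durations[rank - 1]

    return {
        "requests_per_second": len(events) / window_seconds,
        "error_rate": errors / len(events),
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
    }


if __name__ == "__main__":
    sample = [
        HttpEvent(0.0, 200, 10.0),
        HttpEvent(1.0, 200, 20.0),
        HttpEvent(2.0, 500, 30.0),
        HttpEvent(3.0, 200, 40.0),
    ]
    print(golden_signals(sample, window_seconds=4))
```

In practice this kind of aggregation would more likely happen in the observability backend; the sketch only shows what the three questions (how many, how many failing, how long) reduce to.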
Problem number one is ensuring that I have the relevant data, and making sure I have what I can believe is the complete data set that I can then go and analyze. So I start from the very top: if one of my golden signals is messed up, AppScope shows me that.
And then on top of that, I also get visibility in case I need to dive deeper into questions like: How is this application interacting with the other resources, wherever it’s running? How much memory is it consuming, how much I/O, how much of any of the resources provided by whatever it runs on, right?
So a user comes in and says, This is slow—I look at the app, and realize, yes, it’s slow. But with AppScope I can see that it’s slow because an external service that the app relies on is slow. It’s not the app itself that’s slow, it’s one of the external services. We aren’t getting responses back like we normally expect.
With AppScope, I can say, Here are my applications interacting with the world. And initially I can monitor a few key things, and it’s like, Hey, is the application performing as it needs to perform? If it is performing as it needs to perform, cool, I don’t have to go any deeper, right? I’m all set.
Now, when sh*t goes wrong, and I need to dig in further, with AppScope I have the data to go in and investigate. Is this a file system or an I/O problem? Is this a networking problem? Is the system bumping up against the limit of some memory and therefore doing more garbage collection?
And I can then correlate things like when garbage collection kicks off. I can tell when it kicks off because I can see how much memory the application uses initially, and then after a little bit, it uses less memory. And if, during this period of time, the response times started spiking, or there were more errors … I would know that memory stuff is likely a cause of that, right?
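The correlation described here can be sketched roughly as follows: treat a sharp drop in memory use as a stand-in for a garbage collection, then compare response times near those drops with response times elsewhere. This is an illustration under stated assumptions, not how AppScope itself reports anything; the function names, the 20% drop threshold, and the 5-second window are all hypothetical, and a memory drop is only a proxy for garbage collection.

```python
def find_memory_drops(samples, min_drop_fraction=0.2):
    """Return timestamps where memory fell sharply versus the previous sample.

    samples: list of (timestamp_seconds, memory_bytes), ordered by time.
    A drop of at least min_drop_fraction is treated as a likely collection
    (an assumption; other events can also free memory).
    """
    drops = []
    for (_, mem_prev), (t_cur, mem_cur) in zip(samples, samples[1:]):
        if mem_prev > 0 and (mem_prev - mem_cur) / mem_prev >= min_drop_fraction:
            drops.append(t_cur)
    return drops


def latency_near_drops(drops, latencies, window_seconds=5.0):
    """Return (mean latency near a drop, mean latency elsewhere).

    latencies: list of (timestamp_seconds, duration_ms).
    Either mean is None when no samples fall into that group.
    """
    near, elsewhere = [], []
    for t, ms in latencies:
        if any(abs(t - d) <= window_seconds for d in drops):
            near.append(ms)
        else:
            elsewhere.append(ms)

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(near), mean(elsewhere)


if __name__ == "__main__":
    memory = [(0, 100), (10, 150), (20, 200), (30, 120), (40, 130)]
    latency = [(5, 10.0), (15, 12.0), (28, 90.0), (31, 110.0), (45, 11.0)]
    drops = find_memory_drops(memory)
    print(drops, latency_near_drops(drops, latency))
```

A large gap between the two means is a hint, not proof, that memory pressure is involved; it tells you where to look next.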
AppScope also solves the problem I had where my data was not uniform, because part of my app is running Node, and part of a service is running in Go, and another part of it is running on a web server written in C. AppScope shows me the same data from all of those different sources. I can look at that same exact thing in a different app and have the same metrics, the same names, the same measurements, the same timeframes, all that stuff and be able to go, Oh, wait, why is that? It happened here, but it didn’t happen over there.
If you’re using the traditional enterprise-stuff-on-bare-metal thing, you have two options for configuring AppScope: you can change the command you use to start the app, or you can use an environment variable.
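The two options above can be sketched as a small helper that prepares either form of launch. This is a hedged illustration, not official guidance: it assumes the CLI is invoked as `scope run -- <command>` and that the environment variable in question is the standard Linux `LD_PRELOAD` pointing at libscope.so. The helper name `scoped_launch` and the library path are hypothetical; check the AppScope CLI Reference for the exact forms.

```python
import os


def scoped_launch(command, library_path=None, use_cli=False, base_env=None):
    """Return (argv, env) for starting `command` with AppScope attached.

    command:      the app's normal start command, as a list of strings
    library_path: location of libscope.so (hypothetical path, supplied by you)
    use_cli:      True to change the start command, False to use the env var
    base_env:     environment to start from (defaults to the current one)
    """
    env = dict(os.environ if base_env is None else base_env)

    if use_cli:
        # Option 1: change the start command by putting the CLI in front.
        # Assumes the `scope run -- <command>` form.
        return ["scope", "run", "--", *command], env

    if library_path is None:
        raise ValueError("library_path is required when use_cli is False")

    # Option 2: leave the command alone and preload the library instead.
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = f"{library_path}:{existing}" if existing else library_path
    return list(command), env


if __name__ == "__main__":
    argv, env = scoped_launch(
        ["node", "server.js"],
        library_path="/opt/appscope/libscope.so",  # hypothetical install path
    )
    print(argv, env["LD_PRELOAD"])
```

The returned pair can be handed to `subprocess.run(argv, env=env)`. Keeping the construction separate from the launch makes it easy to inspect or log exactly what will be run before running it.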
If you’re in Kubernetes, you can use the Admission Webhook in conjunction with the scope k8s command. See the AppScope CLI Reference.