Observability 101

What is Observability?

Observability is a growing practice in the world of software development and operations. Observability gives you the opportunity to learn about the important aspects of your environment without knowing in advance the questions you need to ask. Put more simply, it seeks to answer how much you can understand about a system by looking at it from the outside.

For example, how much can you tell from outside the aircraft about whether a jet engine is working? You can probably tell if it’s running because it’s making noise and vibrating — and maybe that it’s running properly if there’s not a lot of violent shaking inside the cabin, if the plane flies, and so on. But to learn more intricate details, you need sensors — different ones that measure things such as how much air is being pulled through the engine, how hot the engine is, the consistency of the RPM, and how efficiently the plane is consuming fuel.

Observability is like an analog dial; the amount of observability you employ can fall anywhere along a wide spectrum, from none at all to finely tuned to every measurable aspect of your environment. You can have low or high levels of observability, but what matters is whether you have the level that your business needs.

How Do You Determine Your Needs With Observability?

If all your business cares about is whether the engine (from the preceding section) is working to spec, then that’s all the observability you need. But what if you also wanted to know how the passengers in the plane are doing?

For that, you’d need an entirely different set of sensors: air pressure and quality, temperature, and maybe even the output from onboard surveys. Or maybe you’d want to know how many times people complained to flight attendants.

Similarly, if your business depends on how happy your customers are, you need more and different kinds of sensors to determine this value. You need to generate, collect, and analyze the data from these sensors, as well as store it somewhere.

For example, break down your sales targets a bit to see where observability can help. Assume you’re a clothing retailer. Sales are impacted by the demand for your clothes, your prices, store locations, and your online presence. The obvious place for observability to play a part in sales is web sales. Ask yourself the following questions:

  • How quickly does our website load?
  • Are there differences in site performance for mobile or desktop browsers?
  • Can we guarantee the safety of sensitive customer information (don’t you think twice about buying from a company that just announced hackers compromised its customer data)?

If you look a little closer, observability can also address in-person sales:

  • Is our inventory system giving fast, reliable answers to buyers so shelves are stocked with the most popular styles and brands?
  • Does our loyalty application run so slowly in the store that customers forgo signing up for a discount, leading us to lose a new advertising channel?

To optimize observability, you must seek to understand the myriad ways your IT systems impact the goals of the organization. Then you need to start compiling a list of questions about how your systems, applications, network, and so on are operating to have these impacts. After that, you translate those questions into things you can measure. Finally, you decide which measurements are acceptable. For example, you may want sub-second response times for queries about available inventory for a specific item.
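
As a rough illustration of turning a question ("are inventory queries fast enough?") into a measurement with an acceptable value, here is a minimal sketch. The function name `fraction_within_target`, the sample response times, and the one-second target are all hypothetical and exist only for this example.

```python
def fraction_within_target(response_times_s, target_s=1.0):
    """Return the share of requests that finished faster than the target."""
    if not response_times_s:
        raise ValueError("need at least one measurement")
    within = sum(1 for t in response_times_s if t < target_s)
    return within / len(response_times_s)


# Made-up response times, in seconds, for five inventory queries.
samples = [0.21, 0.35, 0.48, 1.20, 0.90]
ratio = fraction_within_target(samples)
print(f"{ratio:.0%} of queries were sub-second")
```

Whether 80 percent is good enough is a business decision, which is the point of the paragraph above: the measurement only becomes useful once you agree on the acceptable value.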

So, what kinds of sensors should you have in place to make the measurements that will help you understand how well things are running? You can collect this data in several ways.

  • Servers often take snapshots of their operations at regular intervals and write them into logs.
  • Logs can also be created by log scrapers or forwarders — pieces of software that look at your systems, take measurements, and write them into logs.
  • Agents are another type of software that sits on all the endpoints of your systems and collects metrics that explain what’s going on within your environment.

These measurements can be on time intervals or be related to interactions and transactions within your applications and systems. For example, when a sale is made, you want to understand how much time it took to complete, how quickly you return a confirmation message, and what percentage full the database is where you store information about the sale. It can also be when a database with customer information is queried — which servers made this request? Is this a trusted server or a potential threat?

Understanding the Environment You’re In with Observability

At a minimum, your enterprise’s business goals almost certainly include strong security, high service stability, and increased customer happiness. To understand how you’re meeting those goals, you must collect and analyze data correlated with your desired outcomes. But how do you do that, you ask?

How do you collect the data you need to get answers about your environment? Start by collecting and analyzing the three main components of observability: metrics, logs, and traces. To have a comprehensive view of how your environment is performing, most organizations collect data from hundreds to hundreds of thousands of sources. Once you have the data, you need to get it into the tools you use for analysis in the right format.

Observability: Asking Questions About Your Data

In order to ask questions of the data, it has to be structured in a way that the analytics tools your organization uses can understand. Unfortunately, many data sources have unique structures that aren’t easily readable by all analytics tools, and some data sources aren’t structured at all. Some of the tools your organization uses to review and analyze data may expect the data to have already been written to log files in a particular format, known as schema-on-write, and some tools involve an indexing step to process the data into the required format as it arrives, known as schema-on-read.

We’re still in the early days of observability maturity, but early stumbles point to where observability must go in the future.

Structuring Your Observability Data

Applications are typically instrumented by the developers who write them. The goals developers have for their instrumentation aren’t often the same as the goals of the operators who run their applications or the end-users who interact with them. Developers tend to instrument to identify problems with the code and not so much with the consumer experience in mind. The goals of these operators may change over time and well after the developer has moved on to other projects.

Finding Your Right Solution Using an Observability Pipeline

To take advantage of the wide variety of options possible for structure, content, routing, and storage of your enterprise data, an observability pipeline allows you to ingest data and get value from that data in any format, from any source, and then direct it to any destination — without breaking the bank. An observability pipeline can result in better performance and reduced infrastructure costs.

As your goals evolve, you have the freedom to make new choices, including new tools and destinations as well as new data formats. The right observability pipeline helps you get the data you want, in the formats you need, to wherever you want to go.

Data volumes are growing year over year, and at the same time companies are trying to analyze new sources of data to get a complete picture of their IT and security environments. They need the flexibility to get data into multiple tools from multiple sources but don’t want to add a lot of new infrastructure and agents. These companies need a better strategy for retaining data long-term that’s also cost-effective.

An observability pipeline helps you parse, restructure, and enrich data in flight, before you pay to store or analyze it. It gets the right data where you want it, in the formats you need. Observability pipelines unlock the value of observability data by giving you the power to make choices that best serve your business without the negative tradeoffs.

Determining Your Enterprise’s Business Goals

If observability is about seeking answers to questions about how well your IT environment is running, then you need to know how to measure what’s acceptable for meeting the goals of your business. Some of these goals revolve around the following:

  • The cost of storage and compute infrastructure required to run your environment
  • How secure your applications and data are from threats
  • How well your systems are performing
  • How stable your systems are

After you understand how your IT environment affects your business goals, you need to come up with a list of metrics that indicates success or failure as well as values for each of these metrics so you know when you’re succeeding or failing.

Choosing which metrics to measure and which outcomes are good or bad is a subset of the practice of observability. For a long time, IT departments have been held to Service Level Agreements (SLAs), which are essentially contracts that stipulate what level of service they provide. For example, an SLA may require your website to have 99.9 percent uptime, meaning scheduled and unplanned outages together can account for no more than 0.1 percent of the time. An evolution of the SLA is the Service Level Objective (SLO). SLOs are the goals the IT department maintains to deliver better service. SLOs are often more stringent than SLAs because they’re aspirational and help set the priorities of the organization, whereas SLAs are the bare minimum of what’s acceptable.
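
To make the uptime arithmetic concrete, the following sketch converts an availability target into a downtime budget. The function name `allowed_downtime_minutes`, the 30-day window, and the 99.95 percent SLO are assumptions chosen for illustration; only the 99.9 percent figure comes from the example above.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Return the downtime budget, in minutes, for an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


sla_budget = allowed_downtime_minutes(99.9)   # contractual minimum from the text
slo_budget = allowed_downtime_minutes(99.95)  # hypothetical stricter internal goal

print(f"SLA budget: {sla_budget:.1f} minutes per 30 days")
print(f"SLO budget: {slo_budget:.1f} minutes per 30 days")
```

Under these assumptions, 99.9 percent uptime leaves roughly 43 minutes of total outage in a 30-day period, and the stricter SLO leaves about half of that.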

Gleaning Information from Your Valuable Data

After you establish your goals and you know what to measure — and what the most relevant data sources are — you need to figure out how to make sense of this data and turn it into insights. A few steps lie between getting your data and learning from it. Observability practitioners generally consider three main pillars of data as inputs for learning about an IT environment. These pillars are metrics, logs, and traces.


Metrics

Metrics are numeric representations of data measured over intervals of time. Metrics can harness the power of mathematical modeling and prediction to derive knowledge of the behavior of a system over intervals of time in the present and future.

For example, every time you go for a medical checkup, the nurse who takes you back to a room collects a set of metrics:

  • Height
  • Weight
  • Blood pressure
  • Temperature
  • Pulse

The nurse logs the time, as well as other “dimensions” such as your name, patient number, what doctor you’re seeing, and the reason for your checkup. Think of this collection of multiple metrics, with one set of dimensions, as a metric event.

In the digital world, dimensions may include an application name, a host name, a payment type, or whatever else was important to the developer who wrote the code and to the product manager. From computers and Internet of Things (IoT) devices, you may get metric events that include the following:

  • A timestamp
  • Metric values, such as:
    • Percentage of CPU utilization
    • Percentage of memory in use
    • Load average
    • CPU temperature
  • Dimensions, such as:
    • Hostname
    • Location
    • Department
    • Business function

Each of these events can be used to analyze and report on what’s measured.
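
One way to picture a metric event is as a small record holding a timestamp, several metric values, and one set of dimensions. The sketch below is only an illustration: the `MetricEvent` class, the `mean_metric` helper, and every field name and value are hypothetical rather than a format that any particular tool requires.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class MetricEvent:
    """One timestamped set of metric values sharing one set of dimensions."""
    timestamp: datetime
    values: Dict[str, float]
    dimensions: Dict[str, str] = field(default_factory=dict)


def mean_metric(events: List[MetricEvent], name: str) -> float:
    """Average one metric across the events that report it."""
    samples = [e.values[name] for e in events if name in e.values]
    if not samples:
        raise ValueError(f"no events report metric {name!r}")
    return sum(samples) / len(samples)


# Two made-up events from the same host, one minute apart.
events = [
    MetricEvent(
        timestamp=datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
        values={"cpu_percent": 40.0, "memory_percent": 63.0},
        dimensions={"hostname": "web-01", "location": "us-east"},
    ),
    MetricEvent(
        timestamp=datetime(2024, 1, 15, 12, 1, tzinfo=timezone.utc),
        values={"cpu_percent": 60.0, "memory_percent": 64.0},
        dimensions={"hostname": "web-01", "location": "us-east"},
    ),
]
avg_cpu = mean_metric(events, "cpu_percent")
print(f"average CPU: {avg_cpu}%")
```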


Logs

A log is a system-generated record of data created when an event (see the preceding section) has happened; it describes what’s going on during the event. A log message contains the log data: details about the event, such as the resource that was accessed, who accessed it, and the time. Each event in a system is going to have a different set of data in the message.

Think about a ship’s log from back in the old wooden sailing-ship days. Several times a day, the captain (or someone assigned to the task) noted standard things:

  • Date and time
  • Heading and speed
  • Latitude and longitude
  • Personal notes, such as:
    • “Today, we ran out of rum.”
    • “The cook burned his hand, and Dr. Smithson bandaged it up.”

Each time he wrote something down, he made a log entry, and the book he wrote in was the logbook. A captain’s log was the collection of all that captain’s logbooks.

In the digital world, logs refer to information being written by the operating system and by the applications and processes running on that system. As with the captain’s log example, each log entry usually includes a set of standard things to report, such as the following:

  • The date and time of the event (also known as the timestamp)
  • The name of the system logging the data
  • The severity of the event (critical, warning, and so on)
  • The application name

In reality, the log message format is generally a lot less wordy than this example. Log messages may be in key-value format, JSON format, CSV, or literally anything else. Translating one format to another, on the fly, can help get logs generated by one system into a completely different tool.
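
As a small illustration of translating one log format into another, the sketch below converts a key-value log line into JSON using only the Python standard library. The function name `kv_to_json` and the sample log line are invented for this example; real log sources will need their own parsing rules.

```python
import json
import shlex


def kv_to_json(line: str) -> str:
    """Convert a 'key=value key="quoted value"' log line into a JSON object."""
    fields = {}
    for token in shlex.split(line):  # shlex keeps quoted values together
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return json.dumps(fields, sort_keys=True)


# Hypothetical log line with the standard fields listed above.
line = 'time=2024-01-15T12:00:00Z host=web-01 severity=warning app=checkout msg="disk 91% full"'
print(kv_to_json(line))
```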

In the digital age, log entries are called log events. Log events of a particular type, or those from the same source, are written to a log file locally — or sent across the network to another system. There are different approaches one can use in the transmission of the log events, but generically you can refer to the whole process as “sending log events to a log server.” And, just as you had a captain’s log (things logged by the captain), the digital equivalent of this includes Windows security logs, web server logs, email logs, and so on.


Traces

A trace marks the path a transaction takes through an application on its way to completion. This may be a query of the database or execution of a purchase by a customer. Think of those Indiana Jones movies where they showed the red line traversing the globe to represent how he got from one adventure to the next. A single trace can provide visibility into both the route traveled as well as the structure of a request.

The path of a request lets software engineers and Site Reliability Engineers (SREs) see the route the request traveled, and the structure of a request helps you understand which services, other applications, databases, and other system elements are involved in the transaction or request.

In application monitoring, a trace represents all the things an application transaction spent its time on — all ordered over time from beginning to end. Ant farms are a good way to think about how traces work. They visually display the path an ant takes to do its work. Traces, similarly, show how different parts of an application perform their jobs. Traces also show where potential problems may occur. If you’re constantly seeing the biggest bottleneck at the point you query a database, you can see what else is happening at the same time:

  • Are there conflicts?
  • Are other applications trying to access the same data at the same time?
  • Is there an inefficiency in the code that makes this request?

Traces show an application developer how the application is doing its work so she can prioritize areas to be optimized. If something is broken in the overall application, traces can show what’s happening before and after the problem to pinpoint what needs to be fixed.
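
To show how a trace can point at a bottleneck, here is a minimal sketch that models a trace as an ordered list of timed steps (often called spans) and picks the slowest one. The `Span` class, the step names, and the timings are all hypothetical; real tracing systems record considerably more detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """One timed step inside a single transaction."""
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def slowest_span(spans: List[Span]) -> Span:
    """Return the step on which the transaction spent the most time."""
    return max(spans, key=lambda s: s.duration_ms)


# A made-up purchase transaction, ordered from beginning to end.
trace = [
    Span("validate_cart", 0, 15),
    Span("query_inventory_db", 15, 420),
    Span("charge_payment", 420, 510),
    Span("send_confirmation", 510, 540),
]
bottleneck = slowest_span(trace)
print(f"{bottleneck.name} took {bottleneck.duration_ms} ms")
```

In this invented trace the database query dominates, which is the cue to ask the questions listed above about conflicts, contention, and inefficient code.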

Delivering Data to Your Analytical Tools

The data you need to collect can come from some or all of the pillars of observability (see the earlier section “Gleaning Information from Your Valuable Data”), and believe me when I tell you that many strong opinions exist around which pillar is the most valuable, what kind of data to collect, and how to collect it. But in reality, it all depends entirely on your business and its goals.

Most organizations collect a wide array of data from multiple sources. After the data is collected, you have to get it to a tool to analyze it. How does this happen? Pipelines stream data from sources to their destinations. After collecting data, pipelines then stream it to an analytics tool in the format required by that tool. Whether you’re looking at metrics, logs, or traces, it’s likely that you’re generating far too many to make sense of it all without using an analytics tool.

Each tool your organization uses may have widely different formats for how data can be read and interpreted. Think of something as simple as a timestamp. Some tools may expect timestamps formatted as YYYY-MM-DD hh:mm:ss; others may include fractional seconds or the day of the week in an abbreviated format such as Thu. Different tools may also have different names for field values or expect log data to be in a specific order. Regardless of the formats your tools expect, you have to deliver the data to them in a way that they can use.
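
The timestamp problem can be sketched in a few lines: try each known input format and emit one consistent output format. The list `KNOWN_FORMATS` and the function `normalize_timestamp` are assumptions made for this example, and the choice of ISO 8601 as the output is arbitrary.

```python
from datetime import datetime

# Input formats this sketch knows about; a real pipeline would have many more.
KNOWN_FORMATS = [
    "%Y-%m-%d %H:%M:%S",     # YYYY-MM-DD hh:mm:ss
    "%Y-%m-%d %H:%M:%S.%f",  # with fractional seconds
    "%a %Y-%m-%d %H:%M:%S",  # with an abbreviated day of the week, such as Thu
]


def normalize_timestamp(raw: str) -> str:
    """Parse a timestamp in any known format and return it as ISO 8601."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized timestamp format: {raw!r}")


for raw in ("2024-01-18 09:30:00", "2024-01-18 09:30:00.250", "Thu 2024-01-18 09:30:00"):
    print(raw, "->", normalize_timestamp(raw))
```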

Data usually streams in real-time from your collectors to your analytical tools. Streams of data are produced in the form of events. As events occur, metrics, logs, and traces are generated and enter the stream. These events may be regularly scheduled, such as time intervals to check various measurements of your environment. They can also be related to something that happens in your environment — an application notices that a customer has made an order, or you get an unauthorized access request to your inventory database.

Regardless of why the event enters the stream, you need to decide how to get that data to the right tool to be analyzed. In simple environments, you can create unique pipelines for each pair of data sources and destinations. For most organizations, however, that approach will quickly become cumbersome because you have multiple tools analyzing overlapping pieces of the same data.

Why Use a Highly Flexible Observability Pipeline

To take advantage of the wide variety of options possible for structure, content, routing, and storage of your enterprise data, you need an observability pipeline that allows you to ingest and get value from data in any format, from any source, and then direct it to any destination — without breaking the bank. The right approach helps you find the balance between cost, performance, complexity, and comprehensiveness.

What’s crystal clear is that enterprises need the flexibility to make tradeoffs on a source-by-source basis. For each shape of data in each log file, the decision may be different. Making these types of data reshaping decisions often involves going back to the developers and asking them to log differently. For security professionals, your infrastructure software and cloud vendors dictate the formats of much of your data. Shuffling data off to cheap storage to have a cost-effective place to land full-fidelity data may be possible with your existing tooling, but replaying the data is an intensive manual effort of running scripts and workarounds. Even deciding at ingestion time where to send data is impossible in most pipelines, because most solutions’ ingestion pipelines are built only for that solution.

An observability pipeline can simplify these decisions as the data is moving. It can increase or decrease the number of fields based on simple, graphical no-code pipelines. It can also allow you to defer this decision-making until later by spooling data to cheap object storage, such as S3-compatible storage or data lakes, and replaying it later if you decide the data needs to be analyzed further or shaped differently.

Ten Reasons to Use an Observability Pipeline Like Stream

1. Route Your Data to Multiple Tools and Destinations

With an observability pipeline, you can take data from any source and route it to any tool. Put data where it has the most value. Route data to the best tool for the job — or all the tools for the job.
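
Conceptually, routing comes down to matching each event against a list of rules and sending it to every destination whose rule matches. The sketch below is a simplified illustration, not how any specific product implements routing; the destination names, the field names, and the `route` function are hypothetical.

```python
def route(event, routes):
    """Return the names of every destination whose rule matches the event."""
    return [destination for destination, matches in routes if matches(event)]


# Hypothetical rules: each pairs a destination name with a match function.
routes = [
    ("siem", lambda e: e.get("sourcetype") == "auth"),
    ("metrics_store", lambda e: "cpu_percent" in e),
    ("archive", lambda e: True),  # every event also gets a copy in cheap storage
]

auth_event = {"sourcetype": "auth", "user": "alice", "action": "login"}
metric_event = {"hostname": "web-01", "cpu_percent": 42.5}

print(route(auth_event, routes))
print(route(metric_event, routes))
```

Because one event can match several rules, the same data can reach the best tool for the job and a low-cost archive at the same time.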

2. Reduce Data With Little Analytical Value to Control Costs

An observability pipeline can help you reduce less-valuable data before you pay to analyze or store it. This process can help you dramatically slash costs, eliminate null fields, remove duplicate data, and drop fields you’ll never analyze. Using an observability pipeline means you keep all the data you need and only pay to analyze and store what’s important to you now.
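
The reduction steps named above (eliminating null fields, dropping fields you will never analyze, and removing duplicates) can be sketched as a single pass over a stream of events. Everything here is illustrative: the `reduce_events` function, the set of values treated as empty, and the sample events are assumptions.

```python
import json

EMPTY_VALUES = (None, "", "-")  # values this sketch treats as null


def reduce_events(events, drop_fields=()):
    """Yield slimmed-down events, skipping exact duplicates."""
    seen = set()
    for event in events:
        slim = {
            key: value
            for key, value in event.items()
            if value not in EMPTY_VALUES and key not in drop_fields
        }
        fingerprint = json.dumps(slim, sort_keys=True)
        if fingerprint in seen:
            continue  # same content already forwarded
        seen.add(fingerprint)
        yield slim


# Made-up web server events with null fields and a field nobody analyzes.
events = [
    {"host": "web-01", "status": 200, "referrer": None, "debug_id": "a1"},
    {"host": "web-01", "status": 200, "referrer": "", "debug_id": "b2"},
    {"host": "web-02", "status": 500, "referrer": "-", "debug_id": "c3"},
]
reduced = list(reduce_events(events, drop_fields={"debug_id"}))
print(reduced)
```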

3. Transform Data Into Any Format Without Adding New Agents

Take the data you have and format it for any destination, without having to add new agents. By transforming the data you already have, and sending it to the tools your teams use, you increase flexibility without incurring the cost and effort of recollecting and storing the same data multiple times in different formats.

Agents are software deployed on your infrastructure that writes out information about what’s going on in your servers, applications, network devices, and so on.

4. Retain More Data for Longer Periods by Routing a Copy of Your Data to Cost-Effective Storage

You never know when you may need a piece of data for later investigation, so you can hold on to your data longer by routing a copy to cost-effective storage. Send a copy of your data to cheap object storage such as data lakes, file systems, or infrequent-access cloud storage, and you’ll always have what you need — without paying to keep the data in your systems of analysis.

5. Replay: Collect Data From Cheap Storage and Replay It to an Analytics Tool Later as Needed

Usually, data is streamed to analytics tools and analyzed in real-time. Some data may be interesting to analyze at a later point in time to learn about trends or to investigate a security breach that may have happened in the past.

Collecting a subset of data from low-cost object storage and re-streaming it to any analytics tool as needed is a process that increases flexibility and comprehensive visibility while minimizing costs.
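
The idea of replay can be sketched as selecting archived events from a time window and sending them to a destination again. In this illustration a plain list stands in for object storage and a callback stands in for the analytics tool; the `replay` function, the `time` field, and the sample events are hypothetical.

```python
from datetime import datetime


def replay(archive, start, end, send):
    """Re-send archived events whose timestamp falls in [start, end)."""
    count = 0
    for event in archive:
        timestamp = datetime.fromisoformat(event["time"])
        if start <= timestamp < end:
            send(event)
            count += 1
    return count


# A list standing in for events previously written to low-cost storage.
archive = [
    {"time": "2024-01-15T11:59:00", "msg": "login ok"},
    {"time": "2024-01-15T12:05:00", "msg": "login failed"},
    {"time": "2024-01-15T12:40:00", "msg": "password reset"},
    {"time": "2024-01-15T13:10:00", "msg": "login ok"},
]

replayed = []  # stand-in for an analytics tool's input
count = replay(
    archive,
    start=datetime(2024, 1, 15, 12, 0),
    end=datetime(2024, 1, 15, 13, 0),
    send=replayed.append,
)
print(f"replayed {count} events")
```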

6. Enrich Data With Third-Party Sources Like Geo-IP and Known Threats Databases to Give Deeper Context

An observability pipeline adds context to your data for a more comprehensive analysis. Sometimes adding a small amount of data can unlock answers to critical questions. You can enrich your current data streams with key pieces of information to build a more comprehensive view.

Third-party sources, such as known threats databases, can enrich log data by alerting security systems to events that may require more scrutiny. They can also help you ignore log data from trusted domains. Similarly, another third-party source like a Geo-IP lookup table can provide geographical information for more context when performing investigations of past events.
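
Enrichment can be pictured as a lookup that adds fields to an event as it passes through. The sketch below uses two tiny in-memory tables in place of real third-party sources; the table contents, the country codes assigned to the reserved documentation IP ranges, and the `enrich` function are all invented for illustration.

```python
import ipaddress

# Stand-ins for third-party data. The networks are reserved documentation
# ranges, and the country codes attached to them are made up.
GEO_TABLE = {
    ipaddress.ip_network("203.0.113.0/24"): "AU",
    ipaddress.ip_network("198.51.100.0/24"): "US",
}
KNOWN_THREATS = {"203.0.113.66"}


def enrich(event):
    """Return a copy of the event with geo and threat context added."""
    enriched = dict(event)
    address = ipaddress.ip_address(event["src_ip"])
    enriched["country"] = next(
        (country for network, country in GEO_TABLE.items() if address in network),
        "unknown",
    )
    enriched["known_threat"] = event["src_ip"] in KNOWN_THREATS
    return enriched


print(enrich({"src_ip": "203.0.113.66", "action": "login"}))
print(enrich({"src_ip": "192.0.2.10", "action": "login"}))
```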

7. Mask Sensitive Data to Protect Your Customers and Limit Liability

An observability pipeline configures data streams for maximum protection. Redaction and masking keep sensitive data private. You want to protect your customers and limit liability, and it’s easy to do just that with an observability pipeline.

Redaction is the omission of sensitive data elements, such as credit card numbers, from the view of people, applications, or processes accessing the data. Data masking is the hiding or obfuscation of sensitive data so that it cannot be accessed by certain people, applications, or processes.
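
The difference between the two techniques can be sketched with a regular expression. This is a deliberately simple illustration: the pattern `CARD_PATTERN` only looks for 13 to 16 digits optionally separated by spaces or hyphens, and the function names and sample log line are hypothetical. Production detection of card numbers usually needs stricter validation.

```python
import re

# Simplistic pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def redact_card_numbers(text: str) -> str:
    """Redaction: omit the sensitive value entirely."""
    return CARD_PATTERN.sub("[REDACTED]", text)


def mask_card_numbers(text: str, keep_last: int = 4) -> str:
    """Masking: obfuscate the value but keep the last few digits visible."""
    def _mask(match):
        digits = re.sub(r"\D", "", match.group())
        return "*" * (len(digits) - keep_last) + digits[-keep_last:]
    return CARD_PATTERN.sub(_mask, text)


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
line = "card=4111 1111 1111 1111 amount=42.00"
print(redact_card_numbers(line))
print(mask_card_numbers(line))
```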

8. Manage Who Sees What With Role-Based Access Control

Role-based access control allows you to assign access policies — implementing restrictions or giving access — for teams and individuals. This security step gives your organization much more control over who can access particular types of data and what level of functionality they can use in your logging and pipeline tools.

Use role-based access control to manage your teams’ access to data, so they can only see what they need to perform their jobs. Instead of each team being responsible for getting its own data into the tools of its choice, an observability pipeline centralizes the management of data and helps ensure people can only access what they need to do their jobs.
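
One simple way to picture role-based access to data is a per-role list of fields that a team is allowed to see. The sketch below is an assumption-laden illustration: the roles, field names, and the `view_for_role` function are hypothetical, and real access control also covers product functionality, not just fields.

```python
# Hypothetical policy: which event fields each role may see.
ROLE_FIELDS = {
    "security": {"timestamp", "src_ip", "user", "action"},
    "marketing": {"timestamp", "action", "campaign"},
}


def view_for_role(event, role):
    """Return only the fields the given role is allowed to see."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {key: value for key, value in event.items() if key in allowed}


event = {
    "timestamp": "2024-01-15T12:00:00",
    "src_ip": "198.51.100.7",
    "user": "alice",
    "action": "purchase",
    "campaign": "spring_sale",
}
print(view_for_role(event, "security"))
print(view_for_role(event, "marketing"))
```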

9. Collect Data From REST APIs for More Comprehensive Analysis

Getting a full view of your environment often means analyzing data that comes from sources other than traditional event streams. An observability pipeline can help you easily collect data from Representational State Transfer Application Programming Interfaces (REST APIs) and other sources, in real time or for ad hoc batch analysis, and format this data for use by any analytics tool.

10. Better Understand Your Observability Data With a Robust and Intuitive Management Interface

Reduce management overhead with a robust and easy-to-use Graphical User Interface (GUI)-based configuration and testing interface. Capture live data and monitor your observability pipeline in real time. Gain visibility into your data.

Free Observability Training

The Stream Sandbox lets you experience a full version of Stream live right now with pre-made sources and destinations. The main course, Stream Fundamentals, will guide you interactively through the main features of Cribl Stream, and upon completion, you will earn a completion certificate.

Fundamentals uses the actual product – cloud-hosted, on-demand, and with its own event generator – no need to configure anything! Fundamentals walks through use cases like Routing and Data Reduction, and through important concepts like sources, destinations, routes, pipelines, and functions.

Next, move on to the Affordable Log Storage in S3 course. This 30-minute course walks through a simple use case of connecting a Splunk Universal Forwarder to S3. With this technique, Stream makes it practical to store much larger volumes of data in more affordable destinations like S3. This course covers concepts like Sources, Destinations, Partitioning Expressions, Routes, Parsing, and Lookups.

Try Stream – A Free Observability Tool

The fastest way to get started with Cribl Stream is to sign up at Cribl.Cloud. You can process up to 1 TB of throughput per day at no cost. Sign up and start using Stream within a few minutes.

Do a little math.
Impress your boss.

The Stream ROI Calculator gives you an easy way to calculate your annual savings with Stream. Be a hero; give it a try.