a9c

Context is King: Turning Ugly Logs into Rich Structured Events

Last edited: March 4, 2019

Logs themselves often do not contain the necessary information in themselves to point an investigator in the right direction. Let’s say I’m troubleshooting a performance issue with my application. I may want to dig through all kinds of data sources, like proxy logs, web access logs or custom instrumentation. But, in these data sources, proxy logs show IP Address, web access logs use hostname and my custom instrumentation contains the process name. How do I analyze data by what’s being reported, which is a particular application slowness?

Different investigations require the analyst to go digging through other systems to get information to key into the search bar of their log analytics tool. Mapping hosts to applications and writing long lists of ORs by hand. Trying to understand which class of customers is having a problem may be impossible. Knowing a security group accidentally left a port open is a far cry from knowing whether any traffic tried to hit that port.

a9c

What’s missing, is context. Meaningful analysis can only be conducted if the analyst is given the ability to follow different lines of inquiry. Understanding the health of an application or its security posture requires examining data from many sources, which need to be able to be correlated and grouped together by the analyst. Logs are missing this context, but the logs often contain the information required to look up this information.

Cribl allows you to enrich events at ingestion time, no matter their destination system. In the Splunk, ecosystem this is a fundamentally new capability. In the Elastic ecosystem, Logstash provides this capability, but it’s so hard to do that few do. For the first time, Cribl provides a UX which allows you to interactively reshape and enrich the data to give you high confidence the data will look the way you want it after it is run through Cribl.

enrich-demo-data-flow-1

In this post, we’ll walk you through an interactive demo showing how to take an ugly old Web Access Log, structure it, and turn it into a rich business event stream through a Cribl pipeline which:

  1. Parses the web access log

  2. Looks up the username in Redis based on the session ID in the query string

  3. Queries a REST API to lookup customer information from the username

If you prefer video, here’s the rest of the post in video form. Otherwise, keep reading.

Query Time Lookup

“Wait!” you say, Splunk has provided query time lookups for a long time. This is true, and Splunk’s lookup ability has taken it very far. Query time lookups have some advantages. They do not bloat the stored events, although that inefficiency can be optimized away with a smarter data store. You can update the values returned at any time, no reprocessing of data required.

However, there are downsides. Query time joins consume resources and require redoing the same work for every query. Query time joins can fail to work well at large scale, and they can require replicating around very large caches of state information to every node which services queries. Most importantly, this demo highlights why ingestion time enrichment is a critical capability. In this demo, we lookup a session ID in Redis to map from the session ID to the user who’s session it is. This is temporal operational state that’s maintained by the application. In order for query time joins to offer the same functionality, it is necessary to maintain a table of every session ID to every user for essentially all time … a massive challenge.

Get the Demo Environment

To prove out this use case, we’ve built out a demo environment which does lookups against running backend systems to be as realistic as possible. We’ve built these as custom functions in Cribl and they’re available in GitHub. You can run this demo easily using Docker by cloning the cribl-enrich-demo repo and running docker compose up -d. Open your web browser to http://localhost:9000/pipelines/parse-and-enrich and use username admin and password admin.

This environment is fully functional, but by default it’s setup to drop all data. This can easily be changed, and we’d encourage you to setup an output and send the data to one of our destinations, or even just to the filesystem to see how it works.

Parsing Weblogs

In this demo, we have a data generator container sending data to Cribl which in turn queries Redis and a custom REST API to access enrichment information. The data generator is generating 10 events per second of weblog entries, which in their raw form look like:

Code example
27.96.128.0 - - [18/Feb/2019:07:02:51 +0000] "GET /static/3700.jpg?&JSESSIONID=SD4528254410$SL6599261649FF1314707154ADFF1622727863 HTTP/1.1" 200 1827 "/category.screen?category_id=DISPLAYPROTECT" "Mozilla/5.0 (Linux; U; Android 2.3.3; en-us; Sprint APA9292KT Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 142

Log data has a ton of variety. Some logs are streams of human readable text which describe an event that happened. Some logs have a mix of performance information scattered amongst human readable text. Web Access Logs are of the variety that are largely purely structured information, like CSV or JSON, but serialized in a format, called Common Log Format, that at first glance looks like unstructured text. There are many other examples of these types of structured logs, like flow logs or firewall logs.

That raw log line has a ton of structure, but that structure is encapsulated in another structure which contains the metadata of the log event. In Cribl, we represent this as JSON, and it looks like:

Code example
{ "_raw":"125.17.14.100 - - [Jan 03 07:40:42.820480] \"GET /oldlink?item_id=EST-12&JSESSIONID=SD5785900364$SL1443342661FF9130604439ADFF1842236881 HTTP 1.1\" 200 970 \"/category.screen?category_id=BLUETOOTH\" \"Mozilla/5.0 (iPad; CPU OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9A405\" 167", "_time":"1546501240", "host":"web-01.bar.com", "index":"cribl", "source":"/opt/apache/log/access_combined.log", "sourcetype":"access_combined" }

What we want to do first, is lift up the structure from the weblog into the structure of the event. For this, we’re going to use Cribl’s Parser function, which ships with support for Apache Access Combined format. In your copy of Cribl, you’ll see the first two functions in our pipeline are Parser functions. Let’s look at the first in detail.

screenshot-2019-02-18-08

There’s a lot of options here, which may look intimidating, but to configure web log parsing from scratch, the following steps were all I needed:

  1. Click Add Function -> Standard -> Parser

  2. Enter a Description

  3. From Parser Library, select “Apache Combined Log Format”

  4. Since my log has one additional field, I added “timetaken” to the end of the List of Fields

Now, we’re going to use Cribl’s Preview functionality to see if Parsing is actually working the way we want. As the way we ship the demo, all functions except the first are disabled by default so you can turn them on one by one and see their effect. On the right hand side, we’re going to start a capture to grab some events running through our pipeline and see the effects of our Parser function. On the right hand side of the UI, click Preview > Start a Capture.

screenshot-2019-02-18-08

This will pop open a dialog to run a capture. There are a number of options in this dialog, but we don’t need to concern ourselves with any of them right now. Click “Start a capture”, which will run for a few seconds, and then show you the events it captured from the stream. Once you see events, click “Save”.

screenshot-2019-02-18-08

Now we have events in Cribl we can interactively work with. On the right hand side, you’ll see events as they come into the pipeline. They’ll have a simple structure, _raw, _time, and a few other pieces of metadata. On the top of the Preview sidebar, click “Out” to see what events are looking like after the pipeline has modified them. You should see a screen like this.

screenshot-2019-02-18-08

The UI will show you a diff of what’s changed or added to a particular event. We can see the parser has added a number of fields (highlighted in green) it’s extracted from the web log like status, timetaken, useragent, and details about the request. There’s a few additional things we want to do with this data. First, we want to get our Session ID out of the request_uri_query and into a field to use as the key for a lookup. Secondly, there are a number of numerical fields in the data which are currently strings, and we need to convert those to numbers so that the downstream systems can interpret the data correctly. Enable the next two functions in the pipeline, Parser & Numerify, and click “Save” on the left hand side. Your data should now look like this.

screenshot-2019-02-18-09

If you expand the second Parser function, you’ll see we’re running our Key=Value parser against the request_uri_queryfield which adds JSESSIONID to our event. Our bytes, timetaken and status are now properly numeric fields as well. Now, let’s enrich our events.

Enriching Events

enrich_enrich

In our hypothetical running application, there is a Redis server which contains mappings of session IDs to users to allow the web servers to be stateless and retrieve that information from Redis. In our log processing pipeline, we’re going to use Redis to do that same mapping to get from a session ID to a username. Once we have the username, we’re going to hit an internal REST API which returns customer information based on a given username.

There are a few things to consider before doing this in production. For every log line you’re going to be reaching out to Redis and hitting a REST API. Given each web session generates dozens to hundreds of web access log entries, you could easily DoS Redis or the API. In the demo functions I wrote, I do no caching, no retries, error handling, etc, so I wouldn’t necessarily call them production ready. That being said, they are a good base to build off of. But, I have personally implemented this exact strategy in production for business event type of data, using multiple summary indexing steps, back in my days as a Splunk customer, so it is a feasible approach and will work at scale with the right planning.

In the Cribl UI, enable the Redis Lookup function and hit save. You’ll notice in the preview, you should now see a userName field which is being retrieved from Redis. Let’s look at how the Redis Lookup function is configured.

screenshot-2019-02-18-10

This function really shows the power of Cribl’s expandability. You can look at the source code and configuration of the function if you like to see how it was implemented. conf.schema.json defines how we want to be able to configure our function. The UI for configuration is automatically generated from the schema. index.js implements the lookup. In this case, we’re connecting a host named redis which is also what our Redis docker container is named. We’re going to lookup the value of JSESSIONID in redis and store the results in userName.

Now that we have the username, we’re going to use that to look up customer information from our REST API. Enable the REST Lookup on the left and hit save. Now you will see we have a new field in our event which is a nested object, called customer.

screenshot-2019-02-18-10

That’s a huge amount of context! We can now see where our customers are located, gender, and age. With this context, we can now group our access log information by a much richer set of information. To see this work in action, I’m going to load this data into an analytics tool. One of my favorites for looking at performance information is Honeycomb.

Sending and Analyzing Rich Events

In our hypothetical environment, my friend Michael Wilde called me to say my service was running slow. I haven’t heard about this from anywhere else, and my dashboards are all green. I already have a free Honeycomb account, and I want to send my enriched logs into my enrich_demo dataset. In Cribl, I go to Destinations > Honeycomb > Add New. I enter my dataset and team name. Also, under advanced, I select “honeycomb-shape” for my conditioning pipeline. This pipeline will flatten the customer object to work better in Honeycomb and drop the customer and _raw fields. After saving, the UI prompts me to make it my default destination, and I select OK to start sending the data out to Honeycomb.

I give it about an hour to get some data going in Honeycomb. To start with, I add P95(timetaken)and HEATMAP(timetaken) calculations to get a look at my performance data.

screenshot-2019-02-18-20

The P95 looks normal, but the heatmap shows a few outliers. I want to get some ideas of what dimensions might be correlated to the slow performance, so I use their new BubbleUp feature to get a view on a those high timetaken entries split by a number dimensions.

screenshot-2019-02-18-20

Looks like all of them are in South Lake Tahoe. At their helpful suggestion I break down by customerCity, customerState, and userName. Username is a very high cardinality field, one of Honeycomb’s differentiators. Honeycomb brings me back data sorted by P95(timetaken) in descending order, and sure enough right there in the table is my friend Michael Wilde. I hear the weather up in Tahoe is terrible the last few days, maybe they’re all stuck indoors watching Netflix and chewing up all the bandwidth?

screenshot-2019-02-18-20

Wrapping Up

Logs are hugely valuable data source, but they’re often in need of additional context that can be found elsewhere in other data sources. Cribl allows you to bring structure to your unstructured logs and use that information to build rich structured events. This context can be hugely valuable for troubleshooting, make it easy to look across your data in ways you hadn’t originally anticipated. There are a number of other data sources we could use enrichment to allow us to look at the data differently.

  • Analyze AWS logs by security group

  • DNS Logs by threat level

  • Analyze web logs by customer type

  • Access logs for a department

In all of those examples, we need to join our original log to some other data source to be able to split the data by these fields. Once enriched, these logs enable fundamental new use cases which have been difficult if not impossible previously.

We’d love to hear from you on your ideas! Join us in Slack #cribl, tweet at us @cribl_io, or contact us via hello@cribl.io.

Cribl, the Data Engine for IT and Security, empowers organizations to transform their data strategy. Customers use Cribl’s suite of products to collect, process, route, and analyze all IT and security data, delivering the flexibility, choice, and control required to adapt to their ever-changing needs.

We offer free training, certifications, and a free tier across our products. Our community Slack features Cribl engineers, partners, and customers who can answer your questions as you get started and continue to build and evolve. We also offer a variety of hands-on Sandboxes for those interested in how companies globally leverage our products for their data challenges.

More from the blog

get started

Choose how to get started

See

Cribl

See demos by use case, by yourself or with one of our team.

Try

Cribl

Get hands-on with a Sandbox or guided Cloud Trial.

Free

Cribl

Process up to 1TB/day, no license required.