March 4, 2019
Logs themselves often do not contain the necessary information in themselves to point an investigator in the right direction. Let’s say I’m troubleshooting a performance issue with my application. I may want to dig through all kinds of data sources, like proxy logs, web access logs or custom instrumentation. But, in these data sources, proxy logs show IP Address, web access logs use hostname and my custom instrumentation contains the process name. How do I analyze data by what’s being reported, which is a particular application slowness?
Different investigations require the analyst to go digging through other systems to get information to key into the search bar of their log analytics tool. Mapping hosts to applications and writing long lists of ORs by hand. Trying to understand which class of customers is having a problem may be impossible. Knowing a security group accidentally left a port open is a far cry from knowing whether any traffic tried to hit that port.
What’s missing, is context. Meaningful analysis can only be conducted if the analyst is given the ability to follow different lines of inquiry. Understanding the health of an application or its security posture requires examining data from many sources, which need to be able to be correlated and grouped together by the analyst. Logs are missing this context, but the logs often contain the information required to look up this information.
Cribl allows you to enrich events at ingestion time, no matter their destination system. In the Splunk, ecosystem this is a fundamentally new capability. In the Elastic ecosystem, Logstash provides this capability, but it’s so hard to do that few do. For the first time, Cribl provides a UX which allows you to interactively reshape and enrich the data to give you high confidence the data will look the way you want it after it is run through Cribl.
In this post, we’ll walk you through an interactive demo showing how to take an ugly old Web Access Log, structure it, and turn it into a rich business event stream through a Cribl pipeline which:
If you prefer video, here’s the rest of the post in video form. Otherwise, keep reading.
“Wait!” you say, Splunk has provided query time lookups for a long time. This is true, and Splunk’s lookup ability has taken it very far. Query time lookups have some advantages. They do not bloat the stored events, although that inefficiency can be optimized away with a smarter data store. You can update the values returned at any time, no reprocessing of data required.
However, there are downsides. Query time joins consume resources and require redoing the same work for every query. Query time joins can fail to work well at large scale, and they can require replicating around very large caches of state information to every node which services queries. Most importantly, this demo highlights why ingestion time enrichment is a critical capability. In this demo, we lookup a session ID in Redis to map from the session ID to the user who’s session it is. This is temporal operational state that’s maintained by the application. In order for query time joins to offer the same functionality, it is necessary to maintain a table of every session ID to every user for essentially all time … a massive challenge.
To prove out this use case, we’ve built out a demo environment which does lookups against running backend systems to be as realistic as possible. We’ve built these as custom functions in Cribl and they’re available in GitHub. You can run this demo easily using Docker by cloning the
cribl-enrich-demo repo and running
docker compose up -d. Open your web browser to http://localhost:9000/pipelines/parse-and-enrich and use username
admin and password
This environment is fully functional, but by default it’s setup to drop all data. This can easily be changed, and we’d encourage you to setup an output and send the data to one of our destinations, or even just to the filesystem to see how it works.
In this demo, we have a data generator container sending data to Cribl which in turn queries Redis and a custom REST API to access enrichment information. The data generator is generating 10 events per second of weblog entries, which in their raw form look like:
184.108.40.206 - - [18/Feb/2019:07:02:51 +0000] "GET /static/3700.jpg?&JSESSIONID=SD4528254410$SL6599261649FF1314707154ADFF1622727863 HTTP/1.1" 200 1827 "/category.screen?category_id=DISPLAYPROTECT" "Mozilla/5.0 (Linux; U; Android 2.3.3; en-us; Sprint APA9292KT Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 142
Log data has a ton of variety. Some logs are streams of human readable text which describe an event that happened. Some logs have a mix of performance information scattered amongst human readable text. Web Access Logs are of the variety that are largely purely structured information, like CSV or JSON, but serialized in a format, called Common Log Format, that at first glance looks like unstructured text. There are many other examples of these types of structured logs, like flow logs or firewall logs.
That raw log line has a ton of structure, but that structure is encapsulated in another structure which contains the metadata of the log event. In Cribl, we represent this as JSON, and it looks like:
"_raw":"220.127.116.11 - - [Jan 03 07:40:42.820480] \"GET /oldlink?item_id=EST-12&JSESSIONID=SD5785900364$SL1443342661FF9130604439ADFF1842236881 HTTP 1.1\" 200 970 \"/category.screen?category_id=BLUETOOTH\" \"Mozilla/5.0 (iPad; CPU OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9A405\" 167",
What we want to do first, is lift up the structure from the weblog into the structure of the event. For this, we’re going to use Cribl’s Parser function, which ships with support for Apache Access Combined format. In your copy of Cribl, you’ll see the first two functions in our pipeline are Parser functions. Let’s look at the first in detail.
There’s a lot of options here, which may look intimidating, but to configure web log parsing from scratch, the following steps were all I needed:
Now, we’re going to use Cribl’s Preview functionality to see if Parsing is actually working the way we want. As the way we ship the demo, all functions except the first are disabled by default so you can turn them on one by one and see their effect. On the right hand side, we’re going to start a capture to grab some events running through our pipeline and see the effects of our Parser function. On the right hand side of the UI, click Preview > Start a Capture.
This will pop open a dialog to run a capture. There are a number of options in this dialog, but we don’t need to concern ourselves with any of them right now. Click “Start a capture”, which will run for a few seconds, and then show you the events it captured from the stream. Once you see events, click “Save”.
Now we have events in Cribl we can interactively work with. On the right hand side, you’ll see events as they come into the pipeline. They’ll have a simple structure,
_time, and a few other pieces of metadata. On the top of the Preview sidebar, click “Out” to see what events are looking like after the pipeline has modified them. You should see a screen like this.
The UI will show you a diff of what’s changed or added to a particular event. We can see the parser has added a number of fields (highlighted in green) it’s extracted from the web log like
useragent, and details about the request. There’s a few additional things we want to do with this data. First, we want to get our Session ID out of the
request_uri_query and into a field to use as the key for a lookup. Secondly, there are a number of numerical fields in the data which are currently strings, and we need to convert those to numbers so that the downstream systems can interpret the data correctly. Enable the next two functions in the pipeline, Parser & Numerify, and click “Save” on the left hand side. Your data should now look like this.
If you expand the second Parser function, you’ll see we’re running our Key=Value parser against the
request_uri_queryfield which adds
JSESSIONID to our event. Our
status are now properly numeric fields as well. Now, let’s enrich our events.
In our hypothetical running application, there is a Redis server which contains mappings of session IDs to users to allow the web servers to be stateless and retrieve that information from Redis. In our log processing pipeline, we’re going to use Redis to do that same mapping to get from a session ID to a username. Once we have the username, we’re going to hit an internal REST API which returns customer information based on a given username.
There are a few things to consider before doing this in production. For every log line you’re going to be reaching out to Redis and hitting a REST API. Given each web session generates dozens to hundreds of web access log entries, you could easily DoS Redis or the API. In the demo functions I wrote, I do no caching, no retries, error handling, etc, so I wouldn’t necessarily call them production ready. That being said, they are a good base to build off of. But, I have personally implemented this exact strategy in production for business event type of data, using multiple summary indexing steps, back in my days as a Splunk customer, so it is a feasible approach and will work at scale with the right planning.
In the Cribl UI, enable the Redis Lookup function and hit save. You’ll notice in the preview, you should now see a
userName field which is being retrieved from Redis. Let’s look at how the Redis Lookup function is configured.
This function really shows the power of Cribl’s expandability. You can look at the source code and configuration of the function if you like to see how it was implemented.
conf.schema.json defines how we want to be able to configure our function. The UI for configuration is automatically generated from the schema.
index.js implements the lookup. In this case, we’re connecting a host named
redis which is also what our Redis docker container is named. We’re going to lookup the value of
JSESSIONID in redis and store the results in
Now that we have the username, we’re going to use that to look up customer information from our REST API. Enable the REST Lookup on the left and hit save. Now you will see we have a new field in our event which is a nested object, called
That’s a huge amount of context! We can now see where our customers are located, gender, and age. With this context, we can now group our access log information by a much richer set of information. To see this work in action, I’m going to load this data into an analytics tool. One of my favorites for looking at performance information is Honeycomb.
In our hypothetical environment, my friend Michael Wilde called me to say my service was running slow. I haven’t heard about this from anywhere else, and my dashboards are all green. I already have a free Honeycomb account, and I want to send my enriched logs into my
enrich_demo dataset. In Cribl, I go to Destinations > Honeycomb > Add New. I enter my dataset and team name. Also, under advanced, I select “honeycomb-shape” for my conditioning pipeline. This pipeline will flatten the customer object to work better in Honeycomb and drop the
_raw fields. After saving, the UI prompts me to make it my default destination, and I select OK to start sending the data out to Honeycomb.
I give it about an hour to get some data going in Honeycomb. To start with, I
HEATMAP(timetaken) calculations to get a look at my performance data.
The P95 looks normal, but the heatmap shows a few outliers. I want to get some ideas of what dimensions might be correlated to the slow performance, so I use their new BubbleUp feature to get a view on a those high
timetaken entries split by a number dimensions.
Looks like all of them are in South Lake Tahoe. At their helpful suggestion I break down by
userName. Username is a very high cardinality field, one of Honeycomb’s differentiators. Honeycomb brings me back data sorted by
P95(timetaken) in descending order, and sure enough right there in the table is my friend Michael Wilde. I hear the weather up in Tahoe is terrible the last few days, maybe they’re all stuck indoors watching Netflix and chewing up all the bandwidth?
Logs are hugely valuable data source, but they’re often in need of additional context that can be found elsewhere in other data sources. Cribl allows you to bring structure to your unstructured logs and use that information to build rich structured events. This context can be hugely valuable for troubleshooting, make it easy to look across your data in ways you hadn’t originally anticipated. There are a number of other data sources we could use enrichment to allow us to look at the data differently.
In all of those examples, we need to join our original log to some other data source to be able to split the data by these fields. Once enriched, these logs enable fundamental new use cases which have been difficult if not impossible previously.