…Like a Multi-Tool For Your Observability Pipeline

Written by Steve Litras

April 14, 2020

 

In my last post, I focused on a specific use case for routing observability data: separating retention from analysis.  That’s just one of the many tools that become available to you by inserting a routing mechanism into your observability pipeline, and in this post, I’m going to take a look at a number of other capabilities that processing log data “in the stream” can provide.

Supporting Multiple Analysis Tools

The IT team uses one tool for log analysis and another one for metrics. The security team uses yet another tool for Security Information and Event Management (SIEM), and the development teams have additional tooling for product logs, errors and metrics. Unfortunately, each of these tools has its own mechanism for ingesting data, and they’re isolated from each other, so multiple “agents” end up installed on systems just to feed the tools, each with its own overhead.

Imagine reducing that agent count down to 1 or 2, and being able to feed *all* of the tools from a single pipeline, transforming data as appropriate for each destination. The same pipeline can now feed the network device data to the developers’ tooling, allowing the developers to correlate app errors with the server’s switch port flapping, leading to quick resolution. And where two tools were once reporting different values for a specific metric, they now show the same value, simply because each tool is getting the same data.
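
As a rough illustration of that fan-out, here is a minimal Python sketch. The destination names, the filters, and the two output formats are all hypothetical, chosen only to show one event being delivered to several tools in different shapes; this is not how any particular product implements routing.

```python
import json


def to_json_line(event):
    """Serialize the event as a single JSON line."""
    return json.dumps(event, sort_keys=True)


def to_key_value(event):
    """Serialize the event as space-separated key=value pairs."""
    return " ".join(f"{key}={event[key]}" for key in sorted(event))


# Hypothetical destinations: each pairs a filter with the format that tool expects.
ROUTES = [
    {"name": "siem", "accepts": lambda e: e["source"] in {"firewall", "auth"}, "format": to_json_line},
    {"name": "dev_tooling", "accepts": lambda e: e["source"] in {"app", "network"}, "format": to_key_value},
    {"name": "it_logs", "accepts": lambda e: True, "format": to_json_line},
]


def route(event, routes=ROUTES):
    """Return {destination name: formatted payload} for every route that accepts the event."""
    return {r["name"]: r["format"](event) for r in routes if r["accepts"](event)}


event = {"source": "network", "device": "sw-core-01", "message": "port flap"}
deliveries = route(event)
for destination, payload in deliveries.items():
    print(destination, "<-", payload)
```

Because every destination is fed from the same event, any two tools that both receive it are working from identical data.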

Data Enrichment

Context is King. A lot of the data we get, in the form of logs, is barely useful without context. Take the port flap mentioned above. A flapping port doesn’t matter unless it’s connected to something important, and the log entry for that port flap is not going to tell you what it’s connected to. What if you could add data from your CMDB to that line, like the server that port’s connected to, the application that it runs, and the business process that it supports? Now you’ve got the context you need to understand the impact and respond accordingly.
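
A minimal sketch of that kind of enrichment, in Python. The CMDB table, the switch and port names, and the field names are all made up for illustration; a real lookup would come from whatever your CMDB exports.

```python
# Hypothetical CMDB extract keyed by (switch, port); every name here is illustrative.
CMDB = {
    ("sw-core-01", "Gi1/0/12"): {
        "server": "db-prod-03",
        "application": "order-db",
        "business_process": "online checkout",
    },
}


def enrich_port_event(event, cmdb=CMDB):
    """Return a copy of the event with CMDB context merged in, if any is known."""
    context = cmdb.get((event.get("switch"), event.get("port")), {})
    return {**event, **context}


flap = {"switch": "sw-core-01", "port": "Gi1/0/12", "message": "link flap detected"}
enriched = enrich_port_event(flap)
print(enriched)
```

Events for ports that are not in the table pass through unchanged, so a gap in the CMDB never drops data.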

Or, say you have a huge amount of one kind of log data, but you only care about a subset of it, defined by external information, and ingesting all of it into your analysis system is prohibitively expensive. This is exactly the situation one of our customers found themselves in: they had too much DNS log data to ingest, but they really only cared about the subset that didn’t match “trusted” domains. So they enriched the data with a list of trusted domains and filtered out the records for those domains, ingesting only the log data they needed for analysis. This reduced their ingestion requirement by orders of magnitude, making it an affordable approach for them.
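
The enrich-then-filter idea can be sketched in a few lines of Python. The trusted-domain list, the record layout, and the suffix-matching rule are assumptions for the sake of the example, not a description of the customer's actual setup.

```python
# Hypothetical trusted-domain list; in practice this would be an external lookup file.
TRUSTED_DOMAINS = {"example.com", "example.org"}


def is_trusted(query_name, trusted=TRUSTED_DOMAINS):
    """True if the queried name is a trusted domain or a subdomain of one."""
    labels = query_name.rstrip(".").lower().split(".")
    # Check the name itself and each parent: a.b.example.com, b.example.com, example.com, com
    return any(".".join(labels[i:]) in trusted for i in range(len(labels)))


def untrusted_only(records):
    """Tag each record with a 'trusted' flag and yield only the untrusted ones."""
    for record in records:
        enriched = {**record, "trusted": is_trusted(record["query"])}
        if not enriched["trusted"]:
            yield enriched


records = [
    {"client": "10.0.0.5", "query": "www.example.com."},
    {"client": "10.0.0.7", "query": "suspicious.invalid"},
    {"client": "10.0.0.9", "query": "notexample.com"},
]
kept = list(untrusted_only(records))
print(f"kept {len(kept)} of {len(records)} records")
```

Matching on whole labels rather than on a plain string suffix keeps a name like notexample.com from being mistaken for a subdomain of example.com.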

Another great use case is adding GeoIP information to the data as it comes in. Sure, you can do that at search time in Splunk, but if you have multiple tools, you have to figure out how to do that in all of them. If you do that lookup before sending it to the downstream systems, it only has to be done once, and all downstream systems benefit from it. Less maintenance and consistent results across the board.

Metric Generation

Often, log files contain incredibly valuable information, but it needs to be extracted from the log entry and aggregated to be valuable. Weblog entries, for example, are rarely individually valuable. While what someone is looking for might vary, it’s usually the metrics about access that matter, not the individual accesses. For example, let’s say you have 1000 lines of weblog data, similar to this:

128.241.220.82 - - [03/Apr/2020:20:30:05 +0000] "GET /static/jquery.js?&JSESSIONID=SD2581716739$SL2122330098FF8932042391ADFF3720110694 HTTP/1.1" 200 2484 "/cart.do?action=view&itemId=EST-16&product_id=MC-SANDISK-MICROSD16GB" "Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3"
64.66.0.20 - - [03/Apr/2020:20:30:05 +0000] "GET /product.screen?product_id=HS-MONST-NERGY&JSESSIONID=SD1837132548$SL4493168124FF7003251314ADFF2222394401 HTTP/1.1" 404 3818 "/product.screen?product_id=BT-HS-JAWB-ICONTHD" "BlackBerry9300/5.0.0.955 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/102" 
12.130.60.4 - - [03/Apr/2020:20:30:03 +0000] "POST /category.screen?category_id=ACCESSORIES&JSESSIONID=SD8687719920$SL6155682857FF6085796020ADFF1246778254 HTTP/1.1" 400 2967 "/product.screen?product_id=CC-T11-ZAGG-FOLIO" "Mozilla/5.0 (iPad; U; CPU OS 4_3_5 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8L1 Safari/6533.18.5" 
130.253.37.97 - - [03/Apr/2020:20:30:05 +0000] "POST /product.screen?product_id=BT-SP-JAWB-JAMBOXBIG&JSESSIONID=SD8401052943$SL2691867954FF9065133477ADFF6965824981 HTTP/1.1" 404 722 "/cart.do?action=addtocart&itemId=EST-12&product_id=BA-HTC-REZOUND" "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; de-de) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5"
194.8.74.23 - - [03/Apr/2020:20:30:05 +0000] "GET /cart.do?action=changequantity&itemId=EST-18&product_id=BA-MOPHIE-JUICEPACKPLUS&JSESSIONID=SD8190965089$SL7522258463FF7229085117ADFF6846367911 HTTP/1.1" 200 3758 "/category.screen?category_id=MEMORYCARDS" "Mozilla/5.0 (iPad; U; CPU OS 4_3_5 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8L1 Safari/6533.18.5"
125.17.14.100 - - [03/Apr/2020:20:30:05 +0000] "GET /static/6051.jpg?&JSESSIONID=SD1073290485$SL6642531837FF5469045339ADFF7796274172 HTTP/1.1" 200 846 "/category.screen?category_id=CHARGERS" "Mozilla/5.0 (Linux; U; Android 2.3.4; en-us; T-Mobile G2 Build/GRJ22) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
130.253.37.97 - - [03/Apr/2020:20:30:04 +0000] "GET /product.screen?product_id=AC-MOTO-HOTSPOT4G&JSESSIONID=SD8576365728$SL9394190596FF4303629878ADFF2344394698 HTTP/1.1" 200 2410 "/category.screen?category_id=BLUETOOTH" "Mozilla/5.0 (Linux; U; Android 2.3.4; en-us; DROID3 Build/5.5.1_84_D3G-55) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
27.175.11.11 - - [03/Apr/2020:20:30:02 +0000] "GET /static/9403.jpg?&JSESSIONID=SD7103245756$SL6669302782FF9250881909ADFF9216942956 HTTP/1.1" 200 3346 "/category.screen?category_id=BLUETOOTH" "Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3"
195.69.160.22 - - [03/Apr/2020:20:30:03 +0000] "GET /category.screen?category_id=CASES&JSESSIONID=SD5212008800$SL9669846961FF8508958439ADFF3355227402 HTTP/1.1" 200 3399 "/category.screen?category_id=BATTERIES" "Mozilla/5.0 (Linux; U; Android 2.3.4; en-us; DROID3 Build/5.5.1_84_D3G-55) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
86.9.190.90 - - [03/Apr/2020:20:30:05 +0000] "GET /category.screen?category_id=BATTERIES&JSESSIONID=SD5330660580$SL5721426140FF1739646253ADFF1837268871 HTTP/1.1" 503 3561 "/category.screen?category_id=CASES" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_5 like Mac OS X; en-gb) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8L1 Safari/6533.18.5"

 

If you’re really just interested in how many times each page was hit, you can summarize a count of the hits, grouped by the page URI (minus URI query strings), filtering out images that are embedded in the pages:

/category.screen 220
/product.screen 246
/static/jquery.js 38
/cart.do 30
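
A summary like this could be produced along the following lines in Python. This is a sketch: the regular expression and the list of image extensions are assumptions, and the sample lines are shortened versions of the entries above.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the quoted request in an access log line: "METHOD URI HTTP/x.y"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<uri>\S+) HTTP/[\d.]+"')
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")  # assumed set of image types


def count_hits_by_page(lines):
    """Count hits per URI path, ignoring query strings and embedded images."""
    hits = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # skip lines that do not look like access log entries
        path = urlsplit(match.group("uri")).path  # drops the query string
        if path.lower().endswith(IMAGE_EXTENSIONS):
            continue
        hits[path] += 1
    return hits


sample = [
    '128.241.220.82 - - [03/Apr/2020:20:30:05 +0000] "GET /static/jquery.js?&JSESSIONID=SD2581716739 HTTP/1.1" 200 2484',
    '64.66.0.20 - - [03/Apr/2020:20:30:05 +0000] "GET /product.screen?product_id=HS-MONST-NERGY HTTP/1.1" 404 3818',
    '125.17.14.100 - - [03/Apr/2020:20:30:05 +0000] "GET /static/6051.jpg?&JSESSIONID=SD1073290485 HTTP/1.1" 200 846',
    '130.253.37.97 - - [03/Apr/2020:20:30:04 +0000] "GET /product.screen?product_id=AC-MOTO-HOTSPOT4G HTTP/1.1" 200 2410',
]
page_hits = count_hits_by_page(sample)
for path, count in page_hits.most_common():
    print(path, count)
```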

 

Say you also wanted to get a breakdown of the same accesses by the location of the requestor. You can use a GeoIP tool, like the MaxMind GeoIP database, to look up the location of each requestor’s IP address, enrich the data with the result, and then summarize a count of hits grouped by the requestor’s country:

US 403
UK 234
Korea (South) 112
India 213
Spain 18
Bahamas 20
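
The lookup-then-aggregate step might look like the Python sketch below. To keep it self-contained, a small made-up table stands in for the GeoIP database: the address ranges are documentation-only networks and the countries assigned to them are arbitrary, so only the shape of the logic carries over to a real lookup.

```python
import ipaddress
from collections import Counter

# Stand-in for a real GeoIP database: documentation-only address ranges mapped
# to arbitrary countries. A real pipeline would query a GeoIP database instead.
FAKE_GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "US"),
    (ipaddress.ip_network("198.51.100.0/24"), "UK"),
    (ipaddress.ip_network("203.0.113.0/24"), "India"),
]


def lookup_country(ip, table=FAKE_GEO_TABLE):
    """Return the country for an IP address, or 'unknown' if no range matches."""
    address = ipaddress.ip_address(ip)
    for network, country in table:
        if address in network:
            return country
    return "unknown"


def count_hits_by_country(events):
    """Enrich each event with a country, then count hits per country."""
    hits = Counter()
    for event in events:
        enriched = {**event, "country": lookup_country(event["client_ip"])}
        hits[enriched["country"]] += 1
    return hits


events = [
    {"client_ip": "192.0.2.10", "path": "/cart.do"},
    {"client_ip": "192.0.2.99", "path": "/product.screen"},
    {"client_ip": "198.51.100.7", "path": "/product.screen"},
    {"client_ip": "10.1.2.3", "path": "/category.screen"},
]
country_hits = count_hits_by_country(events)
for country, count in country_hits.most_common():
    print(country, count)
```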

 

By extracting the values and creating the aggregates “in the stream”, the needed metrics are readily available. As a result, you can just send the aggregated metrics to the analysis/reporting system instead of the full logs. Need the metrics data in multiple tools? The data can be delivered to each one in the format it expects, like Splunk metrics or statsd formats.
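
For the statsd case, each aggregate becomes a one-line counter of the form name:value|c. The Python sketch below renders the page counts from earlier that way; the weblog.hits prefix and the name-cleaning rule are assumptions made for the example.

```python
import re


def to_statsd_counters(counts, prefix="weblog.hits"):
    """Render {name: count} as StatsD counter lines of the form '<bucket>:<value>|c'."""
    lines = []
    for name, value in sorted(counts.items()):
        # Reduce names to letters, digits and underscores so they cannot collide
        # with StatsD's ':' and '|' delimiters or the '.' namespace separator.
        bucket = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")
        lines.append(f"{prefix}.{bucket}:{value}|c")
    return lines


page_hits = {
    "/category.screen": 220,
    "/product.screen": 246,
    "/static/jquery.js": 38,
    "/cart.do": 30,
}
statsd_lines = to_statsd_counters(page_hits)
print("\n".join(statsd_lines))
```

Four short lines like these replace the 1000 raw log entries they were computed from, which is where the savings come from.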

Data Cleansing/Reduction

No two ways about it, logs are noisy. In the application development world, it’s far more expensive to have to go back into the code to add new elements to logging than to simply log everything up front. Unfortunately, that means you end up with a lot of info in the logs you don’t want. For example, look at the following excerpt from an AWS API Gateway log entry:

{
  "resource": "/done",
  "path": "/done",
  "httpMethod": "POST",
  "queryStringParameters": null,
  "multiValueQueryStringParameters": null,
  "pathParameters": null,
  "stageVariables": null,
  "requestContext": {
    "resourcePath": "/done",
    "httpMethod": "POST",
    "identity": {
      "user": null
    }
  }
}

The fields with null values (queryStringParameters, multiValueQueryStringParameters, pathParameters, stageVariables, and user) provide marginal value, if any, in analysis. Removing those null-value fields reduces the data ingested into the analysis system(s). While it may not seem like much, as you scale up, it adds up very quickly. If retention has been separated from analysis, you’ll also have the freedom to cut out any fields you don’t think are valuable to your analysis, or even whole records. Of course, you’ll want to be careful, since you may find use for the removed fields later. Since you have the raw logs, though, you can always re-ingest the data.
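
Stripping null fields is simple to express in code. Here is a minimal Python sketch applied to the excerpt above; the function name is illustrative, and it assumes the event is valid JSON by the time it is parsed.

```python
import json


def drop_nulls(value):
    """Remove dict keys whose value is None (JSON null), at any nesting depth."""
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value]
    return value


raw = """
{
  "resource": "/done",
  "path": "/done",
  "httpMethod": "POST",
  "queryStringParameters": null,
  "multiValueQueryStringParameters": null,
  "pathParameters": null,
  "stageVariables": null,
  "requestContext": {
    "resourcePath": "/done",
    "httpMethod": "POST",
    "identity": {"user": null}
  }
}
"""
cleaned = drop_nulls(json.loads(raw))
print(json.dumps(cleaned, indent=2))
```

Note that identity is left behind as an empty object once its only field is removed; whether to drop empty containers as well is a separate decision.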

Scratching the Surface

Though there are some great use cases in here, it’s just scratching the surface. I’m sure each of you reading this has a unique need that these kinds of capabilities can help solve. Our product, Cribl LogStream, provides these capabilities, and I encourage you to take a drive through our interactive sandbox environment to see how LogStream could help with those needs.

 
