
Six techniques to control log volume

Log Analytics tools are one of the most expensive categories of observability and security tooling, and costs are continuing to increase dramatically

Last edited: August 27, 2025

Introduction.

Log Analytics tools are one of the most expensive categories of IT and security tooling, and costs are continuing to increase dramatically. IDC expects that 129,361 EB of data will be generated in 2023 and that this figure will more than double to 291,122 EB by 2027, a 2022-2027 CAGR of 22.4%. That growth is accompanied by an explosion of log data being collected and stored. With costs at large customers already in the tens of millions of dollars each year, double-digit annual growth translates into millions of dollars of additional spend.

Log Analytics system costs are a function of the volume of data they ingest and store. Historically, the amount of infrastructure required for a log analysis system scaled linearly with the amount of data stored. Newer architectures separate storage and compute, which allows for longer retention at lower cost, but in every architecture more data requires more compute at query time. Quite simply: more data, more cost.

At the same time, administrators of these systems have long known that much of the data they are asked to store is junk. Error storms or simply bad logging often produce repeated messages with duplicative content. Legacy systems often encode high-volume, metric-type data as logs. Compliance requirements often turn log analysis tools into a system of record, where data that is never queried is stored just in case of a breach or lawsuit. Individual events often carry superfluous information, usually key-value pairs in structured logs that are set to null or are of no value to the consumer. All of this adds up to a ton of wasted ingest, driving up the cost of log tooling dramatically while providing little value.

With a little bit of work, administrators can often trim ingestion rates dramatically. We’ve observed organizations achieve a 25% reduction in daily ingestion volume with relatively low effort, and more aggressive programs can hit 50% or more depending on the environment and the types of data involved. Let’s review a few techniques these administrators are using to achieve those savings.

1. Filtering.

The oldest and simplest technique for controlling log volume is simply not to store it at all. Most log analysis systems have some mechanism for dropping data administrators find not to be valuable, although the controls for selecting which data to drop often leave a lot to be desired.
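As a minimal sketch of what a filter rule looks like in a pre-processing step, the snippet below reads newline-delimited JSON events and drops anything matching a simple denylist. The field names (level, source) and the drop rules are hypothetical; adjust them to your own data.

```python
import json
import sys

# Hypothetical drop rules: levels and sources we never want to index.
DROP_LEVELS = {"DEBUG", "TRACE"}
DROP_SOURCES = {"healthcheck", "keepalive"}

def keep(event: dict) -> bool:
    """Return True if the event should be forwarded to the log analytics tool."""
    if str(event.get("level", "")).upper() in DROP_LEVELS:
        return False
    if event.get("source") in DROP_SOURCES:
        return False
    return True

for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        sys.stdout.write(line)  # pass through anything we can't parse
        continue
    if keep(event):
        sys.stdout.write(json.dumps(event) + "\n")
```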

2. Routing to lower cost storage.

Log Analytics systems are very expensive systems of record. Storing data in an inverted index balloons storage requirements, and software license costs can be a major factor. Data collected for compliance or breach-investigation use cases, with no day-to-day use, is probably best kept in low-cost object storage such as Amazon S3 or MinIO. Rather than sending it to a log analysis tool, administrators can divert that data to an object store, organized into well-structured directories. It can then be ingested later into their log analytics tool of choice or into query environments like Amazon Athena or Snowflake.
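A rough sketch of this kind of archival routing follows, using boto3 to write compressed batches to S3 under date-partitioned prefixes. The bucket name and the source/year/month/day key layout are assumptions for illustration, not a prescribed scheme.

```python
import gzip
from datetime import datetime, timezone

import boto3

# Hypothetical bucket; in practice this comes from your pipeline configuration.
BUCKET = "compliance-archive"
s3 = boto3.client("s3")

def archive_batch(events: list[str], source: str) -> None:
    """Compress a batch of raw events and write it to a date-partitioned S3 key."""
    now = datetime.now(timezone.utc)
    key = f"{source}/year={now:%Y}/month={now:%m}/day={now:%d}/{now:%H%M%S}.log.gz"
    body = gzip.compress("\n".join(events).encode("utf-8"))
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
```

Organizing keys by date like this makes later replay or ad hoc querying easier, since tools such as Athena can treat those prefixes as partitions.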

3. Dropping unnecessary fields.

The developer who originally builds a log message has a strong incentive to stuff it with as much information as they have, to avoid having to go back and add more later. As a consumer of this data, however, especially data from devices and software outside the organization's control, much of that information is unnecessary. Cisco eStreamer includes dozens of key-value pairs in every message that are set to NULL or N/A. Microsoft PowerShell logs contain a full copy of the script in every log message. Preprocessing this data to remove unnecessary fields can result in dramatic savings.
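A simple sketch of this kind of field pruning for structured events is shown below. The empty-value markers and the denylisted field name are hypothetical examples; the right lists depend entirely on your sources.

```python
import json

# Placeholder values that add no analytical value, plus a hypothetical bulky field.
EMPTY_VALUES = (None, "", "NULL", "N/A", "-")
DROP_FIELDS = {"raw_payload"}  # hypothetical example of a large, low-value field

def slim(event: dict) -> dict:
    """Return a copy of the event without empty or denylisted fields."""
    return {
        k: v
        for k, v in event.items()
        if k not in DROP_FIELDS and v not in EMPTY_VALUES
    }

raw = '{"src_ip": "10.0.0.5", "user": "N/A", "threat_name": "NULL", "action": "allow"}'
print(json.dumps(slim(json.loads(raw))))  # {"src_ip": "10.0.0.5", "action": "allow"}
```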

4. Converting logs to metrics.

Many of the highest volume data sources are really metrics in disguise: web activity logs, network flow logs, or custom application telemetry. If a measurement value in the log is the reason for ingesting it, it makes sense to aggregate those logs into summary metrics. Metrics can still be stored in a logging tool, often with a dramatic reduction in event counts and data volume, or they can be sent to a dedicated time series database like InfluxDB, Circonus, or DataDog for efficient storage and retrieval. Some logging tools, such as Splunk, also provide efficient metrics storage.
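As an illustration of the aggregation step, the sketch below rolls individual web access events up into one metric event per minute and status code. It assumes each parsed event carries timestamp, status, and bytes fields; those names are placeholders.

```python
from collections import defaultdict

def aggregate(events: list[dict]) -> list[dict]:
    """Roll access-log events up into one summary metric per minute and status code."""
    buckets: dict[tuple[str, int], dict] = defaultdict(lambda: {"count": 0, "bytes": 0})
    for e in events:
        minute = e["timestamp"][:16]          # e.g. "2025-08-27T10:42"
        key = (minute, e["status"])
        buckets[key]["count"] += 1
        buckets[key]["bytes"] += e["bytes"]
    return [
        {"timestamp": minute, "status": status, **agg}
        for (minute, status), agg in buckets.items()
    ]
```

Thousands of request events per minute collapse into a handful of summary events, while counts and byte totals remain queryable.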

5. Deduplicating event streams.

Log data contains a lot of duplicative information. In a distributed system, many worker processes will emit errors based on the same underlying problem. Many applications emit series of events that are useful but far more frequent than necessary. Deduplicating these streams to emit less frequent messages, along with a count of the dropped duplicates, can give an investigator the same information at a significantly lower data volume.
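One minimal way to sketch this is a windowed deduplicator that buffers events keyed on their message text and periodically flushes one representative per message with a repeat count. Keying only on the message field is an assumption; real pipelines often normalize or hash a subset of fields instead.

```python
import time

class Deduplicator:
    """Suppress repeated messages, emitting one representative plus a count per window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.pending: dict[str, dict] = {}  # message text -> {"event": ..., "count": ...}
        self.window_start = time.monotonic()

    def push(self, event: dict) -> list[dict]:
        """Buffer the event; return aggregated events once the current window closes."""
        msg = event.get("message", "")
        entry = self.pending.setdefault(msg, {"event": event, "count": 0})
        entry["count"] += 1
        if time.monotonic() - self.window_start < self.window:
            return []
        flushed = [
            {**e["event"], "repeat_count": e["count"]} for e in self.pending.values()
        ]
        self.pending.clear()
        self.window_start = time.monotonic()
        return flushed
```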

6. Sampling.

Sampling is a technique used under the covers by many analysis systems to speed queries or provide fast summary views. During an election, pollsters don’t call every voter; they call a representative sample to get a view of the electorate with an estimated margin of error. Sampling can be applied to great effect on high volume data streams to give a representative view, still allowing for read-time aggregation, while storing only a fraction of the original events.
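A simple sketch of stream sampling follows: routine events are kept at roughly 1 in N and tagged with the rate, while errors are always kept. The level field and the 1-in-10 rate are assumptions for illustration.

```python
import random
from typing import Optional

SAMPLE_RATE = 10  # keep roughly 1 in 10 routine events

def sample(event: dict) -> Optional[dict]:
    """Always keep errors; keep a random 1-in-N of everything else, tagged with the rate."""
    if event.get("level") in ("ERROR", "CRITICAL"):
        return event
    if random.randrange(SAMPLE_RATE) == 0:
        return {**event, "sample_rate": SAMPLE_RATE}
    return None
```

Recording the sample rate on each kept event lets read-time aggregations scale counts back up (count multiplied by sample_rate) to estimate the original volume.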

Implementing these techniques.

Log Analytics tools often provide support for some kind of filtering, but most of the other techniques require a pre-processing step. Our recommendation is, of course, our product, Cribl Stream™, which helps our customers implement all of these techniques and more. Below are some examples.

A global financial services firm wanted to use Splunk to analyze DNS logs, but adding a terabyte of data per day to their existing license was deemed too expensive. By using Stream to enrich the data with a list of top internet domains, they were able to drop uninteresting logs from trusted domains and reduce one terabyte per day to about 50 gigabytes – well within their budget.

A leading Asia-Pacific energy supplier wanted to collect and send data from 11 offshore oil platforms, but bandwidth at these remote locations made that nearly impossible. They chose Stream to filter, aggregate, and sample data before sending it, compressed, over their satellite links back to their centralized log system for analysis.

In addition to Stream, there are a number of different tools we’ve seen implement these techniques; check out our post on building an observability pipeline on top of open source technologies for more information on build options.

Cribl, the Data Engine for IT and Security, empowers organizations to transform their data strategy. Customers use Cribl’s suite of products to collect, process, route, and analyze all IT and security data, delivering the flexibility, choice, and control required to adapt to their ever-changing needs.

We offer free training, certifications, and a free tier across our products. Our community Slack features Cribl engineers, partners, and customers who can answer your questions as you get started and continue to build and evolve. We also offer a variety of hands-on Sandboxes for those interested in how companies globally leverage our products for their data challenges.
