No company starts out intending for its systems to become complex to the point of chaos. It happens organically, through a series of individually legitimate decisions. Take, for example, a fictional company, “Acme Corp.” Acme makes everything, and its operations team desperately needed a log management system, so they went out and set up Elasticsearch, installed a Beats agent on each application’s servers, and started basking in the glory of their newfound observability.
A little while later, Acme’s security team decided that they needed a SIEM and a log management solution of their own. Splunk was a better fit for them, so they bought it, stood it up, and had the infrastructure team start sending logs from firewalls, switches, and routers to it. The two teams have now effectively created two data silos, but as long as the silos don’t have to intermingle, they don’t cause any problems…
From Local Solution to Rube Goldberg Machine
At some point, the security team realizes that they’re missing system data and want a subset of the data that’s in Elasticsearch. Around the same time, the ops team realizes that they need some of the switching data that’s in Splunk. So each team starts providing data to the other via some copy mechanism.
Eventually, management decides that they want to use InfluxData as their metrics platform, but the operations team is swamped, so someone decides to feed Influx the data that’s native to Splunk as well as the data that’s been copied over from Elasticsearch, making the environment look something like this:
We’re starting to see the makings of a Rube Goldberg machine here, with some significant unintended consequences:
- Overhead – additional tax of extracting and copying the data.
- Latency – additional hops mean data takes longer to get into end systems.
- Dependencies – the Influx data now not only has dependencies, but a chain of dependencies, potentially making data flow brittle.
- No clear “Source of Truth” – data transformation might happen in multiple places, making it a challenge to understand what’s happened to the data.
The Advantage of Stream Processing
However, introducing an observability pipeline into the stream of data, before delivering the stream to any of the systems, provides incredible flexibility and removes overhead from your analytics systems. And if that pipeline is optimized for working with log and metric data in a streaming model, it can do all of this faster and more cheaply than the end systems can.
Each end system has different requirements for the data that passes through it: Splunk expects a _raw field, Elasticsearch expects a message field, and so on. Log data is unstructured and lacks context. Some log analytics tools help you deal with data structure, mostly in the form of clear-text search and field extraction, but those capabilities vary widely. So does the approach to adding context: some systems, like Splunk, can enrich data at search time, while others, like Elasticsearch, need the enrichment done at ingestion. It’s incredibly easy to end up with the same data sets in two different systems that don’t match each other.
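To make the field-convention mismatch concrete, here is a minimal sketch of reshaping one event for two destinations. The field names follow common defaults (Splunk HEC wraps raw text in an "event" payload; Elasticsearch/Beats conventions use "message"); the function names and the sample event are purely illustrative.

```python
def to_splunk(event: dict) -> dict:
    # Splunk HEC typically carries the raw log text in an "event" payload.
    return {"event": event["raw"], "sourcetype": event.get("type", "generic")}

def to_elastic(event: dict) -> dict:
    # Elasticsearch/Beats conventions put the raw line in a "message" field.
    return {"message": event["raw"], "host": event.get("host")}

# One upstream event, two destination-specific shapes.
sample = {"raw": "Oct 11 22:14:15 fw01 DROP TCP 10.1.2.3", "host": "fw01", "type": "syslog"}
splunk_doc = to_splunk(sample)
elastic_doc = to_elastic(sample)
```

Do this translation in each end system separately and the two copies drift; do it once in a pipeline and every destination gets a consistent view of the same event.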
Enriching data before it ends up in any end system ensures that you only have to do the work once, and can reap the benefits in all of the systems. Cleansing data “in the stream” provides a mechanism to gain consistency across the different systems. Being able to cleanse data before ingestion into the end systems helps you ensure data quality across systems, as well as minimize the amount of data you ingest into the systems (which can directly impact your system costs).
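The “enrich once, benefit everywhere” idea can be sketched as a tiny fan-out loop. This is an assumption-laden illustration, not any product’s API: the lookup table and field names are hypothetical, and each destination is just a callable.

```python
# Hypothetical lookup table used to add context to events in the stream.
HOST_OWNERS = {"fw01": "netops", "web01": "platform"}

def enrich(event: dict) -> dict:
    # Enrichment happens once, upstream of every end system.
    enriched = dict(event)
    enriched["owner"] = HOST_OWNERS.get(event.get("host"), "unknown")
    return enriched

def fan_out(event: dict, destinations) -> None:
    enriched = enrich(event)      # one enrichment pass...
    for send in destinations:
        send(enriched)            # ...shared by all destinations

sink_a, sink_b = [], []
fan_out({"host": "fw01", "raw": "DROP TCP"}, [sink_a.append, sink_b.append])
```

Because the enrichment runs before the fan-out, both sinks receive identical data, which is exactly the cross-system consistency the paragraph above describes.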
Taming the Chaos with LogStream
Though all of the concepts discussed in this article can be implemented with open source software and plenty of your time, we recommend our product, Cribl LogStream. LogStream provides a unified observability pipeline for all of your log and metric data. A single control plane allows you to manage data quality and context enrichment, ensuring consistency across the end systems. You can send data to Elasticsearch, Splunk, and InfluxDB (as well as many other destinations) that is specifically optimized for each of those platforms. Using our Acme example as a model:
LogStream can parse each event coming in from any of the source systems we support, and send it off to an archival store, like AWS S3. At the same time, it might strip security-sensitive or PII data from the copy forwarded to Elasticsearch, and send only security-related events to Splunk. Simultaneously, LogStream can extract metrics from the event and feed them directly to InfluxDB. All of LogStream’s configuration is done in its UI and versioned using Git, which makes it easy to track changes.
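The routing flow just described can be sketched in a few lines. To be clear, this is not LogStream configuration; it is a generic illustration in which each sink is a list standing in for a destination, and the PII pattern and field names are made up for the example.

```python
import re

# Stand-in sinks for the four destinations in the Acme example.
archive, elastic, splunk, influx = [], [], [], []

# Example PII pattern (US SSN); real pipelines would use a richer ruleset.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def route(event: dict) -> None:
    archive.append(event)                            # full-fidelity copy to the archive
    masked = dict(event, raw=SSN.sub("***-**-****", event["raw"]))
    elastic.append(masked)                           # PII stripped for Elasticsearch
    if event.get("tag") == "security":
        splunk.append(event)                         # only security events to Splunk
    if "duration_ms" in event:
        influx.append({"measurement": "latency",     # metric extracted for InfluxDB
                       "value": event["duration_ms"]})
```

Each branch is independent, so one incoming stream feeds four differently shaped outputs without the end systems doing any of the work.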
The march to chaos is real. Most companies I’ve talked to have some variety of this problem, causing various levels of pain. If you find yourself in this situation, I suggest you take a look at the Cribl LogStream product. The best way to do this is to take an hour or two and run through the LogStream Fundamentals course in our interactive sandbox environment.