In this live stream discussion, Eugene Katz and I explain the importance of a quality reference architecture in successful software deployment and guide viewers on how to begin with the Cribl Stream Reference Architecture. We help users establish end-state goals, share different use cases, and help data administrators identify which parts of the reference architecture apply to their specific situation. The discussion is also available on our podcast feed if you want to listen on the go. If you want to automatically get every episode of the Stream Life podcast, you can subscribe on your favorite podcast app.
The Cribl Stream Reference Architecture serves as a starting point for incorporating our vendor-agnostic observability pipeline into your existing IT and Security architecture. We know firsthand how difficult it can be to onboard and deploy new tools — mistakes were certainly made when we launched back— so we designed this information to help you get 70-80% of the way to a scalable deployment of our flagship product, Cribl Stream.
It’s impossible to account for all the variability in IT, but this framework should be a useful tool in helping set up your particular environment and avoid a lot of pain points as you grow. Keep in mind that applying the considerations here within the context of your network and security architecture is just as important as any of the technical guidance.
Establish Your End State Goal First
The most important thing you can do with any new deployment or takeover of an existing deployment is to define your end state at the beginning. For something mission-critical — like your logging, telemetry, or especially security logging — you have to decide on your business objective before anything else.
Let’s say you want a scalable platform that can survive failure to a certain level. What is that level? It’s good to know the average amount of data that gets processed on a good day, but what happens on a bad day? This is a very important discussion to have with your business leaders because it’s essential for your telemetry and security to work when everything’s going badly. You have to be able to reverse engineer how many cores, systems, load balancers, etc. you’ll need to have in place. Otherwise, you’re just picking a number out of thin air and rolling the dice. You could also miss out on an opportunity to align with your capacity team on the amount of hardware you’ll need.
General Sizing Considerations and Planning for Failure
CPU
We generally recommend allocating one physical core for each 400 GB/day of IN+OUT throughput. For virtual cores, plan on 200 GB/day each, but it’ll still be the same number of worker processes. There are more details in our Sizing and Scaling documentation for Graviton vs Intel-based worker processes, as well as recommendations for which VMs to choose for AWS or Azure deployments.
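As a rough illustration of the arithmetic above, here is a minimal sketch that turns daily volume into a core count. The function name and the example volumes are made up for illustration; only the 400 GB/day and 200 GB/day figures come from the guideline.

```python
import math

# Guideline figures from the text above
GB_PER_DAY_PER_PHYSICAL_CORE = 400
GB_PER_DAY_PER_VIRTUAL_CORE = 200


def cores_needed(in_gb_per_day, out_gb_per_day, virtual=False):
    """Estimate cores for a given daily IN+OUT volume (hypothetical helper)."""
    per_core = GB_PER_DAY_PER_VIRTUAL_CORE if virtual else GB_PER_DAY_PER_PHYSICAL_CORE
    total = in_gb_per_day + out_gb_per_day
    # Round up: a fraction of a core still needs a whole core
    return math.ceil(total / per_core)


# Example: 1,000 GB/day in and 1,500 GB/day out (illustrative numbers)
print(cores_needed(1000, 1500))                # 7 physical cores
print(cores_needed(1000, 1500, virtual=True))  # 13 virtual cores
```

This covers average throughput only. Heavier processing and bad-day spikes need headroom on top of it.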
As far as headroom for handling data spikes goes — that’s where distributed deployment comes in. You’ll distribute not only across the different worker processes and individual worker nodes, but you’ll also have multiple worker nodes and scale out horizontally.
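To make the bad-day conversation concrete, the sketch below estimates how many worker nodes you might need to absorb a spike and still lose a node or more. The peak factor, node size, and failure tolerance are assumptions you would agree on with your business leaders, not Cribl recommendations, and the function name is hypothetical.

```python
import math


def worker_nodes_needed(avg_cores, peak_factor, cores_per_node, tolerated_failures):
    """Nodes needed to carry peak load with some nodes down (hypothetical helper)."""
    # Scale the average-day core count up to the agreed bad-day peak
    peak_cores = math.ceil(avg_cores * peak_factor)
    # Nodes required just to carry the peak
    nodes_for_peak = math.ceil(peak_cores / cores_per_node)
    # Add spare nodes so the peak is still covered after failures
    return nodes_for_peak + tolerated_failures


# Example: 7 cores on an average day, a 2x spike, 8-core nodes, survive 1 node loss
print(worker_nodes_needed(7, 2.0, 8, 1))  # 3
```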
With Stream, you can not only pass all of your data through it, but you can also process your data along the way. Heavier processing, such as adding more regex or turning Windows XML into JSON, takes more CPU. You can use the pipeline profiling feature to run a sample and see how long an expression might be taking. Just note that results will vary with each user’s specific situation.
Memory
Big aggregations or large lookups get loaded into memory for each worker process and take up space, and each worker process gets about 2GB of memory by default. We learned about this the hard way — when we started loading in those giant lookups we suddenly started eating a whole lot more memory.
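Because lookups are loaded once per worker process, their memory cost multiplies with the process count. The sketch below is a conservative back-of-the-envelope estimate. It assumes the lookup footprint sits on top of the roughly 2 GB default for each process, and it ignores operating system overhead. The function name and example sizes are hypothetical.

```python
DEFAULT_GB_PER_WORKER_PROCESS = 2.0  # approximate default from the text above


def worker_memory_gb(worker_processes, lookup_gb_in_memory,
                     base_gb=DEFAULT_GB_PER_WORKER_PROCESS):
    """Rough memory estimate for one worker node (hypothetical helper).

    Assumes each process holds its own copy of the lookups,
    in addition to its baseline allocation.
    """
    return worker_processes * (base_gb + lookup_gb_in_memory)


# Example: 8 worker processes, lookups taking about 0.5 GB once loaded
print(worker_memory_gb(8, 0.5))  # 20.0
```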
JSON processing tends to be more CPU-bound than memory-hungry, but as you expand your use cases, you’ve got to be ready to add more memory and resources as appropriate.
Disk Size, Speed & Persistent Queues
Stream offers two options for writing to disk when one of your destinations is experiencing an outage or slowdown. Instead of losing that data or stopping its flow altogether, you can set up a source-persistent or destination-persistent queue as a temporary buffer, and once the destination is ready, Stream will start sending the queued events on.
Once the destination is restored, the data in a source-persistent queue will go through your whole pipeline, so it will take up a lot of resources as it flows all the way through to the destination. On the other hand, a destination-persistent queue will require fewer resources, because that data has already gone through the whole pipeline.
Destination queues are a great way to have a buffer in situations where you’re gathering data in a data center in another country and passing it into your security data lake before it’s processed. This leaves you with options in the case of failure. This is an area where your original business objectives come in — how will you size your persistent queue? Will you have an hour-long buffer, or maybe a 24-hour buffer? Be sure to think through these situations before they arise.
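One way to reason about the buffer question is to convert the outage window you want to survive into disk space. This is a simple sketch that assumes a steady data rate and no compression, so treat the result as a starting point rather than a precise figure. The function name and example volume are hypothetical.

```python
def queue_disk_gb(gb_per_day_to_destination, buffer_hours):
    """Disk needed to hold buffer_hours of data for one destination.

    Assumes an even data rate over the day and uncompressed queue files.
    """
    gb_per_hour = gb_per_day_to_destination / 24
    return gb_per_hour * buffer_hours


# Example: 1,200 GB/day flowing to a single destination
print(queue_disk_gb(1200, 1))   # 50.0 GB for a one-hour buffer
print(queue_disk_gb(1200, 24))  # 1200.0 GB for a 24-hour buffer
```

If the data is spread across several worker nodes, divide the total by the node count to get a per-node figure, and size for the spikes rather than the average.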
Connection Management
Managing connections is tough, especially when you’re working with thousands of data sources, universal forwarders, and pieces of network gear that need to be configured. We recommend always having load balancers available if you’re going to be working with agentless protocols like Syslog (TCP or UDP), HEC, and HTTP. Make sure you manage that connection overhead and don’t point everything at one server, or you’ll find yourself in a world of trouble.
Once you’re done balancing the load across the different workers, you have to account for the total number of connections. Up to 400 per CPU core can be manageable, depending on your EPS, but once you have more than 250 connections per core, you need to start testing what’s optimal for your architecture. What is your EPS and how sustained is it? How many forwarders do you have? How fast are they writing? Do you have big senders?
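The connection thresholds above can be turned into a quick sanity check. This sketch assumes your load balancer spreads connections evenly across workers, which big senders can easily break, and it says nothing about EPS. The function names and labels are hypothetical.

```python
def connections_per_core(total_connections, worker_nodes, cores_per_node):
    """Average connections per core, assuming an even spread across workers."""
    return total_connections / (worker_nodes * cores_per_node)


def assess(per_core, test_threshold=250, upper_bound=400):
    """Classify a per-core connection count using the figures from the text."""
    if per_core <= test_threshold:
        return "ok"
    if per_core <= upper_bound:
        return "test"  # may be manageable; validate against your EPS
    return "add capacity"


# Example: 10,000 forwarder connections over 4 nodes with 8 cores each
per_core = connections_per_core(10_000, 4, 8)
print(per_core, assess(per_core))  # 312.5 test
```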
Single Worker Groups vs. Multiple Worker Groups
A single, or all-in-one, worker group is appropriate for small- to medium-sized enterprises working with around 1 TB of data per day or less. If your sources are small enough that spikes are easy to absorb, or they are unlikely to reach capacity, then this type of architecture may be appropriate.
A setup involving multiple worker groups is necessary for larger organizations or if you have sensitive or complex data to process. The first thing that customers will do is split up pull and push worker groups. Push sources, like data from Syslog and universal forwarders, are usually consistent, but the pull side of things can be a different story. Mixing that steady traffic with the data you’re pulling down from CrowdStrike, which has a series of huge spikes followed by no data flow, might be problematic.
Your pull sources will also be managed by the leader in terms of scheduling, so you want to make sure that you have those sources fairly close to the leader to avoid running into network latency, and potentially having skipped pulls.
These are just some of the things to consider in the design of your enterprise’s architecture. Watch the live stream on Introducing the Cribl Stream Reference Architecture to get more detail and insights on integrating Cribl Stream into any environment, enabling faster value realization with minimal effort. This is the first of many discussions on the Cribl Stream Reference Architecture, tailored to SecOps and Observability data admins. Take advantage of this opportunity to empower your observability administration skills, and stay tuned for future conversations that will dive deeper into each of the topics discussed here.
The fastest way to get started with Cribl Stream, Edge, and Search is to try the Free Cloud Sandboxes.