Cribl Reference Architecture Series: Scaling Effectively for a High Volume of Agents

Last edited: September 18, 2023

In this livestream, Cribl’s Ahmed Kira and I explore the challenges of scaling your Cribl Stream architecture to accommodate a large number of agents, providing valuable insights on what you need to consider when expanding your Cribl Stream deployment.

Managing data flows from a high volume of agents presents a unique set of challenges that need to be addressed. Organizations need to meet business resiliency requirements and ensure the reliable transmission of data from endpoints to their analytics systems.

The Cribl Stream Reference Architectures can help you set up your infrastructure to handle those high-volume sources. Architectural considerations for Cribl environments are typically centered around daily volumes of data — but if you have tens of thousands of agents communicating directly with Cribl workers, you also have to consider the ratio of agents to Cribl worker processes.

In today’s distributed world, data comes from everywhere — from servers to workstations, laptops, and IoT devices. Every one of those agents is establishing, or opening and closing, TCP connections. But a process on a Linux host can only handle so many TCP connections coming in, so keep a close eye on your connection overhead.
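One concrete ceiling on Linux is the per-process limit on open file descriptors, since every TCP connection consumes one. Here is a rough sketch of how you might check your headroom. The function names and the 80% usable fraction are assumptions made for illustration, not Cribl guidance.

```python
import resource


def current_fd_soft_limit() -> int:
    """Return this process's soft limit on open file descriptors (Linux/Unix)."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft_limit


def fits_under_limit(expected_connections: int, soft_limit: int,
                     usable_fraction: float = 0.8) -> bool:
    """Check whether the expected inbound connections fit under the limit.

    usable_fraction is an assumed safety margin: the remainder is left for
    log files, pipes, and outbound sockets to destinations.
    """
    return expected_connections <= soft_limit * usable_fraction


if __name__ == "__main__":
    limit = current_fd_soft_limit()
    print(f"soft fd limit: {limit}")
    print("5,000 agents fit:", fits_under_limit(5000, limit))
```

The file descriptor limit is only one of several caps on connection counts, so treat a passing check as necessary rather than sufficient.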

Partitioning Workloads

When collecting from a large volume of agents, keep high-volume agents dedicated to their own worker group. Organizing these worker groups by location helps reduce latency: if you have a data center with tens of thousands of virtual machines that will be talking to Cribl, that location can be its own worker group.

When you separate these agents and keep Syslog and other high-volume sources off the same worker group, changes you make for one protocol won’t affect the others. You also get the ability to fail small: a problem stays in a single data center instead of spreading across many of them. Separating workloads makes it easier to monitor, manage, update, and scale your deployment.
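To make the partitioning idea concrete, here is a small planning sketch that buckets sources into worker groups by location and source type. The source names, locations, and the `location-kind` naming scheme are all hypothetical; this is a way to think through a layout, not a Cribl API.

```python
from collections import defaultdict


def plan_worker_groups(sources):
    """Bucket sources so each (location, kind) pair gets its own worker group."""
    groups = defaultdict(list)
    for source in sources:
        # Hypothetical naming convention: "<location>-<kind>"
        group_name = f"{source['location']}-{source['kind']}"
        groups[group_name].append(source["name"])
    return dict(groups)


# Hypothetical inventory for illustration only.
sources = [
    {"name": "vm-fleet-a", "location": "dc-east", "kind": "agents"},
    {"name": "vm-fleet-b", "location": "dc-east", "kind": "agents"},
    {"name": "firewalls", "location": "dc-east", "kind": "syslog"},
    {"name": "vm-fleet-c", "location": "dc-west", "kind": "agents"},
]
plan = plan_worker_groups(sources)
for group, members in plan.items():
    print(group, members)
```

In this layout, the agents in dc-east never share a worker group with Syslog, and a failure in dc-east leaves dc-west untouched.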

Be Careful Not to Overwhelm Your Destinations

With this kind of architecture, you want to consider the workload on your destinations. For example, if you have a worker group talking to a Splunk indexer cluster, it generates TCP connections from every worker process to each Splunk indexer. For these kinds of destinations, the max connection setting needs to be tuned so that you don’t overwhelm or create a bottleneck for your destination.
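A quick back-of-the-envelope estimate shows how fast those connections add up. This sketch assumes a simplified model in which each worker process opens one connection to every indexer unless a per-process cap applies; the function name, the parameter names, and the example numbers are illustrative assumptions.

```python
def destination_connections(workers, processes_per_worker, indexers,
                            max_connections=0):
    """Estimate TCP connections opened to an indexer cluster.

    max_connections models an assumed per-process cap on concurrent
    indexer connections; 0 means no cap (connect to every indexer).
    Returns (total connections, average connections per indexer).
    """
    if max_connections <= 0:
        per_process = indexers
    else:
        per_process = min(indexers, max_connections)
    total = workers * processes_per_worker * per_process
    return total, total / indexers


# 10 workers with 14 processes each, sending to 20 indexers.
print(destination_connections(10, 14, 20))                      # uncapped
print(destination_connections(10, 14, 20, max_connections=5))   # capped at 5
```

In this model, the uncapped case puts 140 connections on every indexer, while a cap of 5 brings the average down to 35.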

Cribl Stream improves throughput, so a bottleneck is likely to appear at, or move closer to, the indexers. Stream gives you plenty of options to manage that, but keep it in mind so you don’t just shift problems from one place to another.

Send Data to Multiple Destinations to Comply With Privacy Requirements

With the breakout of different worker groups in Cribl Stream, you have the option to send data to multiple data lakes and to your analytics tools. This could be especially useful for international deployments with different data sovereignty and privacy requirements.

If you’re picking up endpoint data in the EU, then you’ll fall in scope under GDPR. In this scenario, having a worker group in the EU gives you way more options than you would have if you were trying to homerun that data back to the US.

If you’re handling other types of sensitive data like PHI or PII, have some guardrails in place. Put at least two workers in your worker group to account for HA, regardless of how little data is flowing. After you use your calculators to size your deployment, add an extra worker (N+1) to account for bursts in throughput.

Oversizing for Failure

When you work with the team at Cribl to set up your architecture, they’ll typically recommend sizing to handle 150% of your planned data. But if you have business resilience requirements that require you to sustain more than 1.5 times your daily average, then you need to consider scaling up even further.
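Here is one way to turn that headroom rule into arithmetic, combined with the two-worker HA minimum and an N+1 spare (interpreted here as one extra worker). The per-worker capacity is a hypothetical input you would get from your own sizing calculators, and the function name and defaults are assumptions for this sketch.

```python
import math


def workers_needed(daily_volume_tb, per_worker_tb_per_day,
                   headroom=1.5, spare_workers=1, min_workers=2):
    """Estimate workers for a group, sized to headroom x the planned volume.

    per_worker_tb_per_day is a hypothetical capacity figure; take it from
    your own sizing calculators rather than from this sketch.
    """
    sized_volume = daily_volume_tb * headroom
    base = math.ceil(sized_volume / per_worker_tb_per_day)
    # Add the N+1 spare, then enforce the HA floor of two workers.
    return max(base + spare_workers, min_workers)


# 10 TB/day planned, assuming each worker handles 4 TB/day.
print(workers_needed(10, 4))
# Stricter resilience requirement: sustain 2x the daily average.
print(workers_needed(10, 4, headroom=2.0))
```

Raising `headroom` above 1.5 is how you would model a stricter resilience requirement in this sketch.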

We make that as easy as possible from an administrative standpoint, which is one of the reasons I fell in love with Cribl right away as a customer. If you have a worker group and want to add another server, it takes the same configuration as the other workers in that group. Managing fleets and subfleets instead of individual pieces makes things simple, so take advantage of that and avoid pointing all your endpoints at one server.

Determining EPS and How Many Agents/vCPU

One Cribl worker process can handle as many as 5,000 very low EPS agents. So if you have a Cribl worker with 14 worker processes on a 16 CPU system, that one Cribl worker can handle up to 70,000 agents (14 × 5,000).

But let’s be honest: how many of your agents generate fewer than three events per second? Maybe some, but not many. For most deployments, a volume of 30 EPS and 250 agents per vCPU is more appropriate as a baseline. It’s a much lower, intentionally conservative starting point.

Here are the guidelines based on events per second from different senders — you can find more information in our Multiple Agents Reference Architecture Documentation.

We assume three tiers of “chatty” agents, based on events per second. You’ll probably recognize your senders from these definitions:

Chatty agents (100 EPS/agent) – Size 150 agents/vCPU. (Examples are domain controllers or intermediary agents.)

Medium-chatty agents (30 EPS/agent) – Size 250 agents/vCPU. (Most servers will fall into this medium category.)

Low-volume agents (3 EPS/agent) – Size 5,000 agents/vCPU. (Examples are workstations.)

For the most accurate sizing, obtain EPS reports from your current observability tools.
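The three tiers above translate directly into a sizing estimate for a mixed fleet. In this sketch the agents-per-vCPU ratios come from the guidelines above, while the tier keys, the function names, and the example fleet are assumptions. It also treats one worker process as one vCPU and reuses the earlier example of 14 worker processes on a 16 CPU system.

```python
import math

# Agents per vCPU for each tier, taken from the guidelines above.
AGENTS_PER_VCPU = {"chatty": 150, "medium": 250, "low": 5000}


def vcpus_needed(agent_counts):
    """Sum the vCPUs required per tier, rounding each tier up."""
    return sum(
        math.ceil(count / AGENTS_PER_VCPU[tier])
        for tier, count in agent_counts.items()
    )


def worker_nodes_needed(vcpus, processes_per_node=14):
    """Convert vCPUs to worker nodes, assuming 14 worker processes per node."""
    return math.ceil(vcpus / processes_per_node)


# Hypothetical fleet: 300 domain controllers, 5,000 servers, 70,000 workstations.
fleet = {"chatty": 300, "medium": 5000, "low": 70000}
vcpus = vcpus_needed(fleet)
print(vcpus, worker_nodes_needed(vcpus))
```

For this fleet the tiers need 2, 20, and 14 vCPUs respectively, so 36 vCPUs or three 16 CPU worker nodes before any headroom or N+1 is added.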

Load Balancing Considerations

Load balancer configuration is especially important for agents like Fluent Bit, Fluentd, and others that support HTTP or AGC delivery — because they send data to a load balancer in front of the Cribl workers. Cribl tools work better without sticky sessions, so that data is distributed across different workers.
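A toy simulation shows why sticky sessions hurt distribution. It compares pinning every request from an agent to one worker against handing requests out in rotation. The hashing scheme, the agent name, and the request sizes are assumptions for illustration, and real load balancers have more nuanced behavior.

```python
import zlib
from itertools import cycle


def sticky_load(requests, worker_count):
    """Pin each agent to one worker, as a sticky session would."""
    load = [0] * worker_count
    for agent, events in requests:
        worker = zlib.crc32(agent.encode("utf-8")) % worker_count
        load[worker] += events
    return load


def round_robin_load(requests, worker_count):
    """Hand each request to the next worker in rotation (no stickiness)."""
    load = [0] * worker_count
    workers = cycle(range(worker_count))
    for _agent, events in requests:
        load[next(workers)] += events
    return load


# One hypothetical heavy sender: 100 requests of 10 events each.
requests = [("aggregator-01", 10)] * 100
print(sorted(sticky_load(requests, 4)))
print(round_robin_load(requests, 4))
```

With stickiness, all 1,000 events land on a single worker while the other three sit idle; in rotation, each of the four workers receives 250.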

For the most part, agents support auto load-balancing with their native protocols, so take advantage of that whenever you can. Don’t put data from a Splunk Universal Forwarder through a load balancer using S2S unless you’re completely against getting a nice distribution of data.

Cribl Edge also has a setting for load balancing that makes it easy to get the type of scale you’re looking for. You’ll be able to engage all your workers, get an even distribution, and have the ability to fail over if there’s a problem.

Cribl’s Reference Architectures are a starting point to get you 75% of the way towards deploying Cribl Stream. At that point, you can consult with the team at Cribl to adjust for your own unique requirements and make sure all other odds and ends are accounted for. Think about these things before you start so you can get as much value with as few problems as possible.

Watch the full livestream for more details on the keys to achieving a seamless, continuous flow of data from your endpoints to your destinations — including considerations for different amounts of throughput/agents and an example of how we would recommend deploying Stream as a retailer sending data to their SIEM.

The updated Cribl Reference Architecture equips administrators with the tools and guidance to tackle these issues proactively, helping to prevent potential disruptions to your business operations.


