
Putting Customers First and Amplifying Our Core Values

Written by Brandon McCombs

December 19, 2022

Cribl places high importance on its core values: Customer First, Always; Together; Curious; Irreverent but Serious; and Transparent. We strive to embody these values every day, and a recent customer issue gave us the chance to demonstrate them. The Cribl Support, Software Engineering, and Product Management teams worked together with our largest Cribl.Cloud customer to resolve throughput issues that arose when integrating Cribl.Cloud with Azure Event Hubs (EH).

We practice first principles thinking to ensure we don’t miss anything when we troubleshoot customer issues. In this case, the first step was to ensure their Azure Event Hub itself was tuned properly so that maximum throughput would be achieved once we lifted the bottleneck the customer discovered when using Cribl Stream nodes in Cribl.Cloud. The customer had already tested various configurations of partition quantities and pricing plans in an attempt to determine the bottleneck, but nothing seemed to help. Before we could tune EH and Stream, the Cribl Support team worked with the customer to conduct a sizing analysis to confirm their needs.

First, we asked what ingress rate they needed for moving data into the EH; the answer was 35 MBps, which told us the minimum required egress rate. But that was just to meet their current needs; it needed to scale to at least 80 MBps by year-end. And as we conducted further analysis of their data volume, we determined the need was actually for 100 MBps for a single topic. Ultimately Cribl decided to target 110 MBps to account for fluctuations, which was far greater than the throughput rates the customer was actually achieving. For initial testing and before opening the support case, the customer had deployed a Cribl.Cloud hybrid worker in Azure Compute and was achieving ~75-80 MBps with a single 4 CPU hybrid Stream worker, but when they used Cribl.Cloud native worker nodes in AWS, the throughput was only ~50 MBps with 15 CPUs. We were tasked with determining the cause of the discrepancy between the Azure Compute and AWS deployments.

Cribl Support engineers provided recommendations based on factoring in all components involved in the data flow. Our first recommendation was to ensure that the Azure Event Hub namespace used Azure’s Premium Plan (the customer had started with the Standard plan), with sufficient Processing Units (PUs). We also recommended scaling the Event Hub to use many more partitions.

Detailed Analysis

First, we considered the customer’s ingress/egress requirements for the EH. Their future EH ingress/egress rate of 110 MBps (9 TB/day ingress) was going to require the Event Hub Premium pricing plan because the Standard plan only supports up to 1 MBps ingress and 2 MBps egress per Throughput Unit (TU). Additionally, the Standard pricing plan limits TUs to 40, which equates to 40 MBps in + 80 MBps out and therefore wasn’t sufficient. The Premium plan also increases the maximum partitions to 100 and raises throughput limits to 10 MBps per PU on egress. Based on that 10 MBps egress figure, Cribl calculated that a minimum of 11 partitions was required to achieve 110 MBps. That number came solely from Azure docs, though, and we can’t size the Event Hub based on one side of the equation alone; we also have to factor in Cribl’s sizing guidelines for Stream.
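The Azure-side arithmetic in this paragraph can be checked with a minimal sketch. It uses only the figures quoted in the text; the constant and function names are illustrative and do not come from any Azure or Cribl API.

```python
import math

# Figures quoted in the text (names are illustrative only).
TARGET_MBPS = 110              # target ingress/egress rate
STANDARD_MAX_TUS = 40          # Standard plan cap on Throughput Units
STANDARD_INGRESS_PER_TU = 1    # MBps ingress per TU
STANDARD_EGRESS_PER_TU = 2     # MBps egress per TU
PREMIUM_EGRESS_MBPS = 10       # MBps egress figure used for the partition minimum


def standard_plan_ceiling():
    """Return the (ingress, egress) ceiling of the Standard plan in MBps."""
    return (STANDARD_MAX_TUS * STANDARD_INGRESS_PER_TU,
            STANDARD_MAX_TUS * STANDARD_EGRESS_PER_TU)


def min_partitions(target_mbps, per_partition_mbps):
    """Smallest partition count whose combined rate covers the target."""
    return math.ceil(target_mbps / per_partition_mbps)


def mbps_to_tb_per_day(mbps):
    """Convert MBps to TB/day using decimal units (1 TB = 1,000,000 MB)."""
    return mbps * 86_400 / 1_000_000


ingress_cap, egress_cap = standard_plan_ceiling()
print(ingress_cap, egress_cap)                            # 40 80
print(TARGET_MBPS <= ingress_cap)                         # False: Standard is too small
print(min_partitions(TARGET_MBPS, PREMIUM_EGRESS_MBPS))   # 11
print(round(mbps_to_tb_per_day(TARGET_MBPS), 1))          # 9.5
```

With decimal units, 110 MBps works out to roughly 9.5 TB/day, which the text rounds to 9 TB/day.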

Since the ARM guideline for each worker process is a recommended maximum of 2.84 MBps (240 GB/day) ingress, we would need to raise the topic’s partition quantity from 11, the minimum dictated by Azure’s parallelism, to about 40 in order to sufficiently distribute the load across worker processes for supporting 9 TB/day ingress while avoiding backpressure.
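The Stream-side estimate can be sketched the same way. This assumes the 2.84 MBps figure comes from converting 240 GB/day with 1 GB = 1024 MB (a decimal conversion would give about 2.78 MBps), and it treats one partition per worker process as the goal. Both are assumptions made for illustration, not statements of Cribl's sizing method.

```python
import math

SECONDS_PER_DAY = 86_400
GB_PER_DAY_PER_PROCESS = 240   # per-worker-process ingress guideline quoted in the text
TARGET_MBPS = 110              # target ingress rate from the text


def gb_per_day_to_mbps(gb_per_day, mb_per_gb=1024):
    """Convert GB/day to MBps. mb_per_gb=1024 is an assumption that reproduces 2.84."""
    return gb_per_day * mb_per_gb / SECONDS_PER_DAY


def processes_needed(target_mbps, per_process_mbps):
    """Worker processes needed so that none exceeds the per-process guideline."""
    return math.ceil(target_mbps / per_process_mbps)


per_process = gb_per_day_to_mbps(GB_PER_DAY_PER_PROCESS)
print(round(per_process, 2))                        # 2.84
print(processes_needed(TARGET_MBPS, per_process))   # 39
```

The result of 39 is consistent with the "about 40" partitions cited above.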

Next, we factored in egress from Stream. Our customer is sending events to two destinations, so the partition count increased again: that 9 TB/day of ingress translates to a minimum of 18 TB/day on egress, for a total of 27 TB/day (9 TB/day in plus 2 x 9 TB/day out). So the ingress rate of 2.84 MBps is now reduced by 50% to allow for 2x volume at egress. The math now necessitates a total of 59 partitions in order to sufficiently spread data over worker processes to avoid saturating them.

Additionally, the events being collected were very large Microsoft Defender service messages that must be unrolled in a Stream pipeline. This process explodes the event count (a ratio of 1:50 or higher was common with their data) and, with it, the subsequent processing requirements, so we added a few more partitions to spread the load even further among Stream processes and arrived at a round number of 70. We presented this figure to our customer; using this information, they configured their EH cluster with 80 partitions, then conducted additional tests. They achieved much higher throughput than they had up to that point but were still shy of the requirement.
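To illustrate why unrolling multiplies the event count, here is a minimal sketch of the general idea. It is not Cribl Stream's implementation, and the `records` field name and the message shape are hypothetical, chosen only to mirror the 1:50 ratio mentioned above.

```python
import json


def unroll(message, list_field="records"):
    """Yield one event per element of the list stored under list_field.

    list_field is a hypothetical field name; real messages may use a different layout.
    """
    payload = json.loads(message)
    for record in payload.get(list_field, []):
        yield record


# One hypothetical batched message carrying 50 records.
batched = json.dumps({"records": [{"id": i} for i in range(50)]})
events = list(unroll(batched))
print(len(events))  # 50
```

One inbound message becomes 50 events, each of which then has to be processed and routed on its own, which is why the per-process load grows so quickly.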

At this stage, Cribl Support attacked the problem from a different perspective to determine if the bottleneck was due to a much lower level cause, such as TCP settings, that differed between the customer’s hybrid worker and the native Cribl.Cloud workers. We conducted many tests to isolate variables but were unable to determine the root cause for the throughput differences.

Bringing in Reinforcements

Support now determined we needed some reinforcements in the form of Cribl.Cloud SREs so they could scale up the customer’s Cribl.Cloud environment in case there was an unexpected bottleneck at the infrastructure level. It had already been scaled up slightly a month or two prior, but we were re-evaluating the worker node quantity in case we had missed something. The SREs also assisted with troubleshooting from the infrastructure perspective. Although the SREs found some oddities in the network traffic between the Cribl.Cloud workers and Azure Event Hubs, they didn’t find anything that definitively and sensibly explained why throughput was higher for Azure cloud worker nodes than for AWS worker nodes. Some general theories suggested that networking between AWS and Azure might be at fault, but we still did not have specific and actionable answers. If we knew anything for certain at this point, it was everything the problem could not be; we still weren’t sure what the problem actually was.

A Possible Light at the End of the Tunnel

At about two months in, the customer began escalating the issue. Cribl Support then enlisted the software engineering team to begin analyzing the code used for collecting from Event Hubs to see if there was a defect. The software engineers didn’t find any blatant bugs but made some modifications to expose more settings to see if tuning those would make any difference. They also changed the code to use the native Azure Event Hubs library in case the KafkaJS library currently in use was at fault. The engineers then began conducting what became numerous internal test runs to see how much of a throughput difference the newly tunable parameters and new library made. As a result of that testing, they began to see improvements.

The software engineering team’s results indicated that the Azure EH native library, combined with specific values for the newly exposed settings, achieved throughput about 3x our highest to date. This was exciting news, and the customer was happy to see more progress. However, the excitement was short-lived because those throughput results weren’t consistent once testing was expanded to other Azure regions and especially between Cloud providers. Our software engineers discovered drastic throughput differences not just between AWS and Azure but even between regions within Azure. Testing showed the highest throughput was achieved when worker nodes were deployed in the same region as the Event Hub, but a drastic reduction in throughput occurred using any region other than Azure Central. We couldn’t explain this. Azure documentation couldn’t explain this. Our software engineers had started uncovering a more systemic problem with Azure Event Hubs.

Reaching Outside Cribl

We knew we wouldn’t be able to solve this problem without help from Azure Support. One of our recent SRE hires reached out to some contacts he had at Microsoft; at the same time, our customer reached out to their Microsoft Support contacts. As a result, Cribl Support and Engineering were able to communicate with their counterparts in Azure services. We shared our test results with them but had mixed progress and little feedback from that first interaction. Subsequently, both our SRE resource and our customer pushed harder within Microsoft to get us in touch with different people who could make waves.

Within a few days, another call was scheduled with Microsoft Azure Support. We explained our test results again. We spoke to different people this time who better understood the problem but still couldn’t explain our results so they initially tried to place blame on network latency. But our tests showed that throughput was bad even between regions within Azure for which Azure’s latency chart showed latency should be minimal (i.e., < 10 ms). We convinced them to do their own testing. And our Support management was adamant when requesting a follow-up meeting within two days that brought Microsoft, Cribl, and our mutual customer together to ensure continued progress. At that meeting, the Azure personnel stated they had performed their own tests and discovered the potential problem.

Finally!

So what was the solution? Azure EH engineers stated they made a change to increase the TCP socket receive buffer size within the C# .NET code that runs the backend for all Event Hubs brokers. The default value they used for that server-side setting was very small. This change provides for much larger payloads across Azure regions and between Azure and other cloud providers. It makes each poll to the Kafka broker receive more data in response, which makes the communication more efficient, and it is necessary because our customer requires at least 110 MBps. Azure Support told us that this change would be applied to all of their Event Hubs customers in all namespaces and regions!
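One general reason a small socket buffer can cap throughput over longer network paths is the bandwidth-delay product: a single TCP connection cannot move more than roughly one window of data per round trip. The sketch below shows only that general relationship. The buffer sizes and round-trip times are hypothetical and are not the values Azure used or changed.

```python
def max_throughput_mbps(window_bytes, rtt_ms):
    """Approximate per-connection ceiling in MBps: one window per round trip."""
    return window_bytes / (rtt_ms / 1000) / 1_000_000


# Hypothetical buffer sizes and round-trip times, for illustration only.
for window_bytes in (64 * 1024, 1024 * 1024):
    for rtt_ms in (1, 10, 60):
        ceiling = max_throughput_mbps(window_bytes, rtt_ms)
        print(f"{window_bytes:>8} B window, {rtt_ms:>2} ms RTT -> {ceiling:8.2f} MBps")
```

Under these assumed numbers, a 64 KiB window tops out near 6.5 MBps per connection at a 10 ms round trip, while the same window at 1 ms allows about ten times that. This would be consistent with throughput that looks fine inside one region and drops sharply across regions or cloud providers.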

Before this change, the best throughput we saw was about 188 MBps, and that was confined to a Stream worker node in Azure Central polling an Event Hub also located in Azure Central. With the backend change applied, the throughput increased to about 470 MBps, which was achieved across Cloud providers.

In summary, this was one of the semi-rare moments when three separate companies had to band together to achieve a common goal. We received amazing attention from Azure Support. On top of that, Cribl had numerous personnel engaged every day from Engineering, Support, PM, and SREs working together to arrive at a solution.

This problem intrigued us because we’d never seen anything like it before. We didn’t expect that our findings would result in backend changes to the Event Hubs service that could benefit every Event Hubs customer. But because our customer was affected, and it was such an interesting problem, it motivated us to find the root cause. We were also driven by our customer-first mentality because their success is our success. Our customer was patient and willing to work with us through the entire journey because they could see we were making progress through our daily updates via email, Slack, and phone; and, good news for us, they realized based on our hard facts that the root cause was on Microsoft’s side rather than Cribl’s. Another customer success story in the bag!
