December 19, 2022
Cribl places high importance on its core values: Customer First, Always; Together; Curious; Irreverent but Serious; and Transparent. We strive to embody these values every day, and a recent customer issue gave us the chance to demonstrate them. The Cribl Support, Software Engineering, and Product Management teams worked together with our largest Cribl.Cloud customer to resolve throughput issues that arose when integrating Cribl.Cloud with Azure Event Hubs (EH).
We practice first principles thinking to ensure we don’t miss anything when we troubleshoot customer issues. In this case, the first step was to ensure their Azure Event Hub itself was tuned properly so that maximum throughput would be achieved once we lifted the bottleneck the customer discovered when using Cribl Stream nodes in Cribl.Cloud. The customer had already tested various configurations of partition quantities and pricing plans in an attempt to determine the bottleneck, but nothing seemed to help. Before we could tune EH and Stream, the Cribl Support team worked with the customer to conduct a sizing analysis to confirm their needs.
First, we asked what ingress rate they needed for moving data into the EH; the answer was 35 MBps, which told us the minimum required egress rate. But that only met their current needs; the rate needed to scale to at least 80 MBps by year-end. As we conducted further analysis of their data volume, we determined the need was actually 100 MBps for a single topic. Ultimately, Cribl decided to target 110 MBps to account for fluctuations, which was far greater than the throughput the customer was actually achieving. For initial testing, before opening the support case, the customer had deployed a Cribl.Cloud hybrid worker in Azure Compute and was achieving ~75-80 MBps with a single 4-CPU hybrid Stream worker, but when they used Cribl.Cloud native worker nodes in AWS, the throughput was only ~50 MBps with 15 CPUs. We were tasked with determining the cause of the discrepancy between the Azure Compute and AWS deployments.
Cribl Support engineers provided recommendations based on factoring in all components involved in the data flow. Our first recommendation was to ensure that the Azure Event Hub namespace used Azure’s Premium Plan (the customer had started with the Standard plan), with sufficient Processing Units (PUs). We also recommended scaling the Event Hub to use many more partitions.
First, we considered the customer’s ingress/egress requirements for the EH. Their future EH ingress/egress rate of 110 MBps (9 TB/day ingress) was going to require the Event Hub Premium pricing plan because the Standard plan supports only up to 1 MBps ingress and 2 MBps egress per Throughput Unit (TU). Additionally, the Standard pricing plan limits TUs to 40, which equates to at most 40 MBps in and 80 MBps out, short of the 110 MBps target in both directions. The Premium plan also increases the maximum partitions to 100 and raises throughput limits to 10 MBps per PU on egress. Based on 10 MBps egress, Cribl calculated that a minimum of 11 partitions was required to achieve 110 MBps, but this figure was based solely on Azure docs. We can’t size the Event Hub based on only one side of the equation; we also have to factor in Cribl’s sizing guidelines for Stream.
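The plan comparison above comes down to a few lines of arithmetic. Here is a minimal sketch in Python that uses only the figures quoted in this post. The function and constant names are illustrative, and the unit conversion assumes binary units (1 TB = 1024 x 1024 MB), which is an assumption that happens to make 110 MBps come out to roughly 9 TB/day.

```python
import math

TARGET_MBPS = 110               # target rate from the sizing analysis
STANDARD_INGRESS_PER_TU = 1     # MBps ingress per Throughput Unit (Standard plan)
STANDARD_EGRESS_PER_TU = 2      # MBps egress per Throughput Unit (Standard plan)
STANDARD_MAX_TUS = 40           # TU cap on the Standard plan
EGRESS_MBPS_FOR_ESTIMATE = 10   # the 10 MBps egress figure used for the partition estimate


def standard_plan_ceiling(tus=STANDARD_MAX_TUS):
    """Return (max ingress, max egress) in MBps for the given number of TUs."""
    return tus * STANDARD_INGRESS_PER_TU, tus * STANDARD_EGRESS_PER_TU


def standard_plan_is_enough(target_mbps):
    """True only if both directions can carry the target rate at the TU cap."""
    ingress, egress = standard_plan_ceiling()
    return ingress >= target_mbps and egress >= target_mbps


def min_partitions(target_mbps, mbps_each=EGRESS_MBPS_FOR_ESTIMATE):
    """Smallest partition count that covers the target at mbps_each apiece."""
    return math.ceil(target_mbps / mbps_each)


def mbps_to_tb_per_day(mbps):
    """Convert MBps to TB/day, assuming binary units (1 TB = 1024 * 1024 MB)."""
    return mbps * 86_400 / (1024 * 1024)


if __name__ == "__main__":
    print("Standard plan ceiling (in, out):", standard_plan_ceiling())
    print("Standard plan sufficient:", standard_plan_is_enough(TARGET_MBPS))
    print("Minimum partitions:", min_partitions(TARGET_MBPS))
    print("Daily volume (TB):", round(mbps_to_tb_per_day(TARGET_MBPS), 2))
```

The Azure-side minimum is only a floor; it says nothing about how much each Stream worker process can absorb.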
Since the ARM guideline for each worker process is a recommended maximum of 2.84 MBps (240 GB/day) ingress, we would need to raise the topic’s partition quantity from 11, the minimum dictated by Azure’s parallelism, to about 40 in order to sufficiently distribute the load across worker processes for supporting 9 TB/day ingress while avoiding backpressure.
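The per-process guideline turns into a partition estimate the same way. This is a rough sketch rather than Cribl's actual sizing tool: the helper names are hypothetical, binary units are assumed (1 GB = 1024 MB), and it assumes the goal is at least one partition per worker process needed to absorb the ingress rate.

```python
import math

GB_PER_DAY_PER_PROCESS = 240   # recommended maximum ingress per worker process
SECONDS_PER_DAY = 86_400
TARGET_MBPS = 110              # target ingress rate from the sizing analysis


def gb_per_day_to_mbps(gb_per_day):
    """Convert GB/day to MBps, assuming 1 GB = 1024 MB."""
    return gb_per_day * 1024 / SECONDS_PER_DAY


def processes_needed(target_mbps, per_process_mbps):
    """Worker processes required so that none exceeds its ingress guideline."""
    return math.ceil(target_mbps / per_process_mbps)


if __name__ == "__main__":
    per_process = gb_per_day_to_mbps(GB_PER_DAY_PER_PROCESS)
    print("Per-process ingress (MBps):", round(per_process, 2))
    print("Processes needed:", processes_needed(TARGET_MBPS, per_process))
```

Under these assumptions the calculation lands at 39 processes, which matches the "about 40" partitions quoted above once rounded to a convenient number.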
Next, we factored in egress from Stream. Our customer is sending events to two destinations, which increased the partition count again: 9 TB/day of ingress translates to a minimum of 18 TB/day on egress, for a total of 27 TB/day (ingress + egress x 2). So the per-process ingress rate of 2.84 MBps is now reduced by 50% to allow for 2x volume at egress. The math now necessitates a total of 59 partitions in order to sufficiently spread data over worker processes to avoid saturating them.
Additionally, the events being collected were very large Microsoft Defender service messages that must be unrolled in a Stream pipeline. This process explodes the event count (a ratio of 1:50 or higher was frequently seen with their data) and therefore the subsequent processing requirements, so we added a few more partitions to spread the load even further among Stream processes and arrived at a round number of 70. We presented this figure to our customer; using this information, they configured their EH cluster with 80 partitions, then conducted additional tests. They achieved much higher throughput than at any previous point but were still shy of the requirement.
At this stage, Cribl Support attacked the problem from a different perspective to determine if the bottleneck was due to a much lower-level cause, such as TCP settings, that differed between the customer’s hybrid worker and the native Cribl.Cloud workers. We conducted many tests to isolate variables but were unable to determine the root cause of the throughput differences.
Support now determined we needed reinforcements in the form of Cribl.Cloud SREs so they could scale up the customer’s Cribl.Cloud environment in case there was an unexpected bottleneck at the infrastructure level. It had already been scaled up slightly a month or two prior, but we were re-evaluating the worker node quantity in case we had missed something. The SREs also assisted with troubleshooting from the infrastructure perspective. Although the SREs found some oddities in the network traffic between the Cribl.Cloud workers and Azure Event Hubs, they didn’t find anything that definitively and sensibly explained why throughput was higher for Azure cloud worker nodes than for AWS worker nodes. Some general theories suggested that networking (AWS <> Azure) might be at fault, but we still did not have specific and actionable answers. If we knew anything for certain at this time, it was everything the problem could not be; we still weren’t sure what the problem actually was.
At about two months in, the customer began escalating the issue. Cribl Support then enlisted the software engineering team to begin analyzing the code used for collecting from Event Hubs to see if there was a defect. The software engineers didn’t find any blatant bugs but made some modifications to expose more settings to see if tuning those would make any difference. They also changed the code to use the native Azure Event Hubs library in case the KafkaJS library currently in use was at fault. The engineers then began conducting what became numerous internal test runs to see how much of a throughput difference the newly tunable parameters and new library made. As a result of that testing, they began to see improvements.
The software engineering team’s results indicated that the Azure EH native library, combined with specific values for the newly exposed settings, achieved throughput about 3x our highest to date. This was exciting news, and the customer was happy to see more progress. However, the excitement was short-lived because those throughput results weren’t consistent once testing was expanded to other Azure regions, and especially between cloud providers. Our software engineers discovered drastic throughput differences not just between AWS and Azure but even between regions within Azure. Testing showed the highest throughput was achieved when worker nodes were deployed in the same region as the Event Hub, but a drastic reduction in throughput occurred using any region other than Azure Central. We couldn’t explain this. Azure documentation couldn’t explain this. Our software engineers had started uncovering a more systemic problem with Azure Event Hubs.
We knew we wouldn’t be able to solve this problem without help from Azure Support. One of our recent SRE hires reached out to some contacts he had at Microsoft; at the same time, our customer reached out to their Microsoft Support contacts. As a result, Cribl Support and Engineering were able to communicate with their counterparts in Azure services. We shared our test results with them but had mixed progress and little feedback from that first interaction. Subsequently, both our SRE resource and our customer pushed harder within Microsoft to get us in touch with different people who could make waves.
Within a few days, another call was scheduled with Microsoft Azure Support. We explained our test results again. This time we spoke to different people who better understood the problem but still couldn’t explain our results, so they initially tried to place the blame on network latency. But our tests showed that throughput was bad even between regions within Azure for which Azure’s latency chart showed latency should be minimal (i.e., < 10 ms). We convinced them to do their own testing. Our Support management was also adamant in requesting a follow-up meeting within two days that brought Microsoft, Cribl, and our mutual customer together to ensure continued progress. At that meeting, the Azure personnel stated they had performed their own tests and discovered the potential problem.
So what was the solution? Azure EH engineers stated they made a change to increase the TCP socket receive buffer size within the C# .NET code that runs the backend for all Event Hubs brokers. The default value they used for that server-side setting was very small. This change provides for much larger payloads across Azure regions and between Azure and other cloud providers. It makes each poll to the Kafka broker receive more data in response, which makes the communication more efficient, and it is necessary because our customer requires at least 110 MBps. Azure Support told us that this change would be applied to all of their Event Hubs customers in all namespaces and regions!
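To give a feel for why a receive buffer size matters, here is a small illustrative sketch in Python. It is not Azure's fix, which lives in their C# .NET backend and was not shared with us in code form. The first helper shows how a socket receive buffer is requested through the standard `SO_RCVBUF` option; the second applies the general TCP rule of thumb that a connection cannot move more than one buffer's worth of data per round trip. The buffer sizes and round-trip times used here are hypothetical, and this post does not confirm that this exact bound is the mechanism Azure observed.

```python
import socket


def open_socket_with_rcvbuf(requested_bytes):
    """Create a TCP socket and request a receive buffer of the given size.

    Returns the socket and the size the OS actually applied, which may
    differ from the request (some kernels round, double, or clamp it).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, effective


def window_limited_mbps(buffer_bytes, rtt_ms):
    """Upper bound on throughput (MBps, binary units) for one connection
    when at most buffer_bytes can be in flight per round trip."""
    bytes_per_second = buffer_bytes * 1000 / rtt_ms
    return bytes_per_second / (1024 * 1024)


if __name__ == "__main__":
    sock, effective = open_socket_with_rcvbuf(4 * 1024 * 1024)
    print("Effective receive buffer (bytes):", effective)
    sock.close()

    # Hypothetical values: a 64 KiB buffer versus a 4 MiB buffer at 10 ms RTT.
    print("64 KiB @ 10 ms:", window_limited_mbps(64 * 1024, 10), "MBps")
    print("4 MiB  @ 10 ms:", window_limited_mbps(4 * 1024 * 1024, 10), "MBps")
```

With these made-up numbers, a small buffer caps a single connection at a few MBps even at low latency, which is consistent with a buffer setting, rather than raw latency, being the limiting factor.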
Before this change, the best throughput we saw was about 188 MBps, and that was confined to a Stream worker node in Azure Central polling an Event Hub also located in Azure Central. With the backend change applied, the throughput increased to about 470 MBps, which was achieved across Cloud providers.
In summary, this was one of the semi-rare moments when three separate companies had to band together to achieve a common goal. We received amazing attention from Azure Support. On top of that, Cribl had numerous personnel engaged every day from Engineering, Support, PM, and SREs working together to arrive at a solution.
This problem intrigued us because we’d never seen anything like it before. We didn’t expect that our findings would result in backend changes to the Event Hubs service that could benefit every Event Hubs customer. But because our customer was affected, and it was such an interesting problem, we were motivated to find the root cause. We were also driven by our customer-first mentality because their success is our success. Our customer was patient and willing to work with us through the entire journey because they could see we were making progress through our daily updates via email, Slack, and phone; and, good news for us, they realized based on our hard facts that the root cause was on Microsoft’s side rather than Cribl’s. Another customer success story in the bag!