How Does Persistent Queuing Work Inside Cribl Stream?

July 31, 2023

Written by Raanan Dagan

During his many years at Cribl, Splunk, Cloudera, and Oracle, Raanan was part of multiple implementations of security, analytics, cloud, open-source, and IT use cases, as well as big data and data lake projects in complex environments. He is a global resource with 30 years of experience building large data clusters, and he has helped thousands of customers, including some who ingest several hundred terabytes per day and store multiple petabytes of data.

Categories: Cribl Stream, Engineering


Data in motion is at risk of loss whenever the downstream Destination is unreachable, and Cribl Stream Persistent Queues (PQ) can help prevent that loss. In this blog post, we'll talk about how to configure PQ and calculate its sizing so that a Destination outage lasting a few minutes or a few hours does not cause disruption.

The example follows a real-world architecture, in which we have:

Processing: 25 Cribl Stream Worker Nodes, each with 36 vCPUs to process the data.
Storage: 25 Cribl Stream Worker Nodes, each with 900 GB of local SSD storage available for Persistent Queuing.
Output: Cribl Stream performs data reduction, and the resulting 35 TB per day is sent to 120 Splunk Indexers. In addition, all metrics data is sent to a different Destination.

Destination Persistent Queues Under the Hood

Under the hood, Cribl Stream implements Persistent Queuing at the Worker Process level. Each Worker Process tracks its own failed connections and persistent queue size independently of the others.

Worker Processes attempt in-memory queuing first. Each Worker Process output has an in-memory queue that helps it absorb temporary imbalances between inbound and outbound data rates. For example, if there is an inbound burst of data, the output stores events in the queue and then sends them at the rate the receiver can keep up with.
The filesystem queue engages only when Cribl Stream receives an error from the downstream Destination; at that point, Cribl Stream starts storing the data on disk.
In our case, we have 34 Worker Processes in each of our 25 Worker Nodes. If, for example, Worker Process (WP) 18 cannot send data to the Destination, it writes its events to the filesystem PQ location. Meanwhile, all the other WPs continue working normally.
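To make the per-Worker-Process behavior concrete, here is a minimal toy model, assuming a simple "send up to N events per tick" receiver. The class `WorkerProcessOutput`, its methods, and the `rate` parameter are hypothetical and are not Cribl Stream code. It shows the two points above: the in-memory queue absorbs a burst, and only a Destination error pushes events to disk, for that one Worker Process alone. Draining is left out here.

```python
from collections import deque


class WorkerProcessOutput:
    """Toy model of one Worker Process's Destination output (not Cribl code)."""

    def __init__(self, wp_id: int) -> None:
        self.wp_id = wp_id
        self.memory: deque = deque()  # in-memory queue, always tried first
        self.disk: deque = deque()    # stands in for the filesystem PQ files
        self.delivered: list = []

    @property
    def pq_engaged(self) -> bool:
        return bool(self.disk)

    def receive(self, event: str) -> None:
        # Once PQ is engaged, new events line up behind the queued ones on disk.
        if self.pq_engaged:
            self.disk.append(event)
        else:
            self.memory.append(event)

    def flush(self, destination_ok: bool, rate: int) -> None:
        if not destination_ok:
            # Only an error from the Destination moves events to disk.
            self.disk.extend(self.memory)
            self.memory.clear()
            return
        # Burst absorption: send at most `rate` events; the rest wait in memory.
        for _ in range(min(rate, len(self.memory))):
            self.delivered.append(self.memory.popleft())


# Two Worker Processes on the same node; only WP 18 sees Destination errors.
outputs = {wp: WorkerProcessOutput(wp) for wp in (17, 18)}
for wp, out in outputs.items():
    for i in range(5):
        out.receive(f"wp{wp}-e{i}")
outputs[17].flush(destination_ok=True, rate=3)
outputs[18].flush(destination_ok=False, rate=3)
outputs[18].receive("wp18-e5")  # lands on disk because PQ is engaged
print(outputs[17].delivered, len(outputs[17].memory), outputs[18].pq_engaged, len(outputs[18].disk))
# ['wp17-e0', 'wp17-e1', 'wp17-e2'] 2 True 6
```

WP 17 keeps two events in memory for the next tick, while WP 18 has moved everything, including the event that arrived after the error, to its own disk queue.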

When the receiver is ready, the output will start draining the queues in FIFO (First In, First Out) fashion.

During the draining process, new events will continue to be written to the queue until Cribl Stream has successfully shrunk the queue, and the final file on disk can be flushed and removed. At that point, Cribl Stream goes back to fully in-memory processing.

Alternatively, if Strict ordering is disabled, Cribl Stream prioritizes new events over draining the queue. This resembles LIFO (Last In, First Out) ordering, because the newest events are delivered ahead of the older, queued ones.
In this mode, throttling the queue's drain rate (the Drain rate limit setting discussed below) can boost the throughput of new and active connections by reserving more resources for them.
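The difference between the two draining behaviors can be sketched as a single "tick" of a toy scheduler. This is an illustration under stated assumptions, not Cribl's scheduler: `drain_tick` is a hypothetical function, `capacity` assumes the receiver accepts a fixed number of events per tick, and `drain_limit` stands in for the Drain rate limit.

```python
from collections import deque


def drain_tick(queue: deque, new_events: list, capacity: int,
               strict_ordering: bool, drain_limit: int) -> list:
    """Return the events sent during one tick of a toy drain model.

    capacity    -- events the receiver accepts per tick (assumed constant)
    drain_limit -- cap on queued events sent per tick when strict ordering is
                   off; a stand-in for the Drain rate limit, not Cribl's code
    """
    if strict_ordering:
        # FIFO: new events join the back of the queue, and the oldest go first.
        queue.extend(new_events)
        return [queue.popleft() for _ in range(min(capacity, len(queue)))]
    # Strict ordering off: new events are sent first ...
    sent = list(new_events[:capacity])
    queue.extend(new_events[capacity:])  # anything that doesn't fit waits in the queue
    # ... and the backlog drains with the leftover capacity, capped by drain_limit.
    budget = min(drain_limit, capacity - len(sent), len(queue))
    sent.extend(queue.popleft() for _ in range(budget))
    return sent


q1 = deque(["q0", "q1", "q2", "q3"])
print(drain_tick(q1, ["n0", "n1"], capacity=4, strict_ordering=True, drain_limit=1))
# ['q0', 'q1', 'q2', 'q3']
q2 = deque(["q0", "q1", "q2", "q3"])
print(drain_tick(q2, ["n0", "n1"], capacity=4, strict_ordering=False, drain_limit=1))
# ['n0', 'n1', 'q0']
```

With strict ordering, the new events wait behind the whole backlog; without it, they go out immediately while the backlog shrinks by at most `drain_limit` per tick.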

Source Always-On Persistent Queues Under the Hood

With Always-On mode, PQ always writes events directly to the queue before forwarding them to the processing engine.

All events are written to disk as the Cribl Stream Source receives them.
Since every event must be written to disk and then read back from disk, this option adds load and latency to the system.
Cribl recommends this option for Sources such as Syslog over UDP. UDP provides no flow control or retransmission, so events that are not captured on arrival are lost. However, because of the overhead of always going to disk, this option might not be optimal for most other Sources.
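The disk round trip that makes Always-On mode costly can be shown with a toy model that simply counts disk operations. `AlwaysOnSourcePQ` is a hypothetical class, and an in-memory `deque` stands in for the on-disk queue; it is a sketch of the behavior described above, not Cribl's implementation.

```python
from collections import deque


class AlwaysOnSourcePQ:
    """Toy model of Always-On Source PQ: every event takes a round trip via disk."""

    def __init__(self) -> None:
        self.disk: deque = deque()  # stands in for the on-disk queue
        self.disk_writes = 0
        self.disk_reads = 0

    def on_event(self, event: str) -> None:
        # Written to disk as soon as the Source receives it, no matter what.
        self.disk.append(event)
        self.disk_writes += 1

    def forward(self, max_events: int) -> list:
        # The processing engine only ever sees events read back from disk.
        out = []
        while self.disk and len(out) < max_events:
            out.append(self.disk.popleft())
            self.disk_reads += 1
        return out


pq = AlwaysOnSourcePQ()
for i in range(1000):
    pq.on_event(f"syslog-{i}")
forwarded = pq.forward(max_events=1000)
print(len(forwarded), pq.disk_writes, pq.disk_reads)  # 1000 1000 1000
```

Every event costs one write and one read, even when the processing engine is idle; that is the overhead traded for never missing a UDP datagram.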

Source Smart-mode Persistent Queues Under the Hood

With Smart mode, PQ will write events to the filesystem only when it detects backpressure from the processing engine.

Conditions that can cause PQ in Smart mode to engage include:

When the Worker Process is experiencing delays due to heavy processing.
When the Destination is causing delays and backpressure.

When the backpressure clears, Cribl Stream starts draining the queue in FIFO (First In, First Out) fashion.
During the draining process, new events continue to be written to the queue until Cribl Stream has successfully shrunk the queue and the final file on disk can be flushed and removed. At that point, Cribl Stream goes back to fully in-memory processing.
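Smart mode can be sketched the same way. In this toy model (the `SmartSourcePQ` class and its `engine_backpressure` flag are hypothetical, not Cribl APIs), events bypass the disk entirely unless the engine is pushing back, and once a backlog exists, new events keep going to disk until the backlog is drained, as described above.

```python
from collections import deque
from typing import Optional


class SmartSourcePQ:
    """Toy model of Smart-mode Source PQ: disk is used only under backpressure."""

    def __init__(self) -> None:
        self.disk: deque = deque()  # stands in for the on-disk queue
        self.disk_writes = 0

    def on_event(self, event: str, engine_backpressure: bool) -> Optional[str]:
        # Spill to disk while the engine pushes back, and also while a backlog
        # remains, so new events keep landing behind older ones during draining.
        if engine_backpressure or self.disk:
            self.disk.append(event)
            self.disk_writes += 1
            return None
        return event  # handed straight to the processing engine, in memory

    def drain(self, max_events: int) -> list:
        # FIFO: the oldest queued events come out first.
        out = []
        while self.disk and len(out) < max_events:
            out.append(self.disk.popleft())
        return out


pq = SmartSourcePQ()
passed = [pq.on_event(f"e{i}", engine_backpressure=False) for i in range(10)]
for i in range(10, 13):
    pq.on_event(f"e{i}", engine_backpressure=True)
pq.on_event("e13", engine_backpressure=False)  # still queued: backlog not empty
drained = pq.drain(max_events=100)
print(pq.disk_writes, drained, pq.on_event("e14", engine_backpressure=False))
# 4 ['e10', 'e11', 'e12', 'e13'] e14
```

Only 4 of 15 events touched the disk, which is why Smart mode is the lighter-weight choice for Sources that can tolerate it.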

What Did the Configuration From Cribl Stream to Splunk Look Like?

To enable persistent queuing, go to the Destination's configuration page and set the Backpressure behavior control to Persistent Queue. This exposes the following additional controls, which we set to these values:

Max file size: 1 MB
Max queue size: 25 GB
Queue file path: $CRIBL_HOME/state/queues
Compression: None
Queue-full behavior: Drop new data

Why Have We Decided to Use These Settings?

Using 25 Cribl Stream Worker Nodes, 36 vCPU each, and 900 GB SSD local storage for Persistent Queues as the available hardware, we made the following choices:

Max file size: 1 MB
- 1 MB is the default maximum file size, and we did not see a good reason to change it.
Max queue size: 25 GB
- This setting should be read as “Maximum queue size per Worker Process.”
- Since we have 36 vCPUs per Worker Node, we used 34 Worker Processes on each, reserving 2 vCPUs for Cribl Stream itself.
- The hardware we used included 900 GB of local SSD storage. We calculated 900 GB (disk) / 34 (WPs) = about 26 GB per Worker Process. To make sure we do not consume all the disk space, we chose a Max queue size of 25 GB.
- 25 GB per Worker Process means we will use, at most, 850 GB of disk space per Worker Node.
Queue file path: $CRIBL_HOME/state/queues
- This is the default queue file path, and we did not see a good reason to change it.
Compression: None
- Gzip would let us fit more data into the same disk space, but it would also take extra time to compress data on its way to disk and to decompress it when draining. So, we decided not to use compression.
- Because the queues sit on SSD, reading and writing uncompressed events to disk is very fast.
Queue-full behavior: Drop new data
- Using 25 Cribl Stream Worker Nodes x 850 GB of disk storage, we get about 21 TB (21,250 GB) of total disk space for Persistent Queuing.
- The daily output to Splunk is 35 TB, or roughly 1.46 TB per hour.
- That means that in this case, assuming the output is spread evenly across the day, Cribl Stream can absorb about 14 hours of Splunk downtime (21.25 TB / 1.46 TB per hour is a little over 14 hours).
- Once the queue is full, we decided to drop new incoming data. For our use case, we had one additional Destination. Using the Queue-full behavior: Drop new data option means that the other Destination continues to receive data. Had we instead used the Block option, all data flowing into Cribl Stream would stop once the queue filled up.
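The sizing arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming decimal units (1 TB = 1,000 GB, as in the totals above), an output rate spread evenly over 24 hours, and that bytes queued on disk roughly equal bytes sent to Splunk (no compression). `size_pq` and `PQSizing` are hypothetical helpers, not part of Cribl Stream.

```python
from dataclasses import dataclass


@dataclass
class PQSizing:
    per_wp_ceiling_gb: float  # disk per node split evenly across Worker Processes
    per_node_gb: float        # worst case: every WP's queue is full
    total_tb: float           # across all Worker Nodes (1 TB = 1,000 GB)
    downtime_hours: float     # how long the queues can absorb a full outage


def size_pq(nodes: int, disk_gb_per_node: float, wps_per_node: int,
            max_queue_gb_per_wp: float, daily_output_tb: float) -> PQSizing:
    # Max queue size applies per Worker Process, so multiply by the WP count.
    per_node = max_queue_gb_per_wp * wps_per_node
    if per_node > disk_gb_per_node:
        raise ValueError("Max queue size x Worker Processes exceeds the local disk")
    total_tb = nodes * per_node / 1000
    hourly_tb = daily_output_tb / 24  # assumes output is spread evenly over the day
    return PQSizing(disk_gb_per_node / wps_per_node, per_node, total_tb,
                    total_tb / hourly_tb)


s = size_pq(nodes=25, disk_gb_per_node=900, wps_per_node=34,
            max_queue_gb_per_wp=25, daily_output_tb=35)
print(f"{s.per_wp_ceiling_gb:.1f} GB/WP ceiling, {s.per_node_gb:.0f} GB/node, "
      f"{s.total_tb:.2f} TB total, ~{s.downtime_hours:.1f} h of downtime")
# 26.5 GB/WP ceiling, 850 GB/node, 21.25 TB total, ~14.6 h of downtime
```

Plugging in a 27 GB Max queue size instead raises a `ValueError`, because 27 GB x 34 WPs would exceed the 900 GB disk.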

What Is an Appropriate Value for the Drain Rate Limit (EPS)?

We recommend that you start with roughly 5% of the Events Per Second (EPS) throughput rate. If the Persistent Queue is not draining fast enough at that value, increase it.

Steps to find the Events In EPS throughput rate:

On Cribl Stream's Monitor -> Overview page, you will see the Events In and Events Out display.
Change the Monitor page to a single Worker, and change the display's time range from the default of the last 5 minutes to the last 24 hours.
In the Events In and Out display, find the Thruput In (AVG) number. For example, 175k EPS.
Take roughly 5% of that number: 5% of 175k EPS is 8,750 EPS, so a Drain rate limit of approximately 8,000 EPS is a reasonable starting point.
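The rule of thumb can be written down as a small helper, plus a hypothetical adjustment step for the "increase it if it's too slow" advice. The function names, the use of successive queue-size samples, the "not shrinking means too slow" test, and the 1.5x growth factor are all assumptions for illustration; they are not Cribl settings or recommendations.

```python
from typing import Sequence


def starting_drain_rate(avg_eps_in: int, percent: int = 5) -> int:
    """Start at roughly `percent`% of one Worker's average inbound EPS."""
    return avg_eps_in * percent // 100


def next_drain_rate(current: int, queue_sizes: Sequence[int], step: float = 1.5) -> int:
    """Raise the limit if the queue did not shrink between successive samples.

    `queue_sizes` are hypothetical readings (e.g. bytes) taken while draining;
    `step` is an illustrative growth factor, not a Cribl setting.
    """
    shrinking = all(b < a for a, b in zip(queue_sizes, queue_sizes[1:]))
    return current if shrinking else int(current * step)


rate = starting_drain_rate(175_000)
print(rate)                                     # 8750
print(next_drain_rate(rate, [500, 400, 300]))   # 8750: queue is draining
print(next_drain_rate(rate, [500, 520, 530]))   # 13125: queue is growing
```

In practice you would watch the queue on the Monitoring pages and adjust the Drain rate limit by hand; the helper only captures the decision rule.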

Persistent Queues on the Filesystem, Monitoring, and Notification

What Is the Structure of Filesystem-Backed PQ?

Files are stored in the directory you specify (in our case, /cribl/state/queues). Each file's location and name are derived from the Worker ID, the Destination output ID, and a strictly increasing unique identifier.

For example:

/cribl/state/queues/0/splunklocal

1049897 Nov 11 00:56 queue.0.ndjson
1048600 Nov 11 00:57 queue.1049897.ndjson
887534 Nov 11 00:59 queue.2098497.ndjson.tmp

This naming scheme ensures that multiple Worker Processes and Destinations on the same machine do not overwrite each other's queue files, even though they share the same base directory.

In the above example, we can see that once a file reached the 1 MB Max file size, it changed from a .tmp file to a .ndjson file. The newest file, which is still below 1 MB, keeps the .tmp suffix.
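A short script can inspect a listing like the one above. This sketch assumes the layout `<base>/<worker ID>/<output ID>` and the `queue.<id>.ndjson[.tmp]` names, both read off this single example. It also checks an observation from the listing rather than a documented guarantee: each identifier equals the combined size of the files before it (0, then 1049897, then 1049897 + 1048600 = 2098497), so it behaves like a running byte offset. Sorting by identifier gives the FIFO drain order. All function names are hypothetical.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath

# "<size> <Mon> <day> <HH:MM> queue.<id>.ndjson[.tmp]", as in the listing above
LINE = re.compile(r"^(\d+)\s+\w{3}\s+\d+\s+\d\d:\d\d\s+(queue\.(\d+)\.ndjson(\.tmp)?)$")


@dataclass
class QueueFile:
    size: int
    name: str
    ident: int
    in_progress: bool  # .tmp: the file still being written


def queue_dir(base: str, worker_id: int, output_id: str) -> PurePosixPath:
    # Assumed layout: one directory per Worker ID, then per Destination output ID.
    return PurePosixPath(base) / str(worker_id) / output_id


def parse_listing(lines: list) -> list:
    files = []
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            files.append(QueueFile(int(m.group(1)), m.group(2),
                                   int(m.group(3)), m.group(4) is not None))
    return sorted(files, key=lambda f: f.ident)  # ascending id = FIFO order


def ids_match_running_offsets(files: list) -> bool:
    offset = 0
    for f in files:
        if f.ident != offset:
            return False
        offset += f.size
    return True


listing = [
    "1049897 Nov 11 00:56 queue.0.ndjson",
    "1048600 Nov 11 00:57 queue.1049897.ndjson",
    "887534 Nov 11 00:59 queue.2098497.ndjson.tmp",
]
files = parse_listing(listing)
print(queue_dir("/cribl/state/queues", 0, "splunklocal"))
print([f.name for f in files if f.in_progress], ids_match_running_offsets(files))
```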

How Do We Make Sure That the Persistent Queues Get Engaged, Store Our Events, and Flush the Stored Data to the Destination?

Cribl Stream allows us to see Persistent Queuing in action using the Monitoring page, as well as the internal logs.

Navigating to Monitoring -> System -> Queues, we can see when the Destination engaged Persistent Queues and when it flushed the data.

In addition, looking at the Destination’s Logs tab, we can see all the messages:

connection error -> begin … end backpressure -> complete flushing persistent queue.

Can We Be Notified When Persistent Queuing Is Engaged?

Cribl Stream enables you to set Notifications when Persistent Queuing engages or exceeds a configurable threshold. These Notifications can be sent to external systems (for example, if we want to send an email alert), or we can choose to display them only within Cribl Stream's Messages pane and internal logs.

To enable Notifications when Persistent Queues engage, open the Destination's configuration modal and select Notifications -> Add New. In the Condition drop-down, pick the Destination Backpressure Activated option. Note that the default target, System Messages, is always enabled. If desired, select Add target -> Create to configure sending Notifications to external systems as well.
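The two Notification conditions described above (PQ engaged, and PQ past a threshold) can be illustrated with a hypothetical per-Worker-Process check. This is not how Cribl Stream evaluates Notifications; `pq_notifications`, its inputs, and the message text are invented for the sketch. It evaluates each Worker Process separately because, as described earlier, each one manages its own queue.

```python
from typing import Dict, List


def pq_notifications(queue_bytes_by_wp: Dict[int, int], max_queue_bytes: int,
                     threshold_pct: float) -> List[str]:
    """Hypothetical per-Worker-Process check mirroring the two conditions above."""
    messages = []
    for wp, used in sorted(queue_bytes_by_wp.items()):
        if used == 0:
            continue  # PQ not engaged for this Worker Process
        pct_full = 100 * used / max_queue_bytes
        if pct_full >= threshold_pct:
            messages.append(f"WP {wp}: persistent queue {pct_full:.0f}% full")
        else:
            messages.append(f"WP {wp}: persistent queue engaged")
    return messages


GB = 10**9
print(pq_notifications({17: 0, 18: 5 * GB, 19: 20 * GB},
                       max_queue_bytes=25 * GB, threshold_pct=75))
# ['WP 18: persistent queue engaged', 'WP 19: persistent queue 80% full']
```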

Once the Persistent Queues have engaged, we can see these Notifications in Cribl Stream's Messages pane.

Persistent Queuing to the Rescue

Cribl Stream's persistent queuing (PQ) feature helps minimize data loss if a downstream receiver is unreachable. PQ provides durability by writing data to disk for the duration of the outage and forwarding it upon recovery. More detailed documentation is available on the Cribl Docs site, which is completely ungated and free to access.

Cribl, the Data Engine for IT and Security, empowers organizations to transform their data strategy. Customers use Cribl’s suite of products to collect, process, route, and analyze all IT and security data, delivering the flexibility, choice, and control required to adapt to their ever-changing needs.

We offer free training, certifications, and a generous free usage plan across our products. Our community Slack features Cribl engineers, partners, and customers who can answer your questions as you get started. We also offer a hands-on Sandbox for those interested in how companies globally leverage our products for their data challenges.
