Losing data in motion is a real risk when the downstream Destination becomes unreachable, and Cribl Stream Persistent Queues (PQ) can help prevent it. In this blog post, we’ll talk about how to configure and calculate PQ sizing to avoid disruption while the Destination is unreachable, whether for a few minutes or a few hours.
The example follows a real-world architecture, in which we have:
Processing: 25 Stream Worker Nodes, each with 36 vCPUs to process the data.
Storage: 25 Stream Worker Nodes, each with 900 GB SSD local storage available for Persistent Queuing.
Output: Stream does data reduction, and the output is 35 TB per day that we send to 120 Splunk Indexers. In addition, all metrics data is sent to a different Destination.
![image5-1](/_next/image/?url=https%3A%2F%2Fimages.ctfassets.net%2Fxnqwd8kotbaj%2F28f6rdM8u9TmUZypCjidZK%2F6ffbc292b7628c08e348fed519a1938c%2Fimage5-1.png&w=2048&q=75)
Persistent Queues Under the Hood
How does Persistent Queuing work inside Stream? Under the hood, Stream Persistent Queuing is implemented at the Worker Process level: each Worker Process independently tracks its own failed connections and manages its own Persistent Queue sizing.
In-memory queuing is attempted first. Each Worker Process output has an in-memory queue that helps it absorb temporary imbalances between inbound and outbound data rates. For example, if there is an inbound burst of data, the output stores events in the queue and then sends them out at a rate the receiver can sustain.
The filesystem queue comes into play only when Stream receives an error from the downstream Destination; at that point, the Worker Process starts storing the data on disk. In our case, we have 34 Worker Processes on each of our 25 Worker Nodes. For example, if Worker Process (WP) 18 cannot send data to the Destination, it writes its events to the filesystem PQ location. In the meantime, all the other WPs keep working as normal.
![image7-1](/_next/image/?url=https%3A%2F%2Fimages.ctfassets.net%2Fxnqwd8kotbaj%2F1h7QyddxL2CWAbTuuVH1vn%2F77fcb95e80d1a09b98c437e4aa15a17f%2Fimage7-1.png&w=1920&q=75)
When the receiver is ready, the output will start draining the queues in first in, first out (FIFO) fashion. During the draining process, new events continue to be written to the queue until Stream has successfully shrunk it and the final file on disk can be flushed and removed. At that point, Stream goes back to fully in-memory processing.
![image6-1](/_next/image/?url=https%3A%2F%2Fimages.ctfassets.net%2Fxnqwd8kotbaj%2FpmgoBLDum3eXMlK06ELLf%2F0f88ff0c87563457aaa61b28e1191ad6%2Fimage6-1.png&w=1920&q=75)
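To make the flow concrete, here is a minimal Python sketch of the behavior described above for a single Worker Process output. It is a conceptual model only, not Cribl's implementation: events buffer in memory while the Destination is healthy, spill to a filesystem-backed queue when the Destination errors, and drain FIFO once the receiver recovers.

```python
from collections import deque

class WorkerProcessOutput:
    """Toy model of one Worker Process output -- not Cribl's implementation."""

    def __init__(self):
        self.memory_queue = deque()   # absorbs short-lived rate imbalances
        self.disk_queue = deque()     # stands in for the ndjson queue files on disk
        self.destination_up = True

    def send(self, event):
        if not self.destination_up or self.disk_queue:
            # Destination is erroring, or we are still draining the disk queue:
            # keep appending to disk so events stay in order.
            self.disk_queue.append(event)
        else:
            # Normal case: buffer in memory and ship at the receiver's rate.
            self.memory_queue.append(event)

    def drain(self, batch_size=100):
        """Called when the receiver recovers: flush the disk queue FIFO, oldest first."""
        while self.destination_up and self.disk_queue and batch_size > 0:
            oldest = self.disk_queue.popleft()
            self.memory_queue.append(oldest)   # back onto the normal in-memory path
            batch_size -= 1
```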
What Is the Structure of Filesystem-Backed PQ?
Files are stored in the directory the user specifies (in our case, `/cribl/state/queues`), and files are written out using the worker ID, the Destination output ID, and a strictly increasing unique identifier. For example:
![image1-1](/_next/image/?url=https%3A%2F%2Fimages.ctfassets.net%2Fxnqwd8kotbaj%2F4Ct0XaBy4jPuBikV9wHDYI%2Fc81f155c45171f9ae156e6e6599e2ad9%2Fimage1-1.png&w=750&q=75)
This naming scheme ensures that multiple instances on the same machine do not stomp on queue files stored in the same directory.
In the above example, we can see that once a file reached the 1 MB maximum file size, it changed from a `.tmp` file to an `.ndjson` file.
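For illustration only, here is a small Python sketch of how such a rotation scheme could work. The directory layout, file-name pattern, and use of a rename are assumptions modeled on the screenshot above, not Cribl's exact on-disk format:

```python
import json
import os

class QueueWriter:
    """Illustrative filesystem-queue writer -- not Cribl's actual code."""

    MAX_FILE_SIZE = 1 * 1024 * 1024   # 1 MB, the default Max file size

    def __init__(self, queue_dir, worker_id, output_id):
        # Hypothetical layout: <queue_dir>/<worker_id>/<output_id>/
        self.dir = os.path.join(queue_dir, worker_id, output_id)
        os.makedirs(self.dir, exist_ok=True)
        self.seq = 0                  # strictly increasing identifier

    def write(self, event):
        tmp_path = os.path.join(self.dir, f"{self.seq}.ndjson.tmp")
        with open(tmp_path, "a") as f:
            f.write(json.dumps(event) + "\n")       # newline-delimited JSON
        if os.path.getsize(tmp_path) >= self.MAX_FILE_SIZE:
            final_path = os.path.join(self.dir, f"{self.seq}.ndjson")
            os.rename(tmp_path, final_path)         # rotate: .tmp -> .ndjson
            self.seq += 1                           # next file gets a higher number

# Example: one writer per Worker Process / Destination pair (IDs are made up)
writer = QueueWriter("/cribl/state/queues", "wp18", "splunk_indexers")
writer.write({"_raw": "example event", "source": "demo"})
```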
Persistent Queues Sizing: What Did the Configuration from Cribl Stream to Splunk Look Like?
To enable persistent queueing, go to the Destination’s configuration page and set the Backpressure behavior control to Persistent Queue. This exposes the following additional controls, which we set with these values:
![image13](/_next/image/?url=https%3A%2F%2Fimages.ctfassets.net%2Fxnqwd8kotbaj%2F2253Xh369hJnPOYKHTZPKk%2F317ad8d65c2a934b8307fae7af6da4ef%2Fimage13.png&w=750&q=75)
Why Have We Decided to Use These Settings?
Using 25 Stream Worker Nodes, 36 vCPU each, and 900 GB SSD local storage for Persistent Queues as the available hardware, we made the following choices:
Max file size: 1 MB
1 MB is the default maximum file size, and we did not see a good reason to change it.
Max queue size: 25 GB
This setting should be read as “Maximum queue size per Worker Process.”
Since we have 36 vCPUs per Worker Node, we used 34 Worker Processes on each, reserving 2 vCPUs for Stream itself.
The hardware we used included 900 GB of SSD local storage per node. We calculated 900 GB (disk) / 34 (Worker Processes) ≈ 26 GB per process. To make sure we do not consume all the disk space, we chose a Max queue size of 25 GB.
25 GB per Worker Process means we will use, at most, 850 GB of disk space per Worker Node.
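As a quick sanity check, here is the arithmetic behind the 25 GB choice, using the same figures from our architecture above:

```python
node_disk_gb = 900                       # SSD local storage per Worker Node
vcpus = 36
worker_processes = vcpus - 2             # 2 vCPUs reserved for Stream itself -> 34 WPs

theoretical_per_wp_gb = node_disk_gb / worker_processes   # ~26.5 GB per Worker Process
max_queue_size_gb = 25                                    # rounded down to leave disk headroom
per_node_usage_gb = max_queue_size_gb * worker_processes  # 850 GB used at most per node

print(f"{theoretical_per_wp_gb:.1f} GB theoretical max, {per_node_usage_gb} GB max used per node")
```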
Queue file path:
$CRIBL_HOME/state/queues
This is the default queue file path, and we did not see a good reason to change it.
Compression: None
Gzip would let us queue more data in the same disk space, but it would also add time to compress the data on its way to disk and to decompress it when draining. So, we decided not to use compression; the SSDs already let us read and write events to disk very quickly.
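As a rough illustration of that trade-off, the snippet below gzips a synthetic ndjson payload and times it. The event shape and volume are made up, so treat the ratio and timing as directional only:

```python
import gzip
import json
import time

# Synthetic ndjson payload: 50,000 small JSON events (shape and size are made up).
events = [{"host": f"web-{i % 100}", "status": 200, "bytes": 512 + i % 1024}
          for i in range(50_000)]
payload = ("\n".join(json.dumps(e) for e in events) + "\n").encode()

start = time.perf_counter()
compressed = gzip.compress(payload)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"raw: {len(payload) / 1e6:.1f} MB, gzip: {len(compressed) / 1e6:.1f} MB, "
      f"ratio: {len(payload) / len(compressed):.1f}x, time: {elapsed_ms:.0f} ms")
```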
Queue-full behavior: Drop new data
Using 25 Stream Worker Nodes x 850 GB of disk storage, we get about 21 TB of total disk space for Persistent Queuing. The daily output to Splunk is 35 TB, which means that in this case Cribl can absorb about 14 hours of Splunk downtime (the full arithmetic is sketched at the end of this section).
Once the queue is full, we decided to drop new incoming data. For our use case, we had one additional Destination, and choosing Drop new data means that the other Destination keeps receiving data. Had we instead used the Block option, all data flowing into Stream would stop once the queue filled up.
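Putting the sizing calculation together, here is the end-to-end arithmetic referenced above, using the same numbers from this post:

```python
worker_nodes = 25
per_node_queue_gb = 25 * 34        # 25 GB per Worker Process x 34 processes = 850 GB per node

total_queue_tb = worker_nodes * per_node_queue_gb / 1000   # ~21 TB across the deployment
daily_output_tb = 35                                       # daily volume sent to Splunk

downtime_hours = total_queue_tb / daily_output_tb * 24
print(f"{total_queue_tb:.2f} TB of queue -> about {downtime_hours:.1f} hours of Splunk downtime")
# 21.25 TB of queue -> about 14.6 hours of Splunk downtime
```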
Persistent Queues Monitoring and Notification
How do we make sure that the Persistent Queues engage, store our events, and flush the stored data to the Destination? Stream lets us see Persistent Queuing in action on the Monitoring page, as well as in the internal logs. Navigating to Monitoring -> System -> Queues, we can see when the Destination engaged Persistent Queues and when the queued data was flushed.
![image9](/_next/image/?url=https%3A%2F%2Fimages.ctfassets.net%2Fxnqwd8kotbaj%2F1ZYDAyzgNgWxuTHDheMLEP%2Fb1e69f1dfb7424a11d7a1d2ba36a0f8b%2Fimage9.png&w=3840&q=75)
In addition, looking at the Destination’s Logs tab, we can see all the messages: connection error -> begin ... end backpressure -> complete flushing persistent queue.
![image8-1](/_next/image/?url=https%3A%2F%2Fimages.ctfassets.net%2Fxnqwd8kotbaj%2F7lcMYukIo9hGVOSssCMgpY%2Fd3da1c099b883385538839790e38df74%2Fimage8-1.png&w=750&q=75)
Can We Be Notified When Persistent Queuing Is Engaged?
Stream enables you to set Notifications when Persistent Queuing engages or exceeds a configurable threshold. These Notifications can be sent to external systems (for example, if we want an email alert), or we can choose to display them only within Stream’s Messages pane and internal logs.
To enable Notifications when Persistent Queues engage, go to the Destination’s configuration modal and select Notifications -> Add New. In the Condition drop-down, pick the Destination Backpressure Activated option. Note that the Default target: System Messages is always enabled. If desired, select Add target -> Create to configure sending Notifications to external systems as well.
![image12](/_next/image/?url=https%3A%2F%2Fimages.ctfassets.net%2Fxnqwd8kotbaj%2F2HJC5wZ0NrmksLA4o2TFJF%2F74ae291657e09e81f12cdad3bcf330fd%2Fimage12.png&w=3840&q=75)
Once the Persistent Queues have engaged, we can see these Notifications in Stream’s Messages pane:
![image11](/_next/image/?url=https%3A%2F%2Fimages.ctfassets.net%2Fxnqwd8kotbaj%2F1wr0ZRihyoc8ehXh8OhmnE%2Fda358620641223c2429ebc0ee3c3ad96%2Fimage11.png&w=1080&q=75)
Persistent Queuing to the Rescue
In this post, we showed how Stream can help prevent the loss of data in motion. We also talked about how to configure and calculate Stream Persistent Queues sizing. We followed a real-world architecture in which we used 25 Cribl Stream Worker Nodes, each with a 900 GB SSD local drive, to avoid disruption while the Destination was unreachable for a few minutes to a few hours.
The fastest way to get started with Cribl Stream is to sign up at Cribl.Cloud. You can process up to 1 TB of throughput per day at no cost. Sign up and start using Stream within a few minutes.