January 11, 2022
Losing data in motion when the downstream Destination is unreachable is a risk that Cribl Stream Persistent Queues (PQ) can help prevent. In this blog post, we’ll talk about how to configure and calculate PQ sizing to avoid disruption while the Destination is unreachable for a few minutes or a few hours.
The example follows a real-world architecture, in which we have 25 Cribl Stream Worker Nodes, each with 36 vCPUs and 900 GB of local SSD storage available for Persistent Queues.
How does Persistent Queuing work inside Stream? Under the covers, Stream Persistent Queuing is implemented at the Worker Process level: each Worker Process independently tracks its own failed connections and its own Persistent Queue sizing.
In-memory queuing is attempted first. Each Worker Process output has an in-memory queue that helps it absorb temporary imbalances between inbound and outbound data rates. For example, if there is an inbound burst of data, the output stores events in the queue and then sends them out at a rate the receiver can accept.
The filesystem queue is engaged only when Stream receives an error from the downstream Destination, at which point it starts storing the data on disk. In our case, we have 34 Worker Processes on each of our 25 Worker Nodes. For example, if Worker Process (WP) 18 cannot send data to the Destination, it writes its events to the filesystem PQ location; in the meantime, all the other WPs keep working as normal.
When the receiver is ready, the output starts draining the queues in first-in, first-out (FIFO) fashion. During the draining process, new events continue to be written to the queue until Stream has successfully shrunk it and the final file on disk can be flushed and removed. At that point, Stream goes back to fully in-memory processing.
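To make that engage-and-drain lifecycle more concrete, here is a minimal, hypothetical Python sketch (not Cribl’s actual implementation) of a per-Worker-Process output that buffers in memory, spills to a per-process disk queue when the receiver errors, and drains the on-disk files FIFO once the receiver recovers:

```python
import os
from collections import deque


class OutputWithPQ:
    """Hypothetical per-Worker-Process output: buffer in memory first, spill to a
    per-process disk queue when the receiver errors, drain it FIFO on recovery."""

    def __init__(self, send_fn, queue_dir, mem_limit=1000):
        self.send = send_fn          # callable that raises ConnectionError when the receiver is down
        self.queue_dir = queue_dir   # e.g. <queue path>/<worker id>/<output id>
        self.mem = deque()           # in-memory queue absorbs short bursts
        self.mem_limit = mem_limit
        self.seq = 0                 # strictly increasing queue-file identifier
        os.makedirs(queue_dir, exist_ok=True)

    def _spill(self):
        """Write the in-memory buffer to the next on-disk queue file."""
        path = os.path.join(self.queue_dir, f"{self.seq}.ndjson")
        with open(path, "w") as f:
            f.writelines(event + "\n" for event in self.mem)
        self.mem.clear()
        self.seq += 1

    def write(self, event):
        """Accept a new event: try to send, fall back to memory, then to disk."""
        self.mem.append(event)
        try:
            while self.mem:
                self.send(self.mem[0])
                self.mem.popleft()
        except ConnectionError:
            if len(self.mem) >= self.mem_limit:   # memory full: engage the filesystem queue
                self._spill()

    def drain(self):
        """Receiver is healthy again: replay disk files oldest-first, then the memory buffer."""
        for name in sorted(os.listdir(self.queue_dir), key=lambda n: int(n.split(".")[0])):
            path = os.path.join(self.queue_dir, name)
            with open(path) as f:
                for line in f:
                    self.send(line.rstrip("\n"))
            os.remove(path)           # final file flushed and removed
        while self.mem:
            self.send(self.mem.popleft())
```

Because each Worker Process owns its own instance of this logic, one process spilling to disk never affects the others, which matches the WP 18 scenario above.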
Files are stored in the directory the user specifies (in our case, /cribl/state/queues), and files are written out using the worker ID, the Destination output ID, and a strictly increasing unique identifier.
For example:
This naming scheme ensures that multiple instances on the same machine do not stomp on queue files stored in the same directory.
In the above example, we can see that once the file reached 1 MB in size, it changed from a tmp file to an ndjson file.
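As a rough illustration only (the exact file layout Cribl uses may differ), the following hypothetical helper shows how combining the worker ID, the output ID, and an increasing sequence number keeps queue files from colliding, and how an in-progress tmp file could be promoted to ndjson once it reaches the 1 MB size from the example above:

```python
import os

MAX_FILE_SIZE = 1 * 1024 * 1024  # 1 MB, matching the example above


def queue_file_path(base_dir, worker_id, output_id, seq, in_progress=True):
    """Hypothetical path builder: worker ID + output ID + increasing sequence number
    keep queue files from different Worker Processes and outputs separate."""
    ext = "ndjson.tmp" if in_progress else "ndjson"
    return os.path.join(base_dir, worker_id, output_id, f"{seq}.{ext}")


def maybe_finalize(base_dir, worker_id, output_id, seq):
    """Once an in-progress file reaches the max size, promote .tmp -> .ndjson."""
    tmp = queue_file_path(base_dir, worker_id, output_id, seq, in_progress=True)
    if os.path.exists(tmp) and os.path.getsize(tmp) >= MAX_FILE_SIZE:
        os.rename(tmp, queue_file_path(base_dir, worker_id, output_id, seq, in_progress=False))
        return seq + 1  # subsequent writes go to a fresh file
    return seq
```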
To enable persistent queueing, go to the Destination’s configuration page and set the Backpressure behavior control to Persistent Queue. This exposes the following additional controls, which we set with these values:
Using 25 Stream Worker Nodes, 36 vCPU each, and 900 GB SSD local storage for Persistent Queues as the available hardware, we made the following choices:
Queue file path: $CRIBL_HOME/state/queues
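A quick back-of-the-envelope calculation helps sanity-check the sizing. The sketch below reuses the hardware from this example (25 Worker Nodes, 34 Worker Processes each, 900 GB of SSD per node); the per-process ingest rate is an assumed placeholder you would replace with your own measured throughput:

```python
# Back-of-the-envelope PQ sizing for this example's hardware.
worker_nodes        = 25
procs_per_node      = 34      # Worker Processes per node
disk_per_node_gb    = 900     # local SSD reserved for Persistent Queues
ingest_per_proc_mbs = 2.0     # ASSUMPTION: MB/s ingested per Worker Process

# Disk each Worker Process can use if the 900 GB is split evenly.
disk_per_proc_gb = disk_per_node_gb / procs_per_node          # ~26.5 GB

# How long a Destination outage that allowance can absorb at the assumed rate.
seconds_of_buffer = disk_per_proc_gb * 1024 / ingest_per_proc_mbs
hours_of_buffer   = seconds_of_buffer / 3600

print(f"Per-process queue allowance: {disk_per_proc_gb:.1f} GB")
print(f"Outage the queue can absorb: {hours_of_buffer:.1f} hours "
      f"at {ingest_per_proc_mbs} MB/s per process")
```

At the assumed 2 MB/s per process, each Worker Process could ride out a Destination outage of roughly 3 to 4 hours before its queue allowance fills, which lines up with the “few minutes or a few hours” goal stated at the top of this post.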
How do we make sure that the Persistent Queues get engaged, store our events, and flush the stored data to the Destination? Stream allows us to see Persistent Queuing in action using the Monitoring page, as well as the internal logs. Navigating to Monitoring -> System -> Queues, we can see when the Destination engaged with Persistent Queues and when it flushed the data.
In addition, looking at the Destination’s Logs tab, we can see all the messages: connection error -> begin ... end backpressure -> complete flushing persistent queue.
Stream enables you to set Notifications when Persistent Queueing engages, or exceeds a configurable threshold. These Notifications can be sent to external systems (for example, if we want to send an email alert), or we can choose to display Notifications only within Stream’s Messages pane and internal logs.
To enable Notifications when Persistent Queues engage, go to the Destination’s modal page and select Notifications -> Add New. In the Condition drop-down, pick the Destination Backpressure Activated option. Note that the Default target: System Messages is always enabled. If desired, select Add target -> Create to configure sending Notifications to external systems as well.
Once the Persistent Queues have engaged, we can see these Notifications in Stream’s Messages pane:
In this post, we showed how Stream can help prevent the loss of data in motion. We also talked about how to configure and calculate Stream Persistent Queue sizing. We followed a real-world architecture in which we used 25 Cribl Stream Worker Nodes, each with a 900 GB SSD local drive, to avoid disruption while the Destination was unreachable for a few minutes to a few hours.
The fastest way to get started with Cribl Stream is to sign up at Cribl.Cloud. You can process up to 1 TB of throughput per day at no cost. Sign up and start using Stream within a few minutes.