
How To Replay Events from Object Storage into Your Analytics Systems using Cribl Stream

September 28, 2021
Categories: Engineering

The Replay functionality in Cribl Stream fundamentally changes how organizations manage data by providing an easy way to selectively ingest and re-ingest data into systems of analysis. In addition to a little personal history about my introduction to Replay, we will walk through how to use this pioneering feature step by step.

Big Data Dreams – a Brief History of Replay

When the Cribl founders first described their plans for the feature to me in 2019, I was already sold on the incredible filtering, routing, and transformation features in Stream. Then, they started discussing how great it would be if I could stream less-vital events to an object store, bypassing my expensive Splunk storage. Or make my Splunk retention time very short, but allow easy replay of expelled events. None of that bucket-thawing spectacle. It seemed like a lofty goal. In the world of log analysis, eating your cake and having it too is just not the norm. I was skeptical.

“Yeah, sure. But what about when you need it?” I contested, “Re-ingestion is a pain in the… ” They interrupted me: “You’ll be able to selectively and easily replay data you want, back out of that object store into Splunk, any time you want, just as if it were arriving in the system fresh the first time. No custom code required. You’ll be able to specify – by host, source, time, or even raw log content – which events would be rejuvenated.”

“Not only that,” they continued, “your object store is much cheaper to dump data into than the SSDs in your Splunk cluster. You could simply dump a copy of all incoming data to an object store, and have longer retention for all your event data. Then, if data ages out of your log analysis platform, you could replay it in the same manner if it became vital to have again. And, hey, while we’re at it, you could do this via API calls, to automate event replay. Picture an event triggering an alert in your SIEM, which then makes an API call to Stream, to replay all the events related to the IP address in the trigger event for the last 30 days.”

Sure, why not. It will make me an omelet in the morning, too. I mean, it was their dream. So make it as big as you want, guys.

Fast-forward not even one year later, and they delivered the Replay feature in Stream. It’s real, and it’s fantastic.

How S3 Storage and Replay Works

For brevity, we’ll refer to the storage destination here as S3, although it could just as easily be MinIO or any of many other options.

Format Considerations

You can write data out of Stream in either of two formats—JSON or raw:

JSON: The parsed event, with all the metadata and modifications it contains when it reaches the Destination, is wrapped in a JSON object. Each event is one line; we refer to this as newline-delimited JSON (aka NDJSON).
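For example, syslog data written with the JSON format option might land as a single line like this (an illustrative sketch; the field names and values here are assumptions, and the exact set depends on your Sources and Pipelines):

`{"_raw":"<134>Sep 28 13:22:01 web01 sshd[4321]: Accepted publickey for deploy from 10.0.0.5 port 51234 ssh2","_time":1632835320,"host":"web01","source":"/var/log/auth.log","sourcetype":"syslog","index":"main"}`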

Raw: The content of the event’s _raw field, unparsed at the time it reaches the Destination, is written out in plain text. Each event is one line.
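The same syslog event, unmodified and unparsed, written with the raw option is just the original line (again, an illustrative sketch):

`<134>Sep 28 13:22:01 web01 sshd[4321]: Accepted publickey for deploy from 10.0.0.5 port 51234 ssh2`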

We recommend using the JSON format, which is the default. Timestamp extraction and other vital enhancements should already have been performed by the time the event reaches the Destination, and re-using that work makes your Replay simpler. Notice that the raw example above is literally just the original data. There are cases where that’s what you want, but usually having the pre-processed event is a good thing.

Writing the Data Out

The worker nodes stage files until certain limits are reached: time open, idle time, size, or number of open files. These settings are available under the Advanced tab in the Destination config. Once one of the limits is reached, the worker gzips the file and drops it into the object store. If you hit the open file limit, the oldest open file is closed and uploaded first.

The settings for the S3 Destination also allow you to define how the uploaded files are partitioned. Host, time, sourcetype… all the metadata is available to you for this purpose. When you’re replaying data from the store, these partitions make your replay searches faster: we can map segments of the path back to variables, including time, that let you zero in on the exact logs you need to replay without having to check _raw. In the how-to further down, I provide an example of the expressions we recommend.

Retrieving the Events

With events stored in an object store, we can now point Stream to that store to Replay selected events back through the system. We want to use data in the file path as an initial level of filtering to exclude as much data as we can from the download. Object retrieval and unpacking impose an enormous resource hit in the Replay process. Minimize your impact radius. Searching against _raw data is also possible but should be secondary to _time, sourcetype, index, host, etc.

More details: The leader picks one worker node to do the discovery, finding the potential objects in play. That list of targets is then doled out to the worker group, which pulls down the objects and examines them for content matches before final delivery. All worker nodes share the workload of retrieving and reinjecting the data.

Finally, we need to process the data coming back to extract each event. As with any incoming data stream on a compatible Source, Stream can use default or custom event breaker definitions. Since we recommended above writing the events out in JSON format, we’ll use the Cribl event breaker ruleset on this Source; it contains a newline-delimited JSON definition.

Putting it all together: How To Set Up and Use Replay in Stream

The Store

We usually talk about object storage, but any shared storage will work. It could be NFS or sshfs mounts (please don’t). As long as all the Stream nodes can see it, we’re ready to go. For the purposes of this post, let’s stick with an S3-compatible store.

You’ll need all the credentials, keys, secret keys, endpoints, etc., required to access the bucket you intend to use in your object store. The bucket should be able to grow to the size designed for your long-term archival needs.

Note: S3 “deep-freeze” services like Glacier do not work with Replay.

The Destination Definition

In your Stream worker group config, create a new S3 Destination pointing at your bucket. Also visit the Destination’s Advanced Settings tab to adjust the limits if needed. The defaults are fine for most situations, but depending on your partitioning scheme you could be staging 100+ files at once. Make sure your staging area has enough space! Set the Max open files option appropriately; it overrides the size and time limits. We recommend that the staging area be its own volume, so you don’t fill up a more vital volume by mistake.

The key settings here are the partitioning and file name prefix expressions; the full text of each is below. In a nutshell, we’re using time and other metadata to construct a path in the object store that will be useful to us at replay time:

Year/Month/Day/index/host/sourcetype/HHMM-foobar.gz

Partitioning (1 line):

`${C.Time.strftime(_time ? _time : Date.now() / 1000, '%Y/%m/%d')}/${index ? index : 'no_index'}/${host ? host : 'no_host'}/${sourcetype ? sourcetype : 'no_sourcetype'}`

File Name:

`${C.Time.strftime(_time ? _time : Date.now() / 1000, '%H%M')}`

We’ve also included a Key Prefix drawn from Knowledge -> Global Variables. You could use this to partition your logs by environment or any other qualifier. Or not – it is not a required field.
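Putting those together, an object key produced by this scheme might look like `prod/2021/09/28/main/web01/syslog/1322-foobar.gz`, where `prod` stands in for whatever your Key Prefix evaluates to and `foobar` stands in for the remainder of the file name, as in the pattern shown above.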

The Archival Route

Create a new Stream route called Archival that matches everything you want to archive. In my case, the Filter is `!__replayed`; we set that internal field on the Replay collector (see the next step), so any event that is not coming from my defined Replay process will be archived. Select the passthru pipeline (or create an empty pipeline and use that), and select the S3 destination you created above. Finally, make sure Final is not set – we want data to flow through this route, not stop at it. I position the Archival route at or near the top of the route list. Save your work, commit, and deploy.
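To recap, the route should end up looking something like this (the `s3-archive` Output name is a placeholder for whatever you named your S3 Destination):

Route: Archival
Filter: `!__replayed`
Pipeline: `passthru`
Output: `s3-archive`
Final: No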

Once saved and deployed, data should be flowing to your object store. Check your work! Go to the Monitoring dashboard and select Destinations at the top. We want to see a green checkmark on the right side for your S3 destination. If it’s not green, find out why. You can also use an S3-capable browser, like Cyberduck, to check the files more manually. (Troubleshooting this is outside the scope of this doc.)

 

The Replay Collector

Go to the Collectors tab (pre v3.1) or Sources tab -> Collectors (v3.1+). Create a new S3 collector called Replay. Use the Auto-populate option to pull the configs in from your S3 destination.

We’ll need to establish the tokens to extract from the path on the S3 store. Below is our suggestion if you used the partitioning scheme recommended in the Destination definition above; if you chose to leave out C.Vars.MYENV, exclude the first segment. It is vital that these tokens match the partitioning scheme in your Destination definition.
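In the Path field, something along these lines should line up with that scheme (a sketch reconstructed from the partitioning and file name expressions above; the `my_env` token name is an assumption standing in for whatever your Key Prefix produces):

`/${my_env}/${_time:%Y}/${_time:%m}/${_time:%d}/${index}/${host}/${sourcetype}/`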

With these tokens defined, you can define filters that exclude/include relevant files before Stream is required to download and open them. This cuts down on the work required by orders of magnitude.

This is a core concept! We want to avoid having to open files to check for matches, relying instead on key data in the path. Stream can immediately exclude, without downloading, the files that have no chance of matching our target events. Using _time, sourcetype, host, and index, it can accurately zero in on the target files. Stream will then download and interrogate the contents of whatever is left after this high-level filtering, and send the matches along to the routing table.
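For example, a collector Filter expression like this (the index, host, and sourcetype values are hypothetical) would rule files in or out based purely on the path tokens, before any objects are downloaded:

`index=='main' && sourcetype=='syslog' && host=='web01'`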

Finally, under Result Settings -> Event Breakers, select the Cribl ruleset. This ruleset understands how to parse the JSON-packaged events stored by the archival Destination.

To make it easier to identify events that have been replayed, create a field (Result Settings -> Fields) named __replayed and set it to true. You could use this to filter in routes and pipelines later, but since it’s a double-underscore field, it’s internal only – it won’t be passed on to your final destinations. In the previous step, we used it to prevent replayed events from being re-archived.

Save it. Commit and Deploy.

Testing a Replay

After you’ve accumulated some data in your S3 store, head over to Pipelines and start a capture. For the filter, use `__replayed`, and run it for 300 seconds with 10 max events. Once it’s running, in a new browser window, navigate back to Collectors and run your Replay collector. Filter on `true`, and set Earliest time to -1h and Latest to now. You can run it in Preview mode to make sure you get results, then come back and do a full run; or just do a full run right off the bat if you’re confident. With a running job, you can click the job ID to follow its progress. You can also pop back over to the Capture browser window or tab, where you should see events showing up.

In a full run, the events will proceed through your routes as normal, and land wherever your original routes and pipelines dictate. In my case, they landed in Splunk, and I could easily see duplicate events, EXACT duplicate events, in the timeframe defined in the collector job (the last 1 hour).

Once defined, the collector can be controlled via scheduling, manual runs, or API calls. And in actual use, you’d want to fill out the collector’s Filter expression to meet your needs.

Conclusion

The Replay feature in Stream is a true game changer for anyone managing digital exhaust data at scale. By separating your system of retention from systems of analysis, you can optimize your budget in ways not possible before. And since the archived data is stored in an open format, you’re free to use it however you like.

We offer free training with live Stream instances, a free-to-use cloud instance of your very own, free downloads, and a free tier of ongoing usage.

True story: I was so excited at all the Stream features, and at Cribl’s demonstrated excellence, that I quit my job in spring of 2021 to join this amazing team. If you’d like to work with super-sharp people in an exciting young company, we’re hiring.

The fastest way to get started with Cribl Stream is to sign up at Cribl.Cloud. You can process up to 1 TB of throughput per day at no cost. Sign up and start using Stream within a few minutes.
