
Streaming Data Deduplication with Cribl

May 14, 2019

The Problem

It’s not uncommon for machine data systems to send and receive duplicate or repeated events. This can happen for a variety of reasons, for example:

Misconfiguration (a.k.a. layer-8 problems) on source, intermediary, or aggregation systems may cause duplicate data to be sent out.

Bugs or software defects may cause a data source to occasionally (or worse, frequently) send duplicates.

Unreliable networks or connectivity issues may force sources to resend data.

Using certain protocols (e.g., UDP) may also cause out-of-order events or duplicates.

Repeated, low-value events that indicate fast cycles or loops (e.g., network device interfaces flapping rapidly).

The functional impact of duplicates depends on the situation and the use cases the system serves. For instance, a relatively low number of dupes may be insignificant when the system is mostly used by humans for troubleshooting. On the other hand, a large number of dupes can throw off security and operational analytics, especially when those depend on summarization or statistical aggregations.

From a system perspective, a large number of dupes imposes significant stress on performance and storage. Searches, queries, dashboard views, etc. will all be slower due to the added number of events to sift through, which directly affects the end-user experience. Storage will also be consumed faster, which means higher-value data will likely be aged out prematurely.

Solution

When dupes occur, it becomes necessary to eliminate them. Cribl puts the machine data administrator in control of the flow and provides functionality to reduce dupes inline, before data is forwarded to a downstream receiver.

(Screenshot: the Cribl Suppress function)

The Suppress function plays a central role here. It works as follows: first, users define a key expression to uniquely identify events in a stream; next, the function checks whether that key has been seen before. If yes, the event is dropped; if not, it is allowed through. In pseudo-code:

for each event e:
    if haveWeSeenThis(e.key):   # key already seen within the suppression window
        drop(e)                 # duplicate: suppress it
    else:
        letItThru(e)            # first occurrence: forward it

Deduplication keys are not stored for all time. Instead, they work on a user-defined time window: as the window moves forward in time, new keys are inserted and older ones are expired.
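To make the windowing behavior concrete, here is a minimal sketch of a time-windowed key cache in TypeScript. This is an illustration only, not Cribl's actual implementation; the names (DedupWindow, check, expire) are assumptions.

// Minimal sketch of a time-windowed dedup cache (illustration only,
// not Cribl's implementation). Keys expire once they fall outside
// the user-defined suppression window.
class DedupWindow {
  private seen = new Map<string, number>(); // key -> last-seen time (epoch ms)

  constructor(private windowMs: number) {}

  // Returns true if the key was already seen within the window.
  check(key: string, now: number = Date.now()): boolean {
    this.expire(now);
    if (this.seen.has(key)) return true;
    this.seen.set(key, now);
    return false;
  }

  // Drop keys that have fallen out of the window.
  private expire(now: number): void {
    for (const [key, ts] of this.seen) {
      if (now - ts > this.windowMs) this.seen.delete(key);
    }
  }
}

A Suppress-like function would then drop an event whenever check(key) returns true for the key built from that event.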

The key expression allows the administrator to articulate precisely how to uniquely identify events. This can be as coarse as the entire event content or as fine as a very specific combination of an event's fields.

(Screenshot: sample events containing duplicates)

This nuance is what makes the key expression critical. For instance, in the sample above, even though some events look identical, you can't deduplicate them based on their full content because the timestamp differs in each one. In fact, given the nature of the timestamp field (typically monotonically increasing), it should be excluded from deduplication altogether. For this particular source, we can construct a key expression that targets only the relevant fields, giving a more precise basis for deduplication. For example, this expression will compose a key using the values of the channel, message, and src fields:

`${channel}:${message}:${src}`
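In plain TypeScript terms, the expression behaves like the sketch below. The event shape here is hypothetical, chosen only to match the field names above:

// Hypothetical event shape matching the fields referenced in the expression.
interface Evt { channel: string; message: string; src: string; _time: number; }

// Compose the dedup key the same way the template expression would.
// Note that the timestamp (_time) is deliberately left out of the key.
const keyFor = (e: Evt): string => `${e.channel}:${e.message}:${e.src}`;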

Suppress Function

The Suppress function offers a few configuration options that can be tuned to match precise deduplication requirements. Let's take a look.

(Screenshot: Suppress function configuration options)

Key Expression: Used to build a key that uniquely identifies events to suppress. For example, in our case: `${channel}:${message}:${src}`
Number to Allow: The number of events to allow per time window. Typically this will be 1.
Suppression Period: The time window, in seconds, to suppress events for after 'Number to Allow' events have been received. It is recommended to keep this to a reasonable number, say, less than 60s or so: the greater this number, the larger the corpus of keys to store and compare against.
Drop Suppressed Events: Indicates whether suppressed events should be dropped or merely decorated with suppress=1. Decoration is useful when other functions down the pipeline need to be invoked on duplicate events, or to assist downstream systems that remove duplicates themselves. See the sketch just below for how these options fit together.
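Putting these options together, here is a rough sketch, in TypeScript, of the decision each event goes through. All names and shapes are assumptions for illustration, and window-based expiry of the counters is omitted for brevity:

// Rough sketch of the per-event decision (assumed names, not Cribl's source).
type Evt = Record<string, unknown>;

interface SuppressConfig {
  keyFor: (e: Evt) => string; // Key Expression
  numToAllow: number;         // Number to Allow
  periodSec: number;          // Suppression Period (drives counter expiry, omitted here)
  dropSuppressed: boolean;    // Drop Suppressed Events
}

// Returns the event to forward, or null if it was dropped.
function suppress(e: Evt, cfg: SuppressConfig, counts: Map<string, number>): Evt | null {
  const key = cfg.keyFor(e);
  const n = (counts.get(key) ?? 0) + 1;
  counts.set(key, n);
  if (n <= cfg.numToAllow) return e;    // within the allowance: pass through
  if (cfg.dropSuppressed) return null;  // drop the duplicate
  return { ...e, suppress: 1 };         // or decorate it instead
}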

Screen Shot 2019-05-12 at 12.16.51 PM

Notice how, in this sample, events #2 and #3 have the same key (per the expression above), and #3 has been dropped, as indicated by the strikethrough/greyed-out styling.

The number of events dropped is always indicated in context, i.e., no more of those loose, separate “Last message repeated N times” events. Instead, the next event that qualifies to be allowed through is decorated with suppressCount=N, where N represents the number of events dropped in the previous time window.
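As a minimal sketch of that decoration in TypeScript (the suppressCount field name comes from the behavior described above; everything else is assumed for illustration):

type Evt = Record<string, unknown>;

// Attach suppressCount to the next allowed event, but only if
// something was actually suppressed in the previous window.
function decorate(e: Evt, droppedLastWindow: number): Evt {
  return droppedLastWindow > 0 ? { ...e, suppressCount: droppedLastWindow } : e;
}

// e.g. decorate({ channel: "ops", message: "link down", src: "sw01" }, 42)
// => { channel: "ops", message: "link down", src: "sw01", suppressCount: 42 }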

(Screenshot: an allowed event decorated with suppressCount)

Depending on the key expression, the number of identified duplicate or repeated events can be quite significant.

(Screenshot: a capture showing a large number of suppressed events)

Many Possibilities

Removing or minimizing duplicates results in a better experience across the board: analytics and end-system performance increase, fewer CPU and storage resources are wasted, and the overall user experience improves. Cribl makes the process of dealing with duplicates in real time easy, and powerful key expressions allow for custom, fine-grained targeting of repeated patterns. Give it a try today!

If you’d like more details on configuration, see our documentation or join us in Slack #cribl, tweet at us @cribl_io, or contact us via hello@cribl.io. We’d love to help you!

Enjoy it!  — The Cribl Team

