Cribl puts your IT and Security data at the center of your data management strategy and provides a one-stop shop for analyzing, collecting, processing, and routing it all at any scale. Try the Cribl suite of products and start building your data engine today!
Learn more ›Evolving demands placed on IT and Security teams are driving a new architecture for how observability data is captured, curated, and queried. This new architecture provides flexibility and control while managing the costs of increasing data volumes.
Read white paper ›Cribl Stream is a vendor-agnostic observability pipeline that gives you the flexibility to collect, reduce, enrich, normalize, and route data from any source to any destination within your existing data infrastructure.
Learn more ›Cribl Edge provides an intelligent, highly scalable edge-based data collection system for logs, metrics, and application data.
Learn more ›Cribl Search turns the traditional search process on its head, allowing users to search data in place without having to collect/store first.
Learn more ›Cribl Lake is a turnkey data lake solution that takes just minutes to get up and running — no data expertise needed. Leverage open formats, unified security with rich access controls, and central access to all IT and security data.
Learn more ›The Cribl.Cloud platform gets you up and running fast without the hassle of running infrastructure.
Learn more ›Cribl.Cloud Solution Brief
The fastest and easiest way to realize the value of an observability ecosystem.
Read Solution Brief ›Cribl Copilot gets your deployments up and running in minutes, not weeks or months.
Learn more ›AppScope gives operators the visibility they need into application behavior, metrics and events with no configuration and no agent required.
Learn more ›Explore Cribl’s Solutions by Use Cases:
Explore Cribl’s Solutions by Integrations:
Explore Cribl’s Solutions by Industry:
Try Your Own Cribl Sandbox
Experience a full version of Cribl Stream and Cribl Edge in the cloud.
Launch Now ›Get inspired by how our customers are innovating IT, security and observability. They inspire us daily!
Read Customer Stories ›Sally Beauty Holdings
Sally Beauty Swaps LogStash and Syslog-ng with Cribl.Cloud for a Resilient Security and Observability Pipeline
Read Case Study ›Experience a full version of Cribl Stream and Cribl Edge in the cloud.
Launch Now ›Transform data management with Cribl, the Data Engine for IT and Security
Learn More ›Cribl Corporate Overview
Cribl makes open observability a reality, giving you the freedom and flexibility to make choices instead of compromises.
Get the Guide ›Stay up to date on all things Cribl and observability.
Visit the Newsroom ›Cribl’s leadership team has built and launched category-defining products for some of the most innovative companies in the technology sector, and is supported by the world’s most elite investors.
Meet our Leaders ›Join the Cribl herd! The smartest, funniest, most passionate goats you’ll ever meet.
Learn More ›Whether you’re just getting started or scaling up, the Cribl for Startups program gives you the tools and resources your company needs to be successful at every stage.
Learn More ›Want to learn more about Cribl from our sales experts? Send us your contact information and we’ll be in touch.
Talk to an Expert ›Data is growing, and we are being asked to search larger and larger amounts of data. This puts larger and larger demands on Search resources. Reading all the data to find matching events is muscling through the data. Wouldn’t it be more efficient to be able to do filtering before reading the data? Cribl Search does precisely that by leveraging Parquet Pushdowns. Let’s discuss Parquet Pushdowns and look at a Cribl Search example where we see a 96% reduction in computing resources using Predicate and Projection Pushdowns.
For those unfamiliar with Parquet, it is a popular columnar storage file format and offers efficient data storage and processing. It’s widely used with large data processing and is the file format used in Amazon Security Lake. The critical performance benefits result from the columnar storage that Parquet leverages. Let’s understand the key differences between columnar storage vs. traditional row-based storage (CSV, JSON).
In row-based storage formats, the data for a single row is stored together on disk. In Parquet, however, the values for each column are stored together. With row-based storage, all the data for a single row must be scanned during a query. For many of the queries, we don’t require all the columns or fields to be involved. We can omit the unnecessary columns with these queries and only read the required ones. This reduces the amount of data read from the disk and dramatically impacts the query speed.
We’ll analyze the performance benefits using the NYC Taxi & Limousine Commission data. They provide trip data in the parquet format that can be accessed here. We’ll be using the Yellow and Green taxi datasets for March of 2023 to explore the performance gains of Projection and Predicate Pushdown. There are several fields in the dataset, and we’ll leverage searches focusing on the trip_distance field. The dataset includes over 3.4M taxi rides.
Cribl Search’s utilization of Projection Pushdown optimizes data retrieval and process by selecting and reading only the required columns or fields. Projection is selecting only certain columns or fields from the Parquet file. Pushdown is the concept of having this work done at lower code levels. This filtering takes place where the data is read. Contrast this with JSON or CSV files where the entire row must be read, the field found, and then additional processing can be done. The Parquet format with Projection Pushdown has significant performance improvements focusing only on the fields that matter.
Let’s take a look at a simple search example to see what the performance gains are. We’ll look at the search where we want to find trips under 0.1 miles with and without Projection Pushdown.
dataset="parquet" data="taxi" source="s3://parquet/taxi/*_tripdata_2023-03.parquet" trip_distance<0.1 |count
Without Projection Pushdown, every column is returned. However, in this search scenario, we only need a row count. We see that impact and benefit with 1.3GB being scanned without Projection Pushdown compared to 114.79MB when used. That’s a 91% reduction in the amount of data read.
The benefits of optimizing data access and the columns read translate directly into the computing resources required to complete a search. When looking at the CPU Seconds required to run with and without the Projection Pushdown, we see a 76% reduction in compute time from 354.734 seconds to only 84.232 seconds.
You can also use the project operator within Search to determine what fields to include. This will limit the columns being returned as you project the columns desired. You can learn more about the project operator here.
dataset="parquet" data="taxi" source="s3://parquet/taxi/*_tripdata_2023-03.parquet" trip_distance<0.1 |project trip_distance, tip_amount, _time
Reducing the amount of data read impacts the rest of the search processing and provides for more efficient use of resources. While Projection Pushdown gives us tremendous gains in performance, Predicate Pushdown can help us get additional benefits.
A predicate is a condition or filter you use in a query. Using our sample dataset and the search for the trip count under 0.1 miles, Predicate Pushdown takes that filter and pushes it down to the code that limits the data being read from the Parquet files. Our desired result is only to read the data with trips under 0.1 miles. To understand how this works with Parquet, we need to look at Parquet column metadata.
While Parquet stores data in the columnar format, it does not store all of the rows in a column together. Instead, it uses column chunks. Let’s oversimplify this for an explanation. Let’s assume our dataset has 20 rows. Parquet might break that up into 2 column chunks with ten rows each. For each column chunk, Parquet creates metadata, including statistics such as the maximum and minimum values for each column. You might envision it looking something like this.
Cribl Search pushes down the predicate and reads the column chunk metadata to determine whether or not the chunk needs to be read and processed. With our search example we are looking for trips under 0.1 miles. Using the sample chunks above, Cribl Search would skip the first chunk because the minimum value in the trip_distance field is 0.78, above our predicate of trips under 0.1 miles. The second chunk matches our predicate, which would be read and processed. With large files, Predicate Pushdown can have a big impact. Let’s look at our example search and the impact of Predicate Pushdown on Cribl Search performance.
dataset="parquet" data="taxi" source="s3://parquet/taxi/*_tripdata_2023-03.parquet" trip_distance<0.1 |count
With Predicate Pushdown and the ability to skip sections that don’t meet the query parameters, we’ve reduced the data needed to read further. It is reduced by over 98% from the Projection Pushdown, with only 1.94MB read. That’s a massive reduction from the original 1.3GB.
The CPU resources were reduced by over 86% as well. Up to 11.475 seconds, we have an extremely performant search that searches over 3.47M events.
Thus far, we’ve looked at some searches returning counts with just two files. The March 2023 Yellow and Green taxi data files. Let’s scale this up to a much more extensive search across existing 2023 data (January-July) and all the 2022 data. We’ll search across a total of 38 parquet files containing the data.
dataset="parquet" data="taxi" source="s3://parquet/taxi/*.parquet" trip_distance<0.1 |count
Even with the large data set, where we find 1,175,201 rides, the search returns in 8.41 seconds. Cribl Search only had to scan 36.46MB of data.
As data grows, we’re facing challenges regarding efficient search capabilities. We’ve looked at some of the Parquet data format benefits and how leveraging Projection and Parquet Pushdowns helps reduce the amount of data that needs to be read and processed to return search results. By default, Cribl Search leverages both pushdowns when searching files in the Parquet format to help customers get the most performant search experience.
Cribl, the Data Engine for IT and Security, empowers organizations to transform their data strategy. Customers use Cribl’s suite of products to collect, process, route, and analyze all IT and security data, delivering the flexibility, choice, and control required to adapt to their ever-changing needs.
We offer free training, certifications, and a free tier across our products. Our community Slack features Cribl engineers, partners, and customers who can answer your questions as you get started and continue to build and evolve. We also offer a variety of hands-on Sandboxes for those interested in how companies globally leverage our products for their data challenges.
Experience a full version of Cribl Stream and Cribl Edge in the cloud with pre-made sources and destinations.
Classic choice. Sadly, our website is designed for all modern supported browsers like Edge, Chrome, Firefox, and Safari
Got one of those handy?