
Optimizing Data Access: Best Practices for Partitioning in Cribl

Last edited: July 10, 2024

The more customers I talk to, the more I see a trend toward wanting a low-cost, vendor-agnostic data lake. Customers want the freedom to store their data long-term, and they typically look to object stores from AWS, Azure, and Google Cloud.

To optimize data access for use cases such as Cribl Replay and Cribl Search, users partition their data into directories. With partitioned data, only the relevant files need to be accessed for rehydration or search.

Here is an example of what partitioning may look like:

Code example
MyBucket/dataArchive/2023/03/17/glendaleAz_1.json.gz
MyBucket/dataArchive/2023/03/18/glendaleAz_2.json.gz
...
MyBucket/dataArchive/2023/03/24/lasVegasNv_1.json.gz
MyBucket/dataArchive/2023/03/25/lasVegasNv_2.json.gz

The major cloud providers expose object storage APIs that are compatible with, or closely modeled on, S3. Accessing an object store starts with listing the relevant files through a listObjects API call. A partitioning expression tells Cribl which files are relevant. For the files in the partitioning example above, it looks something like this:

Code example
MyBucket/dataArchive/${_time:%Y}/${_time:%m}/${_time:%d}

Cribl Stream or Search will use the time picker to fill in the year, month, and day variables. This will result in the object storage API returning a list of files in the appropriate time range. After that, each file can be accessed, parsed, and handled to meet the user’s request.
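To make the flow concrete, here is a minimal Python sketch of how a time range can be expanded into one static prefix per day and then used to narrow a listing. It is only an illustration, not Cribl's implementation: `daily_prefixes`, `list_objects`, and `files_in_range` are hypothetical helpers, and `list_objects` is an in-memory stand-in for a real listObjects call.

```python
from datetime import date, timedelta


def daily_prefixes(base: str, start: date, end: date) -> list[str]:
    """Expand base/%Y/%m/%d into one static prefix per day in [start, end]."""
    prefixes = []
    day = start
    while day <= end:
        prefixes.append(f"{base}/{day:%Y/%m/%d}/")
        day += timedelta(days=1)
    return prefixes


def list_objects(keys: list[str], prefix: str) -> list[str]:
    """Stand-in for a listObjects call: return the keys that start with prefix."""
    return [k for k in keys if k.startswith(prefix)]


def files_in_range(keys: list[str], base: str, start: date, end: date) -> list[str]:
    """List only the files that live under the daily prefixes for the range."""
    return [k for p in daily_prefixes(base, start, end) for k in list_objects(keys, p)]


# The keys from the partitioning example above.
BUCKET_KEYS = [
    "MyBucket/dataArchive/2023/03/17/glendaleAz_1.json.gz",
    "MyBucket/dataArchive/2023/03/18/glendaleAz_2.json.gz",
    "MyBucket/dataArchive/2023/03/24/lasVegasNv_1.json.gz",
    "MyBucket/dataArchive/2023/03/25/lasVegasNv_2.json.gz",
]

matches = files_in_range(
    BUCKET_KEYS, "MyBucket/dataArchive", date(2023, 3, 17), date(2023, 3, 18)
)
```

With a time picker set to March 17 through March 18, only the two glendaleAz files fall under the generated prefixes; the lasVegasNv files are never listed.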

What NOT to Do, and How to Avoid Common Pitfalls

Example 1: Wildcards in Partitioning

Assume the object storage is partitioned in the following manner.

Code example
MyBucket/dataArchive/importantData/2023/03/31/arlingtonTx_1.json.gz
MyBucket/dataArchive/eventMoreImportantData/2023/04/01/arlingtonTx_2.json.gz
MyBucket/dataArchive/superImportantData/2023/04/02/arlingtonTx_3.json.gz
...
MyBucket/dataArchive/superDuperImportantData/2023/04/13/tampaFl_1.json.gz

The partitioning expression would look like this.

Code example
MyBucket/dataArchive/${dataImportance}/${_time:%Y}/${_time:%m}/${_time:%d}

This partitioning scheme is great for segmenting data for human access. If I need to access business-critical data, I know exactly where to go, and less important data already lives in a different location. It also simplifies setting S3 retention policies, especially when data types have different retention requirements.

The pitfall appears when data needs to be accessed by date and dataImportance is not defined in the Cribl Search query or the Cribl Stream collector filter. The listObjects API call only accepts a static prefix, that is, a directory path without wildcards. Because dataImportance sits to the left of the date folders and has no value, the prefix can only be resolved as far as MyBucket/dataArchive, so every file under that directory has to be listed before the date can be used to narrow the results.
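The effect of an undefined variable on the listing prefix can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cribl's resolver: `static_prefix` is a hypothetical helper, and the `known` dictionary is keyed by the text inside each `${...}` token.

```python
import re

# Matches tokens such as ${dataImportance} or ${_time:%Y}.
TOKEN = re.compile(r"\$\{([^}]+)\}")


def static_prefix(expression: str, known: dict[str, str]) -> str:
    """Resolve tokens left to right and stop at the first one with no value.

    Whatever comes before that token is the longest static prefix
    that could be handed to a listObjects call.
    """
    parts = []
    pos = 0
    for match in TOKEN.finditer(expression):
        parts.append(expression[pos:match.start()])
        token = match.group(1)
        if token not in known:
            # First unresolved token: the prefix cannot extend past this point.
            return "".join(parts)
        parts.append(known[token])
        pos = match.end()
    return "".join(parts) + expression[pos:]


EXPRESSION = "MyBucket/dataArchive/${dataImportance}/${_time:%Y}/${_time:%m}/${_time:%d}"
TIME_ONLY = {"_time:%Y": "2023", "_time:%m": "04", "_time:%d": "01"}

# Date is known but dataImportance is not: the prefix stops at dataArchive/.
broad = static_prefix(EXPRESSION, TIME_ONLY)

# Once dataImportance is supplied, the whole path becomes a static prefix.
narrow = static_prefix(EXPRESSION, {**TIME_ONLY, "dataImportance": "importantData"})
```

Even though the year, month, and day are all known in the first call, they sit to the right of the unresolved token, so they cannot contribute to the prefix.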

To avoid this pitfall, here are four recommendations:

  1. If dataImportance has a limited set of values, define each one as an individual dataset, with the value set statically in the partitioning expression. This ensures that the value is defined for every search.

  2. If the partitioning expression can be changed, move the dataImportance partition to the right of the date and time folders. The date then becomes part of the static prefix, so listings can be narrowed by time even when dataImportance is not defined.

  3. If nothing else can be changed, try to define dataImportance as part of the search or filter.

  4. “Quick” data acceleration will help return the first results faster, but remember that the bucket still needs to be listed to look for new files.
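As a rough illustration of the first recommendation, the Python sketch below compares how many keys a broad listing touches with how many are touched when each known dataImportance value gets its own static prefix. The helper names are hypothetical, `listed` is an in-memory stand-in for a listObjects call, and the key list is simply the example layout from above.

```python
KEYS = [
    "MyBucket/dataArchive/importantData/2023/03/31/arlingtonTx_1.json.gz",
    "MyBucket/dataArchive/eventMoreImportantData/2023/04/01/arlingtonTx_2.json.gz",
    "MyBucket/dataArchive/superImportantData/2023/04/02/arlingtonTx_3.json.gz",
    "MyBucket/dataArchive/superDuperImportantData/2023/04/13/tampaFl_1.json.gz",
]

# The limited, known set of dataImportance values, one per dataset.
IMPORTANCE_VALUES = [
    "importantData",
    "eventMoreImportantData",
    "superImportantData",
    "superDuperImportantData",
]


def listed(keys: list[str], prefix: str) -> list[str]:
    """Stand-in for a listObjects call with a static prefix."""
    return [k for k in keys if k.startswith(prefix)]


def per_dataset_listing(keys: list[str], day_path: str) -> list[str]:
    """Issue one narrow listing per known value instead of one broad listing."""
    results = []
    for value in IMPORTANCE_VALUES:
        results.extend(listed(keys, f"MyBucket/dataArchive/{value}/{day_path}/"))
    return results


# Broad listing: dataImportance is undefined, so everything is listed.
broad = listed(KEYS, "MyBucket/dataArchive/")

# Narrow listings: each dataset pins its own value, so the date can be used.
narrow = per_dataset_listing(KEYS, "2023/04/01")
```

The trade-off is a few more list calls in exchange for far fewer keys returned, which matters more as the bucket grows.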

Example 2: Large File Counts

Assume the object storage is partitioned in the following manner.

Code example
MyBucket/dataArchive/1989/12/13/ts_01.json.gz
MyBucket/dataArchive/1989/12/13/ts_02.json.gz
MyBucket/dataArchive/1989/12/13/ts_03.json.gz
MyBucket/dataArchive/1989/12/13/ts_04.json.gz
...
MyBucket/dataArchive/1989/12/13/ts_13.json.gz
...

(file count: 131989)

The partitioning expression would look like this.

Code example
MyBucket/dataArchive/${_time:%Y}/${_time:%m}/${_time:%d}

At first glance, this example looks very similar to the “good” example at the top of this blog, but the details matter. In this example, our partitioning only gets to the granularity of the day, and there are almost 132k files in this folder.

If data for a specific hour were needed, Cribl Search or Stream would have to open every file in this directory and filter its contents to find the events for that hour. With almost 132k files, that is slow and wastes compute on files that turn out to be irrelevant.

To avoid this pitfall, here are two recommendations:

  1. If partitioning can be changed, add another level of partitioning for the hour after the day directory. This will allow Cribl to search the relevant hour(s) directly without accessing every file.

  2. If partitioning cannot be changed, detailed acceleration is a great solution. Enabling detailed acceleration for _time will help immensely. Detailed acceleration will check the bucket for new files, scanning and internally keeping metadata on each file’s earliest and latest event. This allows Cribl Search to open only the relevant files and those added since the last time metadata was generated.
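The idea behind the second recommendation can be sketched as a small per-file time index. This is a conceptual illustration only, not how Cribl Search stores or queries its metadata: `FileMeta` and `files_overlapping` are hypothetical names, and the one-hour span assigned to each file is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileMeta:
    """Metadata kept per file: the time of its earliest and latest event."""
    key: str
    earliest: datetime
    latest: datetime


def files_overlapping(index: list[FileMeta], start: datetime, end: datetime) -> list[str]:
    """Return only the files whose event range overlaps [start, end]."""
    return [m.key for m in index if m.earliest <= end and m.latest >= start]


DAY = "MyBucket/dataArchive/1989/12/13"

# Assumed for illustration: each file happens to cover one hour of events.
INDEX = [
    FileMeta(f"{DAY}/ts_01.json.gz", datetime(1989, 12, 13, 0), datetime(1989, 12, 13, 1)),
    FileMeta(f"{DAY}/ts_02.json.gz", datetime(1989, 12, 13, 1), datetime(1989, 12, 13, 2)),
    FileMeta(f"{DAY}/ts_03.json.gz", datetime(1989, 12, 13, 2), datetime(1989, 12, 13, 3)),
]

# A query for 01:30 to 01:45 only needs to open the one overlapping file.
to_open = files_overlapping(
    INDEX, datetime(1989, 12, 13, 1, 30), datetime(1989, 12, 13, 1, 45)
)
```

Files added after the index was built have no metadata yet, which is why they still need to be opened until the metadata is regenerated.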

The Easy Button

If you want to eliminate the complexities of setting up object storage, defining retention policies, managing access, and keeping track of best practices, we have a solution for you: Cribl Lake!

Cribl Lake is a format-agnostic data lake that removes the complexities of managing your data. Cribl Lake handles access control, retention policies, partitioning, and more without manual configuration and that pesky command line.

