The more customers I talk to, the more I see a trend toward wanting a low-cost vendor-agnostic data lake. Customers want the freedom to store their data long-term and typically look to object stores from AWS, Azure, and Google Cloud.
To optimize data access, users partition their data into directories for use cases such as Cribl Replay and Cribl Search. By partitioning the data, only the relevant files need to be accessed for rehydration or search.
Here is an example of what partitioning may look like:
MyBucket/dataArchive/2023/03/17/glendaleAz_1.json.gz
MyBucket/dataArchive/2023/03/18/glendaleAz_2.json.gz
...
MyBucket/dataArchive/2023/03/24/lasVegasNv_1.json.gz
MyBucket/dataArchive/2023/03/25/lasVegasNv_2.json.gz
When it comes to accessing object stores, the major cloud providers offer S3-compatible APIs. The first stage of access is listing the relevant files through a listObjects API call. To determine which files are relevant, a partitioning scheme is defined. For the files in the partitioning example above, it looks something like this:
MyBucket/dataArchive/${_time:%Y}/${_time:%m}/${_time:%d}
Cribl Stream or Search will use the time picker to fill in the year, month, and day variables. This will result in the object storage API returning a list of files in the appropriate time range. After that, each file can be accessed, parsed, and handled to meet the user’s request.
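Here is a minimal sketch of how that listing could work against an S3-compatible store, using boto3 as a stand-in; the bucket name, helper names, and date handling are assumptions for illustration, not Cribl's actual implementation.

# Expand a date range into one static prefix per day, matching
# dataArchive/${_time:%Y}/${_time:%m}/${_time:%d}, then list each prefix.
from datetime import date, timedelta

import boto3

s3 = boto3.client("s3")

def day_prefixes(start: date, end: date):
    current = start
    while current <= end:
        yield current.strftime("dataArchive/%Y/%m/%d/")
        current += timedelta(days=1)

def list_relevant_files(bucket: str, start: date, end: date):
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for prefix in day_prefixes(start, end):
        # Each call only lists the objects under one day's folder.
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# Example: list_relevant_files("MyBucket", date(2023, 3, 17), date(2023, 3, 25))

Because each day becomes its own static prefix, only the folders inside the selected time range are ever listed.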
Assume the object storage is partitioned in the following manner.
MyBucket/dataArchive/importantData/2023/03/31/arlingtonTx_1.json.gz
MyBucket/dataArchive/evenMoreImportantData/2023/04/01/arlingtonTx_2.json.gz
MyBucket/dataArchive/superImportantData/2023/04/02/arlingtonTx_3.json.gz
...
MyBucket/dataArchive/superDuperImportantData/2023/04/13/tampaFl_1.json.gz
The partitioning expression would look like this.
MyBucket/dataArchive/${dataImportance}/${_time:%Y}/${_time:%m}/${_time:%d}
This partitioning scheme is great for segmenting data for human access. If I need to access business-critical data, I know exactly where to go, and less important data is already in a different location. This also simplifies setting S3 retention policies, especially when data types have different retention requirements.
The pitfall here is that if data needs to be accessed by date, and dataImportance is not defined as part of the Cribl Search query or the Cribl Stream collector filter, every file under the dataArchive directory has to be listed. The listObjects API call only accepts static prefixes, meaning directory paths without wildcards, so with dataImportance undefined, the call has to cover everything under MyBucket/dataArchive.
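To make the cost concrete, here is a hedged boto3 sketch of that broad listing; the helper name and regex are assumptions, and the point is simply that every key under dataArchive/ comes back and is filtered client-side.

# With ${dataImportance} undefined, no static prefix deeper than
# dataArchive/ can be built, so everything under it is listed and
# then filtered by date after the fact.
import re

import boto3

s3 = boto3.client("s3")

def list_by_date_without_importance(bucket: str, year: str, month: str, day: str):
    wanted = re.compile(rf"^dataArchive/[^/]+/{year}/{month}/{day}/")
    matching = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix="dataArchive/"):
        for obj in page.get("Contents", []):
            if wanted.match(obj["Key"]):
                matching.append(obj["Key"])
    return matching

The listing cost now scales with the size of the entire archive, not with the time range being searched.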
To avoid this pitfall, here are three recommendations:
1. If the set of dataImportance values is limited, each value can be defined as an individual dataset with the value set statically as part of the partitioning expression. This ensures that the value is defined for every search.
2. Moving the dataImportance partition to the right of the date and time folders can allow filtering by time (see the example after this list).
3. Always define dataImportance as part of the search or filter.
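As an illustration of the second recommendation, the partitioning expression could be rearranged so the time folders come first; this exact expression is an assumption rather than one of the original examples.
MyBucket/dataArchive/${_time:%Y}/${_time:%m}/${_time:%d}/${dataImportance}
With the date folders on the left, the time picker can still narrow the listing to specific days even when dataImportance is not part of the query.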
Assume the object storage is partitioned in the following manner.
MyBucket/dataArchive/1989/12/13/ts_01.json.gz
MyBucket/dataArchive/1989/12/13/ts_02.json.gz
MyBucket/dataArchive/1989/12/13/ts_03.json.gz
MyBucket/dataArchive/1989/12/13/ts_04.json.gz
...
MyBucket/dataArchive/1989/12/13/ts_13.json.gz
...
(file count: 131,989)
The partitioning expression would look like this.
MyBucket/dataArchive/${_time:%Y}/${_time:%m}/${_time:%d}
At first glance, this example looks very similar to the “good” example at the top of this blog, but the details matter. Here, the partitioning only goes down to day-level granularity, and there are almost 132k files under that day's folder.
If data for a specific hour were needed, Cribl Search or Stream would have to open and filter the contents of every file in this directory to find the relevant hour. At this file count, that is slow and expensive.
To avoid this pitfall, here are two recommendations:
1. Partition to a finer time granularity, such as adding an hour folder, so time-bounded queries only list the files they need (see the example after this list).
2. Write fewer, larger files per partition, so there are far fewer objects to list and open.
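As an illustration of finer time partitioning, an hour folder could be added using the same time tokens as the earlier expressions; the exact expression below is an assumption for illustration.
MyBucket/dataArchive/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H}
A search scoped to a single hour then only has to list that hour's folder instead of the entire day's folder.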
If you want to eliminate the complexities of setting up object storage, defining retention policies, managing access, and keeping track of best practices, we have a solution for you: Cribl Lake!
Cribl Lake is a format-agnostic data lake that removes the complexities of managing your data. Cribl Lake handles access control, retention policies, partitioning, and more without manual configuration and that pesky command line.
Cribl, the Data Engine for IT and Security, empowers organizations to transform their data strategy. Customers use Cribl’s suite of products to collect, process, route, and analyze all IT and security data, delivering the flexibility, choice, and control required to adapt to their ever-changing needs.
We offer free training, certifications, and a free tier across our products. Our community Slack features Cribl engineers, partners, and customers who can answer your questions as you get started and continue to build and evolve. We also offer a variety of hands-on Sandboxes for those interested in how companies globally leverage our products for their data challenges.
Experience a full version of Cribl Stream and Cribl Edge in the cloud with pre-made sources and destinations.