The more customers I talk to, the more I see a trend toward wanting a low-cost vendor-agnostic data lake. Customers want the freedom to store their data long-term and typically look to object stores from AWS, Azure, and Google Cloud.
To optimize data access, users partition their data into directories for use cases such as Cribl Replay and Cribl Search. With well-chosen partitioning, only the relevant files need to be accessed for rehydration or search.
Here's an example of what partitioning may look like:
MyBucket/dataArchive/2023/03/17/glendaleAz_1.json.gz
MyBucket/dataArchive/2023/03/18/glendaleAz_2.json.gz
...
MyBucket/dataArchive/2023/03/24/lasVegasNv_1.json.gz
MyBucket/dataArchive/2023/03/25/lasVegasNv_2.json.gz
When it comes to accessing object stores, all the major cloud providers offer S3-compatible APIs. The first stage of access consists of listing the relevant files through a listObjects API call. To determine which files are relevant, a partitioning schema is defined. For the files in the partitioning example above, it looks something like this:
MyBucket/dataArchive/${_time:%Y}/${_time:%m}/${_time:%d}
Cribl Stream or Search will use the time picker to fill in the year, month, and day variables. This will result in the object storage API returning a list of files in the appropriate time range. After that, each file can be accessed, parsed, and handled to meet the user’s request.
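To make this concrete, here is a minimal sketch of that listing stage in Python with boto3. It is not how Cribl Stream or Search implements it internally; the bucket name, base path, and dates are simply taken from the example above:

import boto3
from datetime import date, timedelta

s3 = boto3.client("s3")
BUCKET = "MyBucket"

def daily_prefixes(start: date, end: date):
    """Expand the partitioning expression dataArchive/%Y/%m/%d into one
    static prefix per day in the requested time range (the "time picker")."""
    day = start
    while day <= end:
        yield f"dataArchive/{day:%Y/%m/%d}/"
        day += timedelta(days=1)

def list_relevant_files(start: date, end: date):
    """Stage one of access: list only the objects under the relevant prefixes."""
    paginator = s3.get_paginator("list_objects_v2")
    for prefix in daily_prefixes(start, end):
        for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]

# Everything archived between 2023-03-17 and 2023-03-25:
for key in list_relevant_files(date(2023, 3, 17), date(2023, 3, 25)):
    print(key)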
What NOT to Do, and How to Avoid Common Pitfalls
Example 1: Wildcards in Partitioning
Assume the object storage is partitioned in the following manner.
MyBucket/dataArchive/importantData/2023/03/31/arlingtonTx_1.json.gz
MyBucket/dataArchive/evenMoreImportantData/2023/04/01/arlingtonTx_2.json.gz
MyBucket/dataArchive/superImportantData/2023/04/02/arlingtonTx_3.json.gz
...
MyBucket/dataArchive/superDuperImportantData/2023/04/13/tampaFl_1.json.gz
The partitioning expression would look like this.
MyBucket/dataArchive/${dataImportance}/${_time:%Y}/${_time:%m}/${_time:%d}
This partitioning scheme is great for segmenting data for human access. If I need to access business-critical data, I know exactly where to go, and less-important data is already in a different location. This also simplifies setting S3 retention policies, especially when data types have different retention requirements.
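Because each importance tier lives under its own prefix, per-tier retention can be expressed as plain S3 lifecycle rules. Here is a sketch using boto3; the tier names come from the example above, and the retention periods are made up for illustration:

import boto3

s3 = boto3.client("s3")

# Hypothetical retention periods (in days) per importance tier.
RETENTION_DAYS = {
    "importantData": 365,
    "superImportantData": 730,
}

rules = [
    {
        "ID": f"expire-{tier}",
        "Filter": {"Prefix": f"dataArchive/{tier}/"},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }
    for tier, days in RETENTION_DAYS.items()
]

s3.put_bucket_lifecycle_configuration(
    Bucket="MyBucket",
    LifecycleConfiguration={"Rules": rules},
)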
The pitfall here is that if data needs to be accessed by date and dataImportance is not defined as part of the Cribl Search query or the Cribl Stream collector filter, every file under the dataArchive directory has to be listed. The listObjects API call only accepts static prefixes, meaning directory paths without wildcards, so with dataImportance undefined, the call has to cover every file under MyBucket/dataArchive.
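Put another way, because wildcards aren't allowed, the only static prefix guaranteed to cover a given date is the parent directory itself. A short sketch of what the listing stage is forced to do in that case (boto3 again, purely for illustration):

import boto3

s3 = boto3.client("s3")

# What we'd like to ask for, but can't: listObjects has no wildcard support.
# desired_prefix = "dataArchive/*/2023/04/01/"

# The only usable static prefix is the parent directory, so every object
# under dataArchive/ gets listed and paginated, regardless of date.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="MyBucket", Prefix="dataArchive/"):
    for obj in page.get("Contents", []):
        pass  # every key in the archive comes back here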
To avoid this pitfall, here are three recommendations:
If the number of dataImportance values is a small, known set, each value can be defined as an individual dataset with the value written statically into the partitioning expression. This ensures the value is defined for every search (see the sketch after these recommendations).
If the partitioning expression can be changed, moving the dataImportance partition to the right of the date and time folders allows filtering by time.
If nothing else can be changed, try to define dataImportance as part of the search or filter. “Quick” data acceleration will help return the first results faster, but remember that the bucket still needs to be listed to look for new files.
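Here is the sketch promised above for the first recommendation. If the set of dataImportance values is small and known, the tier can be pinned per dataset, which makes every prefix static again and keeps listing scoped to the requested date. The tier names are just the ones from the example paths:

import boto3
from datetime import date

s3 = boto3.client("s3")

# Assumed: the complete (small) set of dataImportance values, one dataset each.
IMPORTANCE_TIERS = [
    "importantData",
    "evenMoreImportantData",
    "superImportantData",
    "superDuperImportantData",
]

def prefixes_for_day(day: date):
    """With the tier pinned per dataset, every prefix is static again and the
    listing stays scoped to the requested date."""
    for tier in IMPORTANCE_TIERS:
        yield f"dataArchive/{tier}/{day:%Y/%m/%d}/"

paginator = s3.get_paginator("list_objects_v2")
for prefix in prefixes_for_day(date(2023, 4, 1)):
    for page in paginator.paginate(Bucket="MyBucket", Prefix=prefix):
        for obj in page.get("Contents", []):
            print(obj["Key"])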
Example 2: Large File Counts
Assume the object storage is partitioned in the following manner.
MyBucket/dataArchive/1989/12/13/ts_01.json.gz
MyBucket/dataArchive/1989/12/13/ts_02.json.gz
MyBucket/dataArchive/1989/12/13/ts_03.json.gz
MyBucket/dataArchive/1989/12/13/ts_04.json.gz
...
MyBucket/dataArchive/1989/12/13/ts_13.json.gz
...
(file count: 131989)
The partitioning expression would look like this.
MyBucket/dataArchive/${_time:%Y}/${_time:%m}/${_time:%d}
At first glance, this example looks very similar to the “good” example at the top of this blog, but the details matter. In this example, our partitioning only gets to the granularity of the day, and there are almost 132k files in this folder.
If data for a specific hour were needed, Cribl Search or Stream would have to read and filter the contents of every file in this directory to find the events for that hour, which is slow and expensive.
To avoid this pitfall, here are two recommendations:
If partitioning can be changed, add another level of partitioning for the hour after the day directory. This allows Cribl to search the relevant hour(s) directly without accessing every file (see the sketch after these recommendations).
If partitioning cannot be changed, detailed acceleration is a great solution. Enabling detailed acceleration for _time will help immensely. Detailed acceleration will check the bucket for new files, scanning and internally keeping metadata on each file’s earliest and latest event. This allows Cribl Search to open only the relevant files and those added since the last time metadata was generated.
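As a sketch of the first recommendation: extending the partitioning expression to MyBucket/dataArchive/${_time:%Y}/${_time:%m}/${_time:%d}/${_time:%H} means a one-hour query only has to list a single small directory instead of nearly 132k files. The boto3 code below is just an illustration of that listing step:

import boto3
from datetime import datetime, timedelta

s3 = boto3.client("s3")

def hourly_prefixes(start: datetime, end: datetime):
    """One static prefix per hour in the requested range, assuming the
    partitioning now goes down to .../%Y/%m/%d/%H."""
    t = start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        yield f"dataArchive/{t:%Y/%m/%d/%H}/"
        t += timedelta(hours=1)

paginator = s3.get_paginator("list_objects_v2")
for prefix in hourly_prefixes(datetime(1989, 12, 13, 4), datetime(1989, 12, 13, 5)):
    for page in paginator.paginate(Bucket="MyBucket", Prefix=prefix):
        for obj in page.get("Contents", []):
            print(obj["Key"])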
The Easy Button
If you want to eliminate the complexities of setting up object storage, defining retention policies, managing access, and keeping track of best practices, we have a solution for you: Cribl Lake!
Cribl Lake is a format-agnostic data lake that removes the complexities of managing your data. Cribl Lake handles access control, retention policies, partitioning, and more without manual configuration and that pesky command line.