September 20, 2022
Everybody is starting to look more closely at object storage to deliver on data lake initiatives, and S3, Amazon S3 specifically, is the gold standard for it. In addition, we've heard from many of you that setting up S3 as a destination is a must when getting started with Cribl Stream. So in this article, we'll walk you through the setup.
Customers tell us they use Cribl Stream and S3 to solve several problems:
Meet Compliance Mandates: Strategically send only the data needed to trigger alerts and detections, and direct the remainder to cost-effective storage; this significantly conserves budget.
Enhance Tool Performance: Minimize data glut by transmitting only the data essential for reporting and alerting, while always maintaining a 'just in case' backup copy.
Have Peace of Mind: Should you need a more comprehensive dataset for investigations or reporting, it's readily available in S3, poised to 'replay' only the necessary data into your analytics tools for precise reporting or additional context.
Now, if this is your first time playing with Cribl, I suggest taking a few minutes (about 30) to go through the Affordable Log Storage Sandbox to familiarize yourself with what we’re going to be tackling today.
In the initial AWS S3 setup, you'll want to configure the Destination's basic connection fields for your bucket.
If you need more information, feel free to visit our docs page dedicated to Amazon S3 Compatible Stores.
Before we go too far, there are a few things to keep in mind, because when we send data to S3, we don't just read the data, process it, and send it straight to a bucket. We do something else with it first that we call staging. The Worker Nodes stage files locally until certain limits are reached: time open, idle time, size, or number of files. Once any of the configured limits is hit, the Worker optionally gzips the files and transfers them to the object store. Over time, that compression can shrink your storage bill by 8-15x. In addition, the S3 Destination's settings let you define how the uploaded files are partitioned. Cribl Stream creates subdirectories (host, time, sourcetype, source) that make the Replay motion much easier later on. Keep all of this in mind as you set up your S3 destination in Cribl Stream.
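As a rough mental model, staging is governed by a handful of limits on the Destination. The sketch below is illustrative only; the property names and values approximate the settings you'll see in the UI rather than exact Cribl config keys.

// Illustrative sketch only: approximate names/values for the S3 Destination's
// staging-related settings (real UI labels and config keys may differ).
const s3StagingBehavior = {
  stagePath: '/opt/cribl/state/outputs/staging', // hypothetical local staging directory
  maxFileSizeMB: 32,        // close a staged file once it reaches this size...
  maxFileOpenTimeSec: 300,  // ...or once it has been open this long...
  maxFileIdleTimeSec: 30,   // ...or once no events have arrived for this long
  maxOpenFiles: 100,        // cap on concurrently open staged files per worker process
  compress: 'gzip',         // gzip before upload; the text cites an 8-15x reduction
};
console.log(s3StagingBehavior);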
Things can get unruly pretty quickly when dealing with APIs, open files, and queuing, so keep these limits in mind as you tackle the next steps of the configuration:
AWS S3 API rate limits
Cribl Stream limits (such as the Destination's max open files)
Linux OS limits (such as the maximum number of open file descriptors)
Partition expressions essentially define how files are partitioned and organized, using metadata from the events to build out the S3 bucket directory structure. If this is left blank, Cribl Stream will fall back to the event's __partition field value (if present), or otherwise to the root directory of the Output Location and Staging Location.
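In rough pseudocode terms, that fallback order behaves like this. This is a sketch of the behavior described above, not Cribl's actual implementation; resolvePartition and partitionFn are made-up names.

// Sketch of the described fallback order, not Cribl's actual code.
function resolvePartition(event, partitionFn, rootDir) {
  if (partitionFn) return partitionFn(event);        // a configured partition expression wins
  if (event.__partition) return event.__partition;   // otherwise, the event's __partition field
  return rootDir;                                    // otherwise, the Output/Staging root
}

// Example usage with made-up values:
const byHostAndType = (e) => `${e.host}/${e.sourcetype}`;
console.log(resolvePartition({ host: 'fw01', sourcetype: 'syslog' }, byHostAndType, '/')); // "fw01/syslog"
console.log(resolvePartition({ __partition: 'adhoc/replay' }, null, '/'));                 // "adhoc/replay"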
This becomes very important when you're replaying data from the store. The partitions will make your Replay searches faster when you need to fetch data in the future. Segments of the path map back to variables (including time) that you can use to zero in on the exact logs you need to replay, without having to inspect _raw. Ultimately, it lets data be filtered easily and efficiently, so we recommend creating partition expressions on commonly searched fields.
You do, however, want to be careful with the partition expression. You don't want its cardinality to be too high; the cardinality has to be reasonable, or you'll end up with far too many open files and potentially some queuing issues as a result.
The next tricky field to consider is the File name prefix expression, located just below the partition expression in the UI. It lets you partition by file name as well, assigning different names to the files themselves, not just the directories where they reside, based on metadata from the events. The output filename prefix defaults to CriblOut.
You do want to be careful here, as this can cut both ways: it can negatively impact both read and write API counts and can dramatically increase the number of open files. Creating too many distinct file names piles files up in a single directory and can push you past your API limits. It can be super useful in Replay filters, but we recommend avoiding it unless you've done your due diligence and have low-cardinality partition expressions.
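Purely as an illustration (this specific expression isn't from the original post), a low-cardinality prefix might key off a single field and fall back to the default when that field is absent:

`${sourcetype ? sourcetype : 'CriblOut'}`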
With a cardinality that’s too high, you run the risk of having too many open files. Ask yourself – can your system handle 100,000 or 200,000 open files? Probably not. So it’s important to keep that in mind.
On the other hand, if your cardinality is too low, then you have less filtering ability during replays as well as a larger chance of hitting your API limits with S3, especially when you’re reading.
So, what is the magic number for partition expression cardinality? The key takeaway is 2,000 as the maximum cardinality to allow. This number isn't set in stone, and you can go higher, but 2,000 is a good baseline to target. Within the Cribl UI, in a Destination's Advanced Settings, there is a Max open files setting that defaults to 100, with a maximum of 2,000.
Here's a good example of a partition expression:
`${C.Time.strftime(_time ? _time : Date.now(), '%Y/%m/%d/%H')}/${state}/${interface}/${server_ip}`
state: 2 possible values
interface: 50 possible values
server_ip: 20 possible values
Potential Cardinality: 2 x 50 x 20 = 2,000
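To make that concrete, here's a plain-JavaScript sketch of the kind of path this expression produces for one event. C.Time.strftime is Cribl's built-in helper; the sketch approximates the '%Y/%m/%d/%H' portion with Date formatting, and the sample event values are made up.

// Sketch only: approximates the partition expression outside of Cribl.
const event = {
  _time: Date.now() / 1000,   // Cribl's _time is in epoch seconds
  state: 'allowed',
  interface: 'eth0',
  server_ip: '10.0.0.12',
};

const d = new Date((event._time ? event._time : Date.now() / 1000) * 1000);
const timePart = d.toISOString().slice(0, 13).replace(/[-T]/g, '/'); // "YYYY/MM/DD/HH"

console.log(`${timePart}/${event.state}/${event.interface}/${event.server_ip}`);
// e.g. "2022/09/20/14/allowed/eth0/10.0.0.12"

// Potential cardinality, ignoring the time segment: 2 * 50 * 20 = 2,000
console.log(2 * 50 * 20); // 2000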
With the partition expression and file name prefix expression now set, let's move on to the max open files setting. We talked about 2,000 as the maximum cardinality for an S3 Destination; remember that this limit applies per worker process. So, with a cardinality of 2,000, any one worker process might have 2,000 files open at a given moment. On a 32-CPU system you might run around 30 worker processes, and 30 processes with 2,000 open files each is 60,000 potentially open files on the system. You need to make sure the operating system is configured to accommodate that many open files.
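The back-of-the-envelope math, using the numbers above, looks like this (the worker-process count is an example, not a rule):

// Rough sizing for OS open-file limits, using the figures from the text.
const workerProcesses = 30;          // e.g. a 32-CPU system with a couple of CPUs reserved
const maxOpenFilesPerProcess = 2000; // the Destination's Max open files ceiling
console.log(workerProcesses * maxOpenFilesPerProcess); // 60000 -> size the OS limits above this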
This is where you can run the commands below to update the relevant OS limit files and then validate the settings to make sure you're good to go. Cribl Stream will close files when either the Max file size (MB) or the Max file open time (sec) condition is met. Here are the commands to update the limits:
# sysctl -w fs.file-max=65536 (as root)
# vi /etc/sysctl.conf (as root)
fs.file-max = 65536
# vi /etc/security/limits.conf (as root; these limits apply to the user running Cribl)
* soft nproc 65536
* hard nproc 65536
* soft nofile 65536
* hard nofile 65536
Verify settings:
# sysctl -p
# cat /proc/sys/fs/file-max
# sysctl fs.file-max
Now that we're here, let's take a look at tuning this S3 Destination. A good practice is to increase Max open files specifically to avoid 'too many open files' errors. While Max open files cannot be greater than 2,000, it's not always a good idea to keep it at the max unless your operating system is set up for it, you have enough disk space, and you've really done your homework on cardinality. When you're estimating cardinality, match the time range you typically search in your analytics tool against what you plan to partition on here, because the number of unique values over a given time window drives the total cardinality.
Another setting to call out is the storage class. Setting the storage class to something like Standard-Infrequent Access or Intelligent-Tiering is good practice for low-access use cases such as threat hunting or compliance audits. In this case, Intelligent-Tiering will automatically take older data that isn't being searched, or has never been fetched, and move it to a lower-cost tier. For more information on Amazon S3 pricing tiers, visit Amazon's pricing page.
The last thing to note on the tuning side is the data format you write to S3. By default, it's JSON. This means the raw event is kept as one field, and a bunch of Cribl fields and other metadata are shipped along with it in each event as a JSON payload. The other option, raw, just takes the _raw field, the original syslog event or the original event that came from CloudTrail, for example, and sends it to the S3 bucket as is.
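To make the difference concrete, here's a rough illustration. The metadata fields shown are typical Cribl fields, and the sample values are made up.

// Illustrative only: roughly what one event looks like in each output format.
// JSON format: the original event lives in _raw, alongside metadata fields.
const jsonLine = JSON.stringify({
  _raw: '<134>Sep 20 14:33:10 fw01 action=allowed src=10.0.0.12',
  _time: 1663684390,
  host: 'fw01',
  source: 'syslog',
  sourcetype: 'pan_traffic',
});

// Raw format: only the original event text is written, as is.
const rawLine = '<134>Sep 20 14:33:10 fw01 action=allowed src=10.0.0.12';

console.log(jsonLine);
console.log(rawLine);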
Don't forget to set compression, as this will usually lower your storage costs dramatically; with compression you typically see anywhere from a 70 to 90% reduction.
Another setting worth calling out is ‘remove staging dirs’. This setting should always be enabled. Cribl will regularly check if any directories are empty and remove them. If this is not set, you can easily have millions of empty directories and run out of disk inodes.
And the last piece to configure is notifications. This allows notifications to be sent if a Destination becomes unhealthy or when backpressure kicks in. It's always good to know the health of your pipelines.
And to close things off, let's chat authentication. This can be set using the buttons in the UI to select the option that fits your environment.
Voilà! If you're here, that should mean S3 is set up, configured correctly, and ready to accept data! Most of what we covered in this post is actually part of the Cribl Certified Observability Engineer (CCOE) Stream Admin certification offered by Cribl. For more in-depth, practically oriented courses and content to help you become a Stream expert, don't forget to check out Cribl University. And for more tips and tricks on how to set up and tune S3, keep an eye on the Cribl docs and blog.