A Simple Guide to Scalable Data Collection from Amazon S3

Written by Dritan Bitincka

March 16, 2020

Scalable data collection from Amazon S3 was introduced back in Cribl LogStream 2.0, and it has been a real workhorse, providing essential capabilities to many of our AWS customers. In this post, we’ll take a look at how it works and how to configure it.

If you’re new to Cribl LogStream, you may want to take our sandbox for a drive before reading further.

How does it work?

Reading data from S3 buckets is usually a fairly simple task: issue an API call to Amazon S3 and read its output. However, a number of challenges must be addressed once you need to periodically scan S3 buckets for new data. First, listing objects/files in an S3 bucket can be rather expensive when it contains thousands or millions of them. Second, keeping track of what has already been read, and who’s reading what, can become a real issue in a distributed environment. And third, resiliency is left entirely up to the reader to implement.

One way to address these issues is by using event notifications through Amazon SQS; a minimal code sketch of this flow follows the steps below.

  1. A new object (data) lands in an S3 bucket.
  2. S3 sends a notification message to an SQS queue.
  3. LogStream Worker Processes are configured as SQS queue consumers.
    1. Each Worker Process reads messages, and SQS marks them invisible to other consumers. This ensures that no two Worker Processes read the same message.
    2. Each Worker Process extracts the S3 object path from each message.
  4. Each Worker Process then goes to S3 and fetches the object.
  5. Each Worker Process then deletes its messages from SQS.
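
Here’s a minimal sketch of that receive-fetch-delete loop in Python with boto3. To be clear, this is an illustration of the underlying AWS APIs, not LogStream’s actual implementation; the queue URL, region, and 600-second visibility timeout are placeholder assumptions.

import json
import urllib.parse

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-notification-queue"  # placeholder

sqs = boto3.client("sqs", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

while True:
    # Receive a batch of messages; SQS hides them from other consumers
    # for the visibility timeout (600s here).
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
        VisibilityTimeout=600,
    )

    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])

        # An S3 notification may carry one or more records; pull the bucket
        # and (URL-encoded) object key out of each one.
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            # Fetch the object from S3 and hand it off for processing.
            obj = s3.get_object(Bucket=bucket, Key=key)
            data = obj["Body"].read()
            print(f"read {len(data)} bytes from s3://{bucket}/{key}")

        # Delete the message so it is not redelivered once the visibility
        # timeout expires.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

Note that S3 URL-encodes object keys in its notifications, which is why the sketch decodes the key before calling GetObject.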

Benefits

  • Fast delivery – there is no need for periodic scanning/listing of buckets. All data is read as soon as notifications make it to SQS, in near real-time.
  • Improved resiliency – if one Worker Process stops processing or becomes unavailable, SQS will make its messages visible to other Worker Processes.
  • Better scalability – higher (read) throughput can be achieved simply by adding more Worker Processes.


Configuration on the AWS side

In this example, we’re assuming a simple setup with only one S3 bucket and one SQS queue. You may have multiple buckets sending notifications to one or more queues, but the configuration is nearly identical in principle. The console steps follow; a scripted boto3 equivalent is sketched after them.

  • Create an SQS queue that will receive direct S3 notifications. Note its ARN.
  • Set up an Amazon S3 bucket that collects logs/events.
  • Configure the SQS queue with a policy that accepts S3 notification events.
    • In its Permissions tab, click Edit Policy Document (Advanced) and replace the current access policy with the one below.
      • Replace SQS-Queue-ARN-Goes-Here and Bucket-Name-Goes-Here as necessary.
{
    "Version": "2012-10-17",
    "Id": "example-ID",
    "Statement": [
        {
            "Sid": "example-statement-ID",
            "Effect": "Allow",
            "Principal": {
                "AWS":"*"
            },
            "Action": [
                "SQS:SendMessage"
            ],
            "Resource": "SQS-Queue-ARN-Goes-Here",
            "Condition": {
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:s3:*:*:Bucket-Name-Goes-Here"
                }
            }
        }
    ]
}
  • Next, configure the S3 bucket to send notifications for all s3:ObjectCreated:* events to the SQS queue above.
    • While in the S3 bucket, go to Properties > Events.
    • Add a notification by selecting All object create events.
    • Under Send to, choose SQS Queue and select the queue from above.

Notifications can additionally be configured for subsets of object prefixes (i.e., notify only on creates in certain “folders”) or suffixes (i.e., notify only on creates of certain “file” extensions).
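
If you prefer to script the AWS-side setup rather than click through the console, the sketch below is a hedged boto3 equivalent of the steps above: attach the access policy to the queue, then point the bucket’s s3:ObjectCreated:* notifications (with an optional suffix filter) at it. The region, bucket name, queue URL/ARN, and the .log suffix are all placeholder assumptions.

import json

import boto3

REGION = "us-east-1"                      # placeholder
BUCKET = "Bucket-Name-Goes-Here"          # placeholder
QUEUE_ARN = "SQS-Queue-ARN-Goes-Here"     # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-notification-queue"  # placeholder

sqs = boto3.client("sqs", region_name=REGION)
s3 = boto3.client("s3", region_name=REGION)

# Allow S3 (from this bucket only) to send messages to the queue;
# this is the same access policy shown earlier.
policy = {
    "Version": "2012-10-17",
    "Id": "example-ID",
    "Statement": [{
        "Sid": "example-statement-ID",
        "Effect": "Allow",
        "Principal": {"AWS": "*"},
        "Action": ["SQS:SendMessage"],
        "Resource": QUEUE_ARN,
        "Condition": {"ArnLike": {"aws:SourceArn": f"arn:aws:s3:*:*:{BUCKET}"}},
    }],
}
sqs.set_queue_attributes(QueueUrl=QUEUE_URL, Attributes={"Policy": json.dumps(policy)})

# Send all s3:ObjectCreated:* events on the bucket to the queue,
# optionally restricted to a suffix (".log" here).
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [{
            "Id": "send-object-created-to-sqs",
            "QueueArn": QUEUE_ARN,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".log"}]}},
        }]
    },
)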

To confirm that notification events are set up correctly, add/upload a sample file to S3 and check the SQS console for new messages.
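
If you’d rather script that check too, a quick sketch (with the same placeholder bucket and queue URL as above) might look like this:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

# Upload a small test object, then look for the resulting notification.
s3.put_object(Bucket="Bucket-Name-Goes-Here", Key="test/notification-check.log", Body=b"hello")

resp = sqs.receive_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-notification-queue",
    MaxNumberOfMessages=1,
    WaitTimeSeconds=20,
)
print(resp.get("Messages", "no notification received yet"))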

Configuration on the LogStream side

Before we start configuring LogStream, let’s make sure we have all the correct permissions in place. LogStream can use the instance’s IAM role (if running on AWS) or AWS Access Key ID/Secret Access Key credentials to reach out to SQS and S3. In either case, the user or the role must have enough permissions to read objects from S3 and to list, read, and delete messages from SQS:

## S3
s3:GetObject

## SQS
sqs:ListQueues
sqs:ReceiveMessage
sqs:DeleteMessage
sqs:SendMessage
sqs:SendMessageBatch
sqs:CreateQueue
sqs:GetQueueAttributes
sqs:SetQueueAttributes
sqs:GetQueueUrl
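
As an illustration only (not an official Cribl-provided policy), here’s one hedged way to grant those actions to a role using boto3; the role name, policy name, bucket, and queue ARN are placeholders, and sqs:ListQueues is granted on all resources because it isn’t resource-scoped.

import json

import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::Bucket-Name-Goes-Here/*",   # placeholder
        },
        {
            "Effect": "Allow",
            "Action": ["sqs:ListQueues"],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:SendMessage",
                "sqs:SendMessageBatch",
                "sqs:CreateQueue",
                "sqs:GetQueueAttributes",
                "sqs:SetQueueAttributes",
                "sqs:GetQueueUrl",
            ],
            "Resource": "SQS-Queue-ARN-Goes-Here",                # placeholder
        },
    ],
}

# Attach the permissions as an inline policy on the role that LogStream will use.
iam.put_role_policy(
    RoleName="logstream-worker-role",       # placeholder
    PolicyName="logstream-s3-sqs-access",   # placeholder
    PolicyDocument=json.dumps(policy),
)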

Navigate to Sources > AWS > S3 and click + Add New.

  • Enter the SQS queue name from above.
  • Optionally, use a Filename Filter that properly describes your objects of interest.
  • Enter the API Key and Secret Key, unless using an IAM role.
  • Select the Region where the SQS queue and S3 bucket are located.
  • Under Event Breakers, add Rulesets as necessary.
  • Under Advanced Settings, change values only if really necessary:
    • Max Messages: The number of messages each receiver (below) can get on each poll.
    • Visibility Timeout: The duration (in seconds) that received messages are hidden from subsequent retrieve requests after being retrieved by a Worker Process. The default value is 600s. Practically speaking, this is the time each Worker Process is given to fetch and process an S3 object.
    • Number of Receivers: The number of SQS pollers to run per Worker Process.

Best Practices

  • When LogStream instances are deployed on AWS, use IAM Roles whenever possible.
    • Not only is it safer, but the configuration is also simpler to maintain.
  • Although optional, we highly recommend you use a Filename Filter.
    • This will ensure that only files of interest are ingested by LogStream.
    • Ingesting only what’s strictly needed reduces latency and processing load, and improves data quality.
  • If higher throughput is needed, increase Number of Receivers under Advanced Settings. However, do note:
    • This is set to 3 by default, which means each Worker Process on each LogStream Worker Node will run 3 receivers!
    • Increased throughput comes with additional CPU utilization.
  • When ingesting large files, increase the Visibility Timeout or consider using smaller objects.
    • The default value of 600s works well in most cases. While you certainly can increase it, we also suggest you consider using smaller S3 objects.

New to Cribl LogStream? Take our Sandbox for a drive!

If you have any questions or feedback, join our community Slack; we’d love to help you out. If you’re looking to join a fast-paced, innovative team, drop us a line at hello@cribl.io; we’re hiring!
