A Simple Guide to Scalable Data Collection from Amazon S3

Written by Dritan Bitincka

March 16, 2020

Scalable data collection from Amazon S3 was introduced back in Cribl LogStream 2.0, and it has been a real workhorse, providing essential capabilities to many of our AWS customers. In this post, we’ll take a look at how it works and how to configure it.

If you’re new to Cribl LogStream, you may want to take our sandbox for a drive before reading further.

How does it work?

Reading data from S3 buckets is usually a fairly simple task: issue an API call to Amazon S3 and read its output. However, a number of challenges must be addressed once you need to periodically scan S3 buckets for new data. First, listing objects/files in an S3 bucket can be rather expensive when it contains thousands or millions of them. Second, keeping track of what has already been read, and who’s reading what, can become a real issue in a distributed environment. And third, resiliency is left entirely up to the reader to implement.

One way to address these issues is by using event notifications through Amazon SQS; a minimal code sketch of this flow follows the steps below.

  1. A new object (data) lands in an S3 bucket.
  2. S3 sends a notification message to an SQS queue.
  3. LogStream Worker Processes are configured as SQS queue consumers.
    1. Each Worker Process reads messages, and SQS marks them invisible to other consumers. This ensures that no two Worker Processes read the same message.
    2. Each Worker Process extracts the S3 object path from each message.
  4. Each Worker Process then goes to S3 and fetches the object.
  5. Each Worker Process then deletes its messages from SQS.
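
Here’s a minimal sketch of that receive-fetch-delete loop in Python with boto3. To be clear, this is an illustration of the underlying AWS APIs, not LogStream’s actual implementation; the queue URL, region, and 600-second visibility timeout are placeholder assumptions.

import json
import urllib.parse

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-notification-queue"  # placeholder

sqs = boto3.client("sqs", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

while True:
    # Receive a batch of messages; SQS hides them from other consumers
    # for the visibility timeout (600s here).
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
        VisibilityTimeout=600,
    )

    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])

        # An S3 notification may carry one or more records; pull the bucket
        # and (URL-encoded) object key out of each one.
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            # Fetch the object from S3 and hand it off for processing.
            obj = s3.get_object(Bucket=bucket, Key=key)
            data = obj["Body"].read()
            print(f"read {len(data)} bytes from s3://{bucket}/{key}")

        # Delete the message so it is not redelivered once the visibility
        # timeout expires.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

Note that S3 URL-encodes object keys in its notifications, which is why the sketch decodes the key before calling GetObject.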

Benefits

  • Fast delivery – there is no need for periodic scanning/listing of buckets. All data is read as soon as notifications make it to SQS, in near real-time.
  • Improved resiliency – if one Worker Process stops processing or becomes unavailable, SQS will make its messages visible to other Worker Processes.
  • Better scalability – higher (read) throughput can be achieved simply by adding more Worker Processes.


Configuration on the AWS side

In this example, we’re assuming a simple setup with only one S3 bucket and one SQS queue. You may have multiple buckets sending notifications to one or more queues, but the configuration is nearly identical in principle. The console steps follow; a scripted boto3 equivalent is sketched after them.

  • Create an SQS queue that will receive direct S3 notifications. Note its ARN.
  • Set up an Amazon S3 bucket that collects logs/events.
  • Configure the SQS queue with a policy that accepts S3 notification events.
    • In its Permissions tab, click Edit Policy Document (Advanced) and replace the current access policy with the one below.
      • Replace SQS-Queue-ARN-Goes-Here and Bucket-Name-Goes-Here as necessary.
{
    "Version": "2012-10-17",
    "Id": "example-ID",
    "Statement": [
        {
            "Sid": "example-statement-ID",
            "Effect": "Allow",
            "Principal": {
                "AWS":"*"
            },
            "Action": [
                "SQS:SendMessage"
            ],
            "Resource": "SQS-Queue-ARN-Goes-Here",
            "Condition": {
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:s3:*:*:Bucket-Name-Goes-Here"
                }
            }
        }
    ]
}
  • Next, configure the S3 bucket to send notifications for all s3:ObjectCreated:* events to the SQS queue above.
    • While in the S3 bucket, go to Properties > Events.
    • Add a notification by selecting All object create events.
    • Under Send to, choose SQS Queue and select the queue from above.

Notifications can additionally be configured for subsets of object prefixes (i.e., notify only on creates in certain “folders”) or suffixes (i.e., notify only on creates of certain “file” extensions).
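
If you prefer to script the AWS-side setup rather than click through the console, the sketch below is a hedged boto3 equivalent of the steps above: attach the access policy to the queue, then point the bucket’s s3:ObjectCreated:* notifications (with an optional suffix filter) at it. The region, bucket name, queue URL/ARN, and the .log suffix are all placeholder assumptions.

import json

import boto3

REGION = "us-east-1"                      # placeholder
BUCKET = "Bucket-Name-Goes-Here"          # placeholder
QUEUE_ARN = "SQS-Queue-ARN-Goes-Here"     # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-notification-queue"  # placeholder

sqs = boto3.client("sqs", region_name=REGION)
s3 = boto3.client("s3", region_name=REGION)

# Allow S3 (from this bucket only) to send messages to the queue;
# this is the same access policy shown earlier.
policy = {
    "Version": "2012-10-17",
    "Id": "example-ID",
    "Statement": [{
        "Sid": "example-statement-ID",
        "Effect": "Allow",
        "Principal": {"AWS": "*"},
        "Action": ["SQS:SendMessage"],
        "Resource": QUEUE_ARN,
        "Condition": {"ArnLike": {"aws:SourceArn": f"arn:aws:s3:*:*:{BUCKET}"}},
    }],
}
sqs.set_queue_attributes(QueueUrl=QUEUE_URL, Attributes={"Policy": json.dumps(policy)})

# Send all s3:ObjectCreated:* events on the bucket to the queue,
# optionally restricted to a suffix (".log" here).
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [{
            "Id": "send-object-created-to-sqs",
            "QueueArn": QUEUE_ARN,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".log"}]}},
        }]
    },
)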

To confirm that notification events are set up correctly, add/upload a sample file to S3 and check the SQS console for new messages.
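
If you’d rather script that check too, a quick sketch (with the same placeholder bucket and queue URL as above) might look like this:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

# Upload a small test object, then look for the resulting notification.
s3.put_object(Bucket="Bucket-Name-Goes-Here", Key="test/notification-check.log", Body=b"hello")

resp = sqs.receive_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-notification-queue",
    MaxNumberOfMessages=1,
    WaitTimeSeconds=20,
)
print(resp.get("Messages", "no notification received yet"))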

Configuration on the LogStream side

Before we start configuring LogStream, let’s make sure we have all the correct permissions in place. LogStream can use the instance’s IAM role (if running on AWS) or AWS Access Key ID/Secret Access Key credentials to reach out to SQS and S3. In either case, the user or the role must have enough permissions to read objects from S3 and to list, read, and delete messages from SQS:

## S3
s3:GetObject

## SQS
sqs:ListQueues
sqs:ReceiveMessage
sqs:DeleteMessage
sqs:SendMessage
sqs:SendMessageBatch
sqs:CreateQueue
sqs:GetQueueAttributes
sqs:SetQueueAttributes
sqs:GetQueueUrl
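
As an illustration only (not an official Cribl-provided policy), here’s one hedged way to grant those actions to a role using boto3; the role name, policy name, bucket, and queue ARN are placeholders, and sqs:ListQueues is granted on all resources because it isn’t resource-scoped.

import json

import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::Bucket-Name-Goes-Here/*",   # placeholder
        },
        {
            "Effect": "Allow",
            "Action": ["sqs:ListQueues"],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:SendMessage",
                "sqs:SendMessageBatch",
                "sqs:CreateQueue",
                "sqs:GetQueueAttributes",
                "sqs:SetQueueAttributes",
                "sqs:GetQueueUrl",
            ],
            "Resource": "SQS-Queue-ARN-Goes-Here",                # placeholder
        },
    ],
}

# Attach the permissions as an inline policy on the role that LogStream will use.
iam.put_role_policy(
    RoleName="logstream-worker-role",       # placeholder
    PolicyName="logstream-s3-sqs-access",   # placeholder
    PolicyDocument=json.dumps(policy),
)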

Navigate to Sources > AWS > S3 and click + Add New.

  • Enter the SQS queue name from above.
  • Optionally, use a Filename Filter that properly describes your objects of interest.
  • Enter the API Key and Secret Key, unless using an IAM role.
  • Select the Region where the SQS queue and S3 bucket are located.
  • Under Event Breakers, add Rulesets as necessary.
  • Under Advanced Settings, change values only if really necessary:
    • Max Messages: The number of messages each receiver (below) can get on each poll.
    • Visibility Timeout: The duration (in seconds) that received messages are hidden from subsequent retrieve requests after being retrieved by a Worker Process. The default value is 600s. Practically speaking, this is the time each Worker Process is given to fetch and process an S3 object.
    • Number of Receivers: The number of SQS pollers to run per Worker Process.

Best Practices

  • When LogStream instances are deployed on AWS, use IAM Roles whenever possible.
    • Not only is it safer, but the configuration is also simpler to maintain.
  • Although optional, we highly recommend you use a Filename Filter.
    • This will ensure that only files of interest are ingested by LogStream.
    • Ingesting only what’s strictly needed reduces latency and processing load, and improves data quality.
  • If higher throughput is needed, increase Number of Receivers under Advanced Settings. However, do note:
    • This is set to 3 by default, which means each Worker Process on each LogStream Worker Node will run 3 receivers!
    • Increased throughput comes with additional CPU utilization.
  • When ingesting large files, increase the Visibility Timeout or consider using smaller objects.
    • The default value of 600s works well in most cases. While you certainly can increase it, we also suggest you consider using smaller S3 objects.

New to Cribl LogStream? Take our Sandbox for a drive!

If you have any questions or feedback, join our community Slack; we’d love to help you out. If you’re looking to join a fast-paced, innovative team, drop us a line at hello@cribl.io; we’re hiring!
