
Exporting Splunk Data at Scale: See a Need, Fill a Need

Written by Jim Apger

July 27, 2022

The Core Splunk platform is rightfully recognized as having sparked the log analytics revolution when viewed through the lenses of ingest, search speed, scale, and usability. Its original design leveraged a MapReduce-style approach, and it still stores ingested data on disk in a collection of flat files organized as “buckets.” These immutable buckets are not human-readable and largely consist of the original raw data, indexes (.tsidx files), and a bit of metadata. Read on as we dive into exporting Splunk data at scale, and how it’s made easy by Cribl Stream and Cribl.Cloud.

As Splunk continues to make progress migrating customers from on-prem or self-hosted platforms to its Splunk Cloud offering, the big question is: what happens to all of the previously ingested data? The current Splunk-supported options for exporting data do not address the portability or reusability of this historical data at scale, even when migrating to Splunk Cloud. Do you keep licensing in place for your historical data until your investigation, hunting, threat intelligence, and regulatory/compliance retention timelines have been satisfied? Someone needs to step up and build a solution for moving that historical data into Splunk Cloud or any other platform that best fits your needs.

I think back on the 2005 movie Robots from time to time and its recurring “See a Need, Fill a Need” theme. The older robots are faced with either expensive upgrades or termination for reasons I don’t want to spoil in this blog. An inventive hero named Rodney Copperbottom stands up against the monopolistic Bigweld Industries corporation to save the day, a lone brave soul armed with his curiosity and persistence.

Splunk customers need a solution that addresses the portability of this previously ingested data without being expensive and without leaving that data to go extinct. It needs to scale, it needs to leverage native Splunk functionality, and it needs to provide the flexibility to allow you, the customer, to format, transit, and use the data exactly as you need. Choice. The solution outlined here involves no proprietary information, and you are free to make use of it, with or without Cribl, to migrate your Splunk-ingested data into Splunk Cloud, object storage, or any other destination.

The Solution

Splunk includes a command-line utility called exporttool, which retrieves the original raw events exactly as they were indexed by Splunk. As a very important bonus, exporttool also provides metadata such as the original source name, sourcetype, and index time, which lets us organize, route, optimize, and reuse or even replay that data perpetually or on-demand. Exporttool is very fast and can either write data to disk or stream it directly from the indexer using the CLI. Examples of how to use this command when exporting Splunk data are detailed below.

Export to stdout:

/opt/splunk/bin/splunk cmd exporttool \
  /opt/splunk/var/lib/splunk/bots/db/db_1564739504_1564732800_2394 \
  /dev/stdout -csv

Export to a local csv file:

/opt/splunk/bin/splunk cmd exporttool \
  /opt/splunk/var/lib/splunk/bots/db/db_1564739504_1564732800_2394 \
  /exports/bots/db_1564739504_1564732800_2394.csv -csv

Writing to disk is generally not optimal, considering that the uncompressed raw data will likely fill your disks and you are still left with decisions about how best to transit, organize, route, and optimize this data to suit your needs. The streaming option of exporttool bypasses storage and disk IO concerns but is still single-threaded, which is something we will solve.
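For example, a minimal way to stream a single bucket off an indexer without touching local disk is to pipe exporttool’s stdout straight into netcat. The worker hostname and TCP port below are illustrative; substitute whatever TCP source you configure in Cribl Stream:

/opt/splunk/bin/splunk cmd exporttool \
  /opt/splunk/var/lib/splunk/bots/db/db_1564739504_1564732800_2394 \
  /dev/stdout -csv | nc cribl-worker.example.com 20000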

Allow me to introduce scribl.py, a Python script that uses exporttool to stream data from your Splunk indexers into Cribl. The script parallelizes both the exporttool work and the transport of data off of each indexer. You point the script at an index of your choosing from your indexer CLI, tell it how many CPUs to use, then assign your Cribl Stream destination IP and TCP port. There are a couple of other options, such as transport via TLS or which time ranges you would like to export, but that’s it. Nothing fancy or overly complicated. Again, send directly to Splunk Cloud for licensed ingest, to object storage, to another platform, or to Cribl for organization, routing, optimization, replay, etc., before sending to any of the aforementioned destinations.

Here is a subtle but important point to consider while exporting from Splunk indexers: regardless of your use case, a significant percentage of your ingested data may be overly verbose given the purpose-built needs of many of your destination platforms. If you are moving your previously indexed Splunk data to Splunk Cloud, this is your chance to hit the “reset” button and have Cribl Stream reformat data in-flight to ensure you free up ingest volume for additional data sources, reduce storage requirements related to retention, decrease the search concurrency (CPU count) you need, and improve cost/performance across the board.

Any gotchas? The scribl script running on your indexers and the Cribl Stream workers are built to scale and will not be your bottleneck. My scale testing shows scribl throughput on a single Splunk indexer writing results to local /dev/null (no network involved) of more than 19 Gb/sec when assigning 30 CPUs. The same config pushes more than 11 Gb/sec when writing to a single Cribl worker over the network within the same AWS availability zone.

Your bottlenecks will almost certainly be bandwidth constraints between your indexers and your final destination. Depending on where you deploy your Cribl Stream workers, that bandwidth bottleneck might exist between the indexers and Cribl workers or between your Cribl workers and the final destination. If you happen to have unlimited bandwidth, you might find your bottleneck to be the ingest rate at your destination platform.

Exporting Splunk Data: The Details

To export Splunk data, you will need:

  • CLI access to each Linux indexer with the index/buckets that need to be exported, which means this process only applies to on-prem or non-Splunk Cloud deployments.
  • To install netcat on each indexer to act as the transport mechanism.
  • To make sure outbound communication from each indexer to the Cribl Worker TCP port is open (a quick connectivity check is sketched just after this list).
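If you want to verify those last two prerequisites up front, something like the following is enough. The worker hostname and TCP port are illustrative; use the values from your own Cribl Stream TCP source:

which nc || echo "netcat is not installed on this indexer"
nc -zv cribl-worker.example.com 20000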

Scribl.py is a Python script that parallelizes the native exporttool functionality available in the Core Splunk platform, as long as you have CLI access to your indexers and can install the OG netcat utility on each of them. Since Python is embedded in every Splunk install, all you need to do is copy/paste the script from the repo, install netcat if it isn’t already installed, and follow the repo instructions for configuring Cribl Stream.

Scribl needs to be run on each of your indexers independently, which is where we achieve an even higher degree of scale by bypassing any bottlenecks related to aggregation components like a search head. As detailed below, scribl will reach into the index that you want to export, build a list of all buckets that need to be exported, then balance the exporting and transiting of the data in each bucket across the number of CPUs that you dedicate to the export process.
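To make that fan-out concrete, here is a rough sketch of the idea in plain shell. This is not scribl.py itself, just an illustration of the concept; the index path, worker hostname, TCP port, and CPU count are all placeholders:

ls -d /opt/splunk/var/lib/splunk/bots/db/db_* \
  | xargs -P 30 -I {} sh -c \
    '/opt/splunk/bin/splunk cmd exporttool "{}" /dev/stdout -csv | nc cribl-worker.example.com 20000'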

Exporting Splunk Data with Cribl

The data will be CSV-formatted into the following fields when delivered to Cribl Stream:

  • _time
  • source
  • host
  • sourcetype
  • _raw
  • _meta

Cribl Stream performs the following while processing the exported Splunk data:

  • Load-balances across multiple Stream Worker nodes for scale.
  • Performs event breaking to handle multi-line events, the most common example being Windows logs.
  • Provides the ability to filter by original sourcetype for granular optimization (reduction, dedup, aggregation, etc.) and routing of data to its final destination.
  • Ensures the _time value represents the original event time carried in the _meta field.
  • Removes the _meta field after it has been referenced for _time.
  • Performs some formatting to clean up the exporttool output.

The scribl GitHub repo provides the scribl.py script, usage instructions, and instructions for configuring Cribl Stream. Don’t forget, you can stand up a Cribl.Cloud instance in a matter of minutes and stream your exported Splunk data to it for testing. If you want to take scribl for a spin in your lab environment, one of the best ways to do so is with one of Splunk’s open-sourced Boss of the SOC datasets. They contain pre-indexed data (many sourcetypes) in a single index, which you can have scribl export and stream to your Cribl Stream worker node(s), then on to Splunk Cloud or any other platform.

Additional resources you may find useful while testing your Splunk data export include the Cribl Sandbox and the Cribl Docs. For more information on Cribl’s solutions, visit our Podcast, LinkedIn, Twitter, or Slack community.

The fastest way to get started with Cribl Stream and Cribl Edge is to try the Free Cloud Sandboxes.

Questions about our technology? We’d love to chat with you.
