This is Part One of a series of blog posts on troubleshooting Cribl Stream. Part One focuses on identifying and troubleshooting issues with Sources and Destinations in Stream. I will cover some of the common problems that users face and how you can work through them to find the root cause.
Is Something Wrong? How to Identify Issues
The first step in troubleshooting any issue is identifying what the issue is, so that you know where to look for the root cause. Stream provides several ways of identifying issues. The Monitoring page shows the health of your Stream deployment: system resources, traffic in and out of the system, collection jobs and tasks, groups, workers, Sources, Destinations, and more. On leader nodes, coverage is limited to the previous 24 hours. You can also configure Stream to send its internal logs/metrics to a third-party system.
To see how Stream is sending or receiving data, go to Monitoring -> Data -> Sources/Destinations. Here you can see a list of configured Sources/Destinations and their status. A healthy Source/Destination has a green check; an unhealthy one has either a yellow exclamation point (!) if it is experiencing issues or a red exclamation point (!) if it is not working. If you see any problems, click the unhealthy status indicator to drill into the associated errors or warnings.
Notifications
In Stream 3.1 and later, you can configure notifications for Destinations that report errors, experience backpressure, and more. These notifications can be a proactive way to let you know there is an issue with a Destination without looking at the Monitoring page. You can configure notifications per Destination. Conditions include Destination backpressure activated, persistent queue usage, and unhealthy destination.
The default target for sending notifications is System Messages. However, you can add additional targets, such as PagerDuty or a webhook.
Cribl Stream Logs
A great place to look at Stream logs is the Monitoring -> Logs page, which provides an interface to filter through the leader node logs, worker group logs, and worker node logs. Log levels are set per worker group: to change them, go to Settings -> Logging -> Levels.
JavaScript expressions are supported in the Search field. For example, you can use message.includes('error') to see events with "error" in the message field. You can filter on any field, and you can choose which fields to display in the results by selecting them on the left-hand side of the page.
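If you have copied a log file off a node and want to apply the same kind of filter offline, the idea can be sketched in a few lines of Python. This is only an illustration: it assumes the log is newline-delimited JSON with a message field, and the channel names and sample events below are made up.

```python
import json


def filter_logs(lines, needle="error"):
    """Return parsed events whose `message` field contains `needle`.

    Mirrors the Search-field expression message.includes('error'):
    a case-sensitive substring match on the message field.
    """
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(event, dict):
            continue
        if needle in str(event.get("message", "")):
            matches.append(event)
    return matches


# Hypothetical sample events, for illustration only.
sample = [
    '{"channel": "output:example", "level": "error", "message": "connection error"}',
    '{"channel": "input:example", "level": "info", "message": "listening"}',
    "this line is not JSON",
]

for event in filter_logs(sample):
    print(event["level"], event["channel"], event["message"])
```

To run it against a real file, pass an open file object instead of the sample list, since iterating a file yields one line at a time.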
You can find a detailed overview of Cribl’s internal logs in our documentation.
Tips for Troubleshooting
Now that we have gone over some ways of identifying when a problem is happening, what are some ways you might go about finding the root cause?
General Problems with Sending/Receiving
Let’s start with some general issues that we come across:
- Typos: Typos are easy to make and can be difficult to troubleshoot. Double-check hostnames, IPs, ports, and metadata fields, and look for missing quotes in your Source/Destination configurations.
- TLS/SSL certs, keys, and passphrases: Make sure everything is supported and in sync at both ends.
- Tenant IDs and topics/subscriptions: Double-check you have the correct IDs or topics/subscriptions for Sources/Destinations that require it.
- Permissions: Double-check you have the correct permissions on the machine. For example, does the application have permission to listen on a privileged port?
- Proxy settings: Make sure your proxy settings are correct.
- Licensing and telemetry: If you are using a free license, telemetry is required. If telemetry is blocked or your license is expired then inputs will be blocked.
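The permissions item above is easy to check directly. The sketch below tries to bind a TCP socket and reports why the bind failed, if it did. It is a generic operating-system check, not part of Stream, and the function name try_bind is made up for this example. Run it as the same user that runs Stream, otherwise the result tells you nothing about the service account.

```python
import errno
import socket


def try_bind(host, port):
    """Attempt to bind a TCP socket to (host, port); return (ok, reason)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True, "ok"
    except PermissionError:
        # Typical when a non-root user asks for a port below 1024.
        return False, "permission denied (privileged port?)"
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False, "address already in use"
        if exc.errno == errno.EADDRNOTAVAIL:
            return False, "address not available on this host"
        return False, str(exc)
    finally:
        sock.close()
```

For example, calling try_bind("0.0.0.0", 514) shows whether the current user may listen on the standard syslog port. An "address already in use" result usually means the Source itself, or another process, already holds the port.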
Event Processing Order
When troubleshooting sending/receiving issues within Stream, knowing the event processing order is crucial. This is important for helping you understand where the issue lies. Is the issue with the Source, Destination, Route, Pipeline, or something else entirely?
Typically, the best way to troubleshoot sending or receiving data within Stream is to start at the Destination and work your way back up the data stream toward the Source.
Destinations
Stream supports the following types of Destinations:
- Streaming: Destinations that accept events in real-time. (e.g., Splunk, Syslog)
- Non-Streaming: Destinations that accept events in batches. (e.g., S3, Filesystem)
- Output Router: Enables selection of Destinations based on rules.
- DevNull: a special Destination that drops all events.
Destination Problems:
- View the Logs tab for the Destination first to see if the logs hint at what the problem is.
- Is the Destination operational/reachable?
  - Can you ping the server?
  - Test connectivity to the Destination's host and port using tools such as nc or telnet.
  - Run a test for the Destination by clicking “Run Test” in the Destination configuration under the “Test” tab.
- Does a live capture show events before they reach the Destination?
  - You can test this by doing a live capture and selecting “4 – Before the Destination” on the “where to capture” dropdown.
- Is the persistent queue (PQ) being engaged?
  - You can check this by looking at worker process logs or by looking at Monitoring -> Queues.
  - Is data being accumulated in the PQ? Check $CRIBL_HOME/state/queues/
  - Are permissions for writing to the queue correct?
- Is the data payload properly configured? For example, malformed HTTP payloads can cause 4xx or 5xx level errors.
- Is a proxy required?
  - Are the proxy environment variables set?
  - If systemd is in use, are the proxy variables defined correctly in the systemd unit file, or only in the user’s bash profile?
- Is the Destination newly configured?
  - Were the changes saved and deployed? (distributed)
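If nc or telnet is not installed on the worker, the reachability check from the list above can be approximated with Python's standard library. This is a minimal sketch, not a Stream feature; the function name check_tcp and the three-second timeout are choices made for this example.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Try to open a TCP connection; return (reachable, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer at all often points at a firewall or routing problem.
        return False, "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return False, "connection refused"
    except OSError as exc:
        # Covers DNS failures, unreachable networks, and similar errors.
        return False, f"error: {exc}"
```

Run it from the worker node itself, for example check_tcp("destination.example.com", 9997) with your own host and port substituted, so that the test follows the same network path as Stream's traffic. A successful connect only proves the port is open; it says nothing about TLS, authentication, or payload format.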
Sources
- Collectors: Sources we collect data from intermittently, either ad hoc or on a preset schedule (e.g., REST, filesystem/NFS, Azure Blob, custom scripts).
- Pull: Sources that we pull from (e.g., S3, Kinesis)
- Push: Sources that push to us (e.g., Splunk, TCP)
- Internal: Sources that are internal to us (e.g., Datagens, Internal logs/metrics)
Source Problems:
- What is the status of the Source?
  - Note that Sources will show a red status on the leader until they are deployed to a worker group. The status can remain red after deployment if there are binding issues (privileges, a non-existent IP bound to an interface, etc.).
- If you do a live capture on the Source, are there any events?
  - Make sure the JavaScript filter set for the live capture is correct.
  - If no data is returned, the problem is likely with the network or further upstream.
- Is the Source operational/reachable?
  - Can you ping the server?
  - Use nc or telnet to test the connection to the Source.
  - Does a packet capture/tcpdump show data is being received?
- Is the Destination for this Source triggering backpressure?
  - You can check by going to the Destination in Monitoring -> Destinations and clicking on the Status. Under the Logs section, look for logs with level “warn” and the message “begin backpressure [blocking]”, as well as logs with level “warn” and the message “sending is blocked”.
  - If the Source is connected via a Route to a Destination that is triggering backpressure, the Source might stop sending data if backpressure behavior is set to Block.
- Check the Source configuration.
  - Are there any typos?
  - Do you have proper authentication?
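The backpressure messages quoted above can also be counted in an exported log file, which helps when you want to see which Destination is blocking most often. The sketch below assumes newline-delimited JSON logs with level, message, and channel fields; treat those field names as assumptions and adjust them to match what you see in your own logs.

```python
import json
from collections import Counter

# Substrings of the two warn-level messages mentioned in the checklist.
BACKPRESSURE_MARKERS = ("begin backpressure", "sending is blocked")


def count_backpressure(lines):
    """Count warn-level backpressure messages per log channel."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(event, dict) or event.get("level") != "warn":
            continue
        message = str(event.get("message", ""))
        if any(marker in message for marker in BACKPRESSURE_MARKERS):
            counts[event.get("channel", "unknown")] += 1
    return counts
```

Calling count_backpressure(open("worker.log")) on a hypothetical exported file returns a Counter keyed by channel. A channel with a steadily growing count is a good place to start looking before you go back to the Source.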
Resources
There are plenty of resources available to help you troubleshoot issues with Cribl Stream quickly. Several are in our documentation.
- Common errors – This is an excellent list of common errors you might encounter when setting up Stream. It is a great first place to check the error you are receiving before googling it. Many of the issues you’ll encounter are documented here with easy fixes.
- Known issues – This is a list of known issues documented with the version affected, a description of the problem, a workaround, and the planned fix version.
- Cribl Community – Join our Community of Criblanians and Users. This is a great place to ask questions, get help, or suggest topics for future blog posts!
Finally, if you have run out of ideas for troubleshooting an issue, it might be time to contact support. When you do, uploading a diag of the affected Cribl servers to your case will speed up the process.