September 13, 2022
At this point, you already know how powerful syslog is (and if you don’t, check out “Introduction to Syslog”). But here’s the thing: scaling your systems to consume high-volume syslog is like fighting zombies. You get weird, unexpected behavior and no easy solutions. Before you fight zombies, though, you have to understand them.
So, here are the challenges for scaling syslog one by one:
Network devices send data using UDP and then forget about it. TCP, by contrast, sets up a connection first, so the sender finds out whether anything is listening at the destination address and port, and the receiver acknowledges the data it gets. This matters because UDP syslog does not travel well over long distances: every network hop makes it more likely that you will drop data. Locate your syslog servers as close to your UDP data sources as possible to improve your chances of capturing all of your data. This is simply the nature of the beast, and you have to understand it.
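To make the difference concrete, here is a minimal Python sketch that sends the same message to a local port with nothing listening on it, once over UDP and once over TCP. The helper names (`find_closed_port`, `send_udp`, `send_tcp`) and the sample message are made up for illustration, and the behavior shown assumes a typical loopback interface.

```python
import socket

def find_closed_port() -> int:
    """Return a localhost port that (almost certainly) has no listener."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))      # let the OS pick a free port
        return s.getsockname()[1]     # the socket is closed on exit

def send_udp(host: str, port: int, payload: bytes) -> int:
    """Fire and forget: sendto() reports bytes handed to the OS, not delivery."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        return s.sendto(payload, (host, port))

def send_tcp(host: str, port: int, payload: bytes) -> bool:
    """TCP needs a handshake first, so a missing listener is detected."""
    try:
        with socket.create_connection((host, port), timeout=2) as s:
            s.sendall(payload)
        return True
    except OSError:
        return False

msg = b"<134>Sep 13 10:00:00 fw01 demo: test message\n"
port = find_closed_port()

udp_bytes = send_udp("127.0.0.1", port, msg)   # "succeeds" with nobody listening
tcp_ok = send_tcp("127.0.0.1", port, msg)      # fails: connection refused

print(f"UDP reported {udp_bytes} bytes sent; TCP delivered: {tcp_ok}")
```

The UDP sender gets no signal that the message went nowhere, which is exactly why distance and hop count matter so much for UDP sources.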
There are two well-defined RFCs that describe what syslog should look like and how it should be transmitted to your logging tools. Yet much of what vendors call syslog is not RFC compliant, which can cause parsing and event management pain. So, what do you do about that? First, don’t make assumptions. Second, verify your data as you onboard it. Be sure to check your timestamps and create the right source and event breakers to consume your data. Back to rule one: never assume that what someone says is syslog is actually syslog and not random data delivered over UDP.
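One way to verify data at onboarding is a quick header check on sample lines. The sketch below assumes the two RFCs in question are RFC 3164 (the older BSD-style format) and RFC 5424; the patterns are rough heuristics for the message header only, not full validators, and the `classify` function and its labels are invented for this example.

```python
import re

# RFC 5424 header: <PRI>1 TIMESTAMP HOSTNAME APP-NAME PROCID MSGID ...
RFC5424 = re.compile(
    r"^<(\d{1,3})>1 "
    r"(?:\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,6})?(?:Z|[+-]\d{2}:\d{2})|-) "
    r"\S+ \S+ \S+ \S+ "
)

# RFC 3164 style header: <PRI>Mmm dd hh:mm:ss HOSTNAME ...
RFC3164 = re.compile(
    r"^<(\d{1,3})>"
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ \d]\d \d{2}:\d{2}:\d{2} "
    r"\S+ "
)

MAX_PRI = 191  # facility 23 * 8 + severity 7

def classify(line: str) -> str:
    """Label a raw line as 'rfc5424', 'rfc3164', or 'non-compliant'."""
    for name, pattern in (("rfc5424", RFC5424), ("rfc3164", RFC3164)):
        match = pattern.match(line)
        if match and int(match.group(1)) <= MAX_PRI:
            return name
    return "non-compliant"

samples = [
    '<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 - boot done',
    "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for user on /dev/pts/8",
    "%ASA-6-302013: Built outbound TCP connection",  # no PRI, no timestamp
]
for line in samples:
    print(classify(line), "|", line[:50])
```

Running a check like this against a sample from every new source tells you early which devices will need custom timestamp extraction or event breaking.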
Anyone ingesting syslog has to plan for spikes and keep the worst-case scenario in mind, because your volume will not always stay at its typical baseline. You need your observability platform most when data volumes are through the roof and everything is going badly: when your firewall is broken and vomiting data, or when you have a breach and the security team is knee-deep in alligators.
Plan for the spike and not your daily average. Work with your business leaders to determine the level of robustness you need and the point at which you expect your systems to start failing. This is a business decision, so offer data to support your options and let your leaders determine what level of investment they are willing to support. Be sure to keep your documentation so you can deal with the inevitable finger-pointing if something goes wrong.
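The gap between sizing for the average and sizing for the spike is easy to show with back-of-the-envelope arithmetic. Every number in this sketch is hypothetical (event rates, per-server capacity, and the 25% headroom), so substitute figures measured in your own environment.

```python
import math

def servers_needed(peak_eps: float, per_server_eps: float, headroom: float = 0.25) -> int:
    """Servers required to absorb `peak_eps` events per second.

    `headroom` is the fraction of each server's capacity kept in reserve.
    """
    usable_eps = per_server_eps * (1 - headroom)
    return max(1, math.ceil(peak_eps / usable_eps))

# Hypothetical figures for illustration only.
daily_average_eps = 20_000    # typical baseline
spike_eps = 120_000           # worst case, e.g. a misbehaving firewall
per_server_eps = 50_000       # measured capacity of one syslog server

for_average = servers_needed(daily_average_eps, per_server_eps)
for_spike = servers_needed(spike_eps, per_server_eps)
print(f"sized for average: {for_average}, sized for spike: {for_spike}")
```

With these made-up inputs, one server covers the average but the spike needs four. A table of a few such scenarios is the kind of data that helps business leaders choose an investment level.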
Securing syslog is an old issue and involves many tradeoffs. Plain UDP syslog travels unencrypted, and because TLS (SSL) runs over TCP, it cannot protect UDP traffic, which can be a challenging topic to discuss with security professionals. You can secure TCP syslog with TLS, but then you run into scaling issues with load balancers and encryption overhead that can impact certain devices. I recommend putting your syslog servers as close as possible to the syslog sources and then using your syslog server to handle encryption, which is a reasonable compensating control.
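As a rough illustration of letting the syslog server handle encryption, the sketch below forwards already-collected messages over TLS using Python's standard `ssl` module. It assumes the downstream receiver accepts syslog over TLS with the octet-counting framing described in RFC 5425; the function names and the optional CA file parameter are hypothetical, and this is not how any particular product implements it.

```python
import socket
import ssl
from typing import Iterable, Optional

def frame(msg: bytes) -> bytes:
    """Octet-counting framing: message length, a space, then the message."""
    return str(len(msg)).encode("ascii") + b" " + msg

def make_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Client context that verifies the receiver's certificate and hostname."""
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx

def forward_tls(host: str, port: int, messages: Iterable[bytes],
                ctx: ssl.SSLContext) -> None:
    """Send each message to the receiver over one TLS connection."""
    with socket.create_connection((host, port), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            for msg in messages:
                tls.sendall(frame(msg))

ctx = make_tls_context()
print(frame(b"<34>1 - - - - - - hi"))
```

A call such as `forward_tls("syslog.example.internal", 6514, batch, ctx)` would then ship a batch to a receiver; the hostname is a placeholder, and 6514 is the port registered for syslog over TLS. The devices themselves keep sending cheap UDP over a short, local hop.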
This is the number one issue in a high-volume environment: consuming the data and getting it into your logging systems without dropping any of it is an enormous challenge. If you don’t think through the scaling challenges, you might create a massive security gap. Keep all of this in mind as you fight the good fight:
Have a discussion with your network and storage engineers about implementing a standardized data format. It is critical to collaborate with your teams and document the standards that you will use as an organization to hold these teams responsible. Every class of devices should have a master logging config that everyone can agree to. Be sure to audit formats as often as possible to catch one-offs and have a process to respond. This is a good task for your overnight teams and contributes to good data governance.
There’s a lot of discussion over the best technique to load balance syslog. Essentially, you need to know if your environment can support it. Pro tip: Don’t implement an IP multicast load balancer if your network and virtualization teams lack the necessary skills. Always implement what you can support and aspire to do more when your teams have the right skill sets. Load balancing syslog is a challenge for many network teams, so be prepared to implement it and then make adjustments until you get it right.
Assume you have a network with many syslog servers, which may end up far away from the source as the network grows. When scaling syslog, it’s vital to collect data as close to the source as feasible. By placing a syslog server near the source, you can send UDP to it and have it send the data to Splunk or Elastic over TCP. As a result, you get the best mix: UDP from the device, a short path to the collector, and TCP delivery to your logging platform. If you use Cribl Stream as your syslog pipeline, it will support queuing: if your destination goes down, Cribl Stream will queue the data, write it to the file system, and then retransmit it when your destination comes back up.
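The queue-and-retransmit idea can be sketched in a few lines of Python. This is a toy illustration of the general pattern, not Cribl Stream's implementation: the `SpoolingRelay` class, the newline-delimited spool file, and the `FakeDestination` stand-in are all assumptions made for the example, and the UDP receive side is left out.

```python
import os
import tempfile
from typing import Callable

class SpoolingRelay:
    """Forward messages; spool them to disk while the destination is down."""

    def __init__(self, send: Callable[[bytes], None], spool_path: str):
        self.send = send              # e.g. a wrapper around a TCP socket
        self.spool_path = spool_path

    def forward(self, msg: bytes) -> bool:
        """Try to deliver; on failure append to the spool. True if delivered."""
        try:
            self.send(msg)
            return True
        except OSError:
            # Assumes one message per line (no embedded newlines).
            with open(self.spool_path, "ab") as f:
                f.write(msg.rstrip(b"\n") + b"\n")
            return False

    def flush(self) -> int:
        """Retransmit spooled messages in order; keep whatever still fails."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, "rb") as f:
            pending = f.read().splitlines()
        sent = 0
        try:
            for msg in pending:
                self.send(msg)
                sent += 1
        except OSError:
            pass                      # destination went down again
        with open(self.spool_path, "wb") as f:
            for msg in pending[sent:]:
                f.write(msg + b"\n")
        return sent

class FakeDestination:
    """Stand-in for a TCP destination that can be switched up or down."""
    def __init__(self) -> None:
        self.up = False
        self.received: list = []

    def send(self, msg: bytes) -> None:
        if not self.up:
            raise ConnectionError("destination down")
        self.received.append(msg)

dest = FakeDestination()
spool = os.path.join(tempfile.mkdtemp(), "syslog.spool")
relay = SpoolingRelay(dest.send, spool)

first = relay.forward(b"<134>Sep 13 10:00:00 fw01 demo: one")   # spooled
second = relay.forward(b"<134>Sep 13 10:00:01 fw01 demo: two")  # spooled
dest.up = True                                                  # destination returns
resent = relay.flush()
print(first, second, resent, len(dest.received))
```

A production relay would also cap the spool size and decide what to drop when the disk fills, which ties back to planning for the spike.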
A considerable portion of what arrives as syslog is not syslog at all but random data from your network devices. You need options for cleaning, repairing, improving, and minimizing that data, and Cribl Stream gives you them. You can also use Stream’s visual validation feature to make sure your data parses as expected.
As an observability engineer, you need to work with syslog data in a scalable way so that you have quality data outputs and your operations and security teams get the right data to help them do their jobs. Using a tool like Cribl Stream is a step in the right direction toward solving your problems with syslog data. Check out part 3 of our Syslog series to see how Cribl Stream helps enterprises manage syslog scaling.
The fastest way to get started with scaling syslog is with Cribl Stream. Try the Free Cloud Sandboxes to get started.