In the data business, we often refer to the series of steps or processes used to collect, transform, and analyze data as “pipelines.” As a data scientist, I find this analogy fitting, as my concerns around data closely mirror those most people have with water: Where is it coming from? What’s in it? How can we optimize its quality, quantity, and pressure for its intended use? And, crucially, is it leaking anywhere?
As it turns out, many valuable lessons from the world of plumbing apply to data management. One key takeaway is the importance of control and relief valves, which help regulate water volume and pressure while allowing for the swift isolation of problems and management of risk. In commercial plumbing, these valves are installed at virtually every water fixture and pipe connection, ensuring issues are quickly contained without disrupting the water supply for the rest of the building. Contrast this with residential plumbing, where a limited number of valves may mean shutting off the water for the entire house to address even minor issues.
If data is a critical resource for your organization, elevating your data plumbing to commercial standards by deploying a pipeline between your sources and destinations is essential. Here are a few areas where applying commercial plumbing best practices can significantly improve your pipeline infrastructure:
Controlling water pressure is crucial for a pleasant experience, preventing both the mess of an overpressurized sink and the frustration of an underpressurized shower. Similarly, managing data pressure is vital to avoid performance degradation, unexpected costs, or data loss.
When data volume spikes unexpectedly, whether from a brute-force attack, a port-scanning event, or the dreaded self-pwn via malfunction or misconfiguration, the fallout can be costly. Beyond the monetary costs, companies may also experience pipeline degradation or failure, leading to data loss and potential lapses in security coverage. To address this, many organizations employ elastic pipelines that adapt dynamically to changing data streams, advanced monitoring and alerting software, and better utilization of on-premises computing. Despite these efforts, a significant number of organizations still lack the control and visibility required to operate a modern enterprise effectively.
Data pipelines help control data volume through purpose-built filtering and enrichment, dropping null, duplicate, or unnecessary fields along the way. They reduce the amount of redundant data moving around the organization by collecting data once and creating a custom stream for each use case that contains only the fields that use case needs, as sketched below.
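As a rough illustration, here is a minimal Python sketch of that collect-once, filter-per-destination pattern. The field names, destinations, and routing rules are hypothetical examples for this post, not any particular product's API.

```python
# Hypothetical field names and destinations; not any specific product's API.

RAW_EVENT = {
    "timestamp": "2024-05-01T12:34:56Z",
    "src_ip": "203.0.113.7",
    "user": "jdoe",
    "debug_blob": "x" * 4096,   # large field no downstream consumer needs
    "status": None,             # null field worth dropping
}

# Each downstream consumer declares only the fields it actually needs.
STREAMS = {
    "siem": {"timestamp", "src_ip", "user"},
    "analytics": {"timestamp", "status"},
}

def slim(event: dict, keep: set) -> dict:
    """Drop null values and any field the destination did not ask for."""
    return {k: v for k, v in event.items() if k in keep and v is not None}

def route(event: dict) -> dict:
    """Collect once, then emit a purpose-built copy for each destination."""
    return {dest: slim(event, keep) for dest, keep in STREAMS.items()}

if __name__ == "__main__":
    for dest, payload in route(RAW_EVENT).items():
        print(dest, payload)
```

The point of the pattern is that each destination pays only for the data it actually uses, rather than every consumer receiving a full copy of every event.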
The speed at which data can flow through an organization, known as data velocity, plays a crucial role in performance and, especially, security. Pipelines give security teams the ability to enrich data in near real time, adding critical information like IP location, endpoint asset information, and standardized timestamps to logs before they’re ingested by the security platform. Without a pipeline, enrichment typically happens inside the security platform itself: some platforms add this enrichment minutes to hours later, while others add it only at search time, by which point the data may have changed or may never be seen at all. Enriching events at ingestion captures them with complete information, so detections can fire the moment an event arrives.
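The sketch below shows what ingest-time enrichment can look like, in the same hypothetical Python style; the lookup tables stand in for the GeoIP and asset-inventory services a real deployment would call, and the field names are assumptions.

```python
# Hypothetical ingest-time enrichment; the lookup tables stand in for GeoIP
# and asset-inventory services a real deployment would query.

from datetime import datetime, timezone

GEO_LOOKUP = {"203.0.113.7": {"country": "US", "city": "Ashburn"}}
ASSET_LOOKUP = {"host-42": {"owner": "payments-team", "criticality": "high"}}

def normalize_ts(raw) -> str:
    """Convert epoch seconds to ISO 8601 UTC; pass ISO strings through."""
    if isinstance(raw, (int, float)):
        return datetime.fromtimestamp(raw, tz=timezone.utc).isoformat()
    return raw

def enrich(event: dict) -> dict:
    """Attach location, asset context, and a standardized timestamp
    before the event reaches the security platform."""
    enriched = dict(event)
    enriched["geo"] = GEO_LOOKUP.get(event.get("src_ip", ""), {})
    enriched["asset"] = ASSET_LOOKUP.get(event.get("host", ""), {})
    enriched["timestamp"] = normalize_ts(event.get("ts"))
    return enriched

print(enrich({"src_ip": "203.0.113.7", "host": "host-42", "ts": 1714566896}))
```

Because the context is attached as the event is collected, a detection that depends on asset criticality or source location can run immediately, rather than waiting for a later enrichment job or a search-time join.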
Unexpected “leaks” or breaches can have serious consequences. For water pipelines, a leak can lead to property damage or even health hazards if the water becomes contaminated. Similarly, for data, a leak or breach can result in exposed sensitive data, interrupted operations, and lost revenue. Data pipelines provide control over the flow of data from beginning to end and enable the quick detection and mitigation of contamination, leaks, and malfunctions. Pipelines also enable the monitoring of data flow and quality, providing critical visibility to help quickly identify leaks or contamination.
Just as shut-off valves can prevent water damage to a property, data management pipelines can help prevent “data damage” or loss. As attackers move up the supply chain, SaaS vendor agents and applications are becoming a common attack vector. In the event of a software vendor compromise, data pipelines allow data to be shut off or diverted to an alternate destination, such as a data lake. This prevents data loss while still stopping the flow of sensitive data to a compromised destination and helps reduce the potential for further data exfiltration.
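Here is a minimal sketch of that shut-off valve behavior, again in hypothetical Python: when a destination is flagged as compromised, events are parked in a data lake for later replay instead of being delivered or dropped. The destination names and the compromise flag are placeholders for illustration.

```python
# Hypothetical shut-off valve: destination names and the compromise flag are
# placeholders, not a real product's configuration.

COMPROMISED = {"saas_vendor_x"}   # flipped by an operator or an automated alert

def send_to_data_lake(event: dict) -> None:
    print("parked in data lake for later replay:", event)

def send_to_destination(dest: str, event: dict) -> None:
    print(f"delivered to {dest}:", event)

def deliver(event: dict, destination: str) -> None:
    """Divert events away from a compromised destination instead of
    dropping them, so nothing is lost while the valve is closed."""
    if destination in COMPROMISED:
        send_to_data_lake(event)
    else:
        send_to_destination(destination, event)

deliver({"user": "jdoe", "action": "login"}, "saas_vendor_x")
```

The design choice worth noting is that the valve diverts rather than discards: the compromised destination stops receiving sensitive data immediately, while the diverted stream remains available to replay once the vendor is remediated.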
Elevating your data pipelines to commercial standards is essential for organizations that rely heavily on data as a critical resource. Companies can boost the resilience, efficiency, and security of their data infrastructure by leveraging lessons from commercial plumbing, incorporating best practices to manage pressure, volume, and velocity, and effectively identifying and isolating leaks. Commercial-grade pipelines help organizations safeguard their most valuable asset—data—and maintain a competitive edge in today’s data-driven business landscape.
Experience a full version of Cribl Stream and Cribl Edge in the cloud with pre-made sources and destinations.