December 23, 2020
As 2020 comes to a close (finally!), I have spent some time reflecting on our engineering achievements and, more importantly, the lessons of the past few years. I am documenting them openly as much for our current Criblanians as for those who are considering joining us.
I believe that a resource-constrained environment is fertile ground for innovation – startups are able to disrupt incumbents that have orders of magnitude more resources. Focusing limited resources on highly leveraged engineering decisions is a proven way to build disruptive solutions. These are the decisions that lead to exponential results over time – a collection of such decisions can give an eng org an unfair advantage. In engineering terms, this is the equivalent of coming up with an algorithm with better time and space complexity. I will list two examples of such decisions we’ve made that have accelerated the building of LogStream.
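To make that complexity analogy concrete, here is a purely toy TypeScript sketch (my illustration, not one of the decisions referred to here): the same duplicate check written two ways, where the second does strictly less work as the input grows.

```typescript
// Toy illustration only: the same question answered with different
// time complexity. A better decision buys this kind of leverage
// across an entire product, not just one function.

// O(n^2): compare every id against every other id.
function hasDuplicatesQuadratic(ids: string[]): boolean {
  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      if (ids[i] === ids[j]) return true;
    }
  }
  return false;
}

// O(n): a single pass with a Set gives the same answer with far less work.
function hasDuplicatesLinear(ids: string[]): boolean {
  const seen = new Set<string>();
  for (const id of ids) {
    if (seen.has(id)) return true;
    seen.add(id);
  }
  return false;
}
```

The point is not the code itself; it is that a single better choice keeps paying off every time the work is done.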
These are just a few examples of key decisions that concurrently addressed three or more problems we were either facing or would soon face. You’ll see this theme intertwined throughout all our principles below.
One of our core values is: Customers First, Always – in engineering the following two questions are key to every new feature we work on:
Building sexy features, using high-leverage engineering decisions, that provide value to no one is … well, pointless. Everyone wants to work on the next shiny feature; at Cribl, however, we focus primarily on the impact of that feature on our customers. In enterprise software there are many mundane, boring, non-sexy features required for the product to be viable long term. Without them, all those cool features driven by the latest in unbounded machine learning and AI would never get a chance to be tried, let alone used. I’ll mention two examples:
In the examples above you’ll again see multiple problems addressed concurrently, with the goal of improving the experience of working with streaming data.
LogStream, once deployed, becomes a critical component in a customer’s infrastructure. As such, we need to ensure a high level of quality. Anyone who’s built distributed systems will tell you: they are hard to build, test, and maintain. We have seen how the code base of such systems grows in complexity over time, as more corner cases are discovered and different failure scenarios are observed. I believe that once a code base reaches a certain level of complexity, that complexity will only keep increasing – the reason is simple: as more corner cases are observed, one-off fixes are added to avoid perturbing the rest of the system, which increases the overall complexity. By design, code is a communication medium between engineers and machines, but it also plays a very important role as a communication tool between engineers (think code reviews, new team members onboarding, or your future self). Code that is simpler to read and reason about is also simpler to get feedback and help on, simpler to test, and simpler to verify for operational correctness – resulting in higher overall quality. We see complexity as a highly undesirable destination and strive to come up with the simplest, most elegant, and most readable solution.
The only perfect piece of software is the one that never ships! Striking a balance between shipping fast and shipping high quality is key for us. By tightly scoping a new feature, shipping it, then improving it based on actual user feedback, we have managed to strike that balance. In our experience, new product features or components of a system will need to be modified, rewritten, or even superseded over a period of 12–18 months. The key reason is that the feature or component is new, so by definition the requirements are likely incomplete, and the testing environment or test cases may not capture all use cases. If the feature is released and gains traction, missing functionality will become apparent and new scaling requirements might surface – sometimes leading to major modifications or even a rewrite. On the other hand, if there is no traction, the rewrite might simply be deprecation and removal. One of the lessons we’ve learned over the past few years is patience: sometimes you can deliver features faster than your customer base can adopt them, which can easily be misinterpreted as no traction. A concrete example here is adding support for a syslog source and destination – it took ~9 months before that feature started gaining traction, during which time we had completely written it off as wasted effort. Today syslog is one of our most popular sources and destinations – give your user base a chance to adopt your work before deprecating it.
From the early days we decided to standardise the development of LogStream on TypeScript – the backend would execute on NodeJS while the frontend would be a React web application. That initial decision was based on a few technical merits, which I won’t go into, but also on a strong belief that a common language would help build a stronger team where everyone could contribute to the overall team’s knowledge. Over the past few years we’ve seen the team’s knowledge expand considerably and easily cross the “frontend/backend” barrier.
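As a hypothetical sketch of why that helps (the names and data shapes below are illustrative, not Cribl’s actual code), a single TypeScript type can be shared by the NodeJS backend and the React frontend, so an engineer crossing that boundary works against the same definitions:

```typescript
// Hypothetical sketch, illustrative names only: one shared type keeps the
// NodeJS backend and the React frontend agreeing on the shape of the data.

// shared/types.ts -- imported by both sides
export interface PipelineSummary {
  id: string;
  functionCount: number;
  eventsInPerSec: number;
  eventsOutPerSec: number;
}

// backend/pipelines.ts -- the NodeJS side builds its payload against the shared type
export function listPipelines(): PipelineSummary[] {
  return [
    { id: "main", functionCount: 4, eventsInPerSec: 1200, eventsOutPerSec: 950 },
  ];
}

// frontend/api.ts -- the React side consumes the very same type when fetching
export async function fetchPipelines(): Promise<PipelineSummary[]> {
  const resp = await fetch("/api/pipelines");
  return (await resp.json()) as PipelineSummary[];
}
```

Change the shared interface and the compiler flags both sides at once, rather than two hand-written models slowly drifting apart.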
As engineers, we are constantly required to learn: new APIs, updates to the runtime, new business requirements, and so on. Reducing the cognitive load of an additional language allows us to spend more time focusing on the problems that matter. Wait, why are we not picking “the best tool for the job” or the newest sexy language on the block (yes, yes, I’m talking about Rust)? The answer is simple: we just do not believe that such choices provide enough leverage to be worth the added complexity and friction they introduce. Drawing a parallel between building software and cooking: pro chefs choose to masterfully use very few general-purpose tools (think knife) rather than a ton of special-purpose tools (onion slicer + avocado slicer + xyz slicer).
As engineers at Cribl, we build and ship software that allows our users to extract value out of all their machine data. We do this by making high leverage decisions, focusing on simple and elegant solutions to important customer problems and continuously learning as a team.