As 2020 comes to a close (finally!), I have spent some time reflecting on our engineering achievements and, more importantly, the lessons of the past few years. I am documenting them openly, as much for our current Criblanians as for those who are considering joining us.
High Leverage
I believe that a resource-constrained environment is fertile ground for innovation – startups are able to disrupt incumbents that have orders of magnitude more resources. Focusing limited resources on highly leveraged engineering decisions is a proven way to build disruptive solutions. These are the decisions that lead to exponential results over time, and a collection of them can give an eng org an unfair advantage. In engineering terms this would be equivalent to coming up with an algorithm with better time and space complexity. I will list two examples of such decisions that have accelerated the building of LogStream.
JavaScript – there are many facets to this decision, but for now I’ll focus on the decision to expose JS expressions as the way for users to select and manipulate their data. It is very common for data processing products to create custom DSLs, which require massive effort not only in design and implementation but also in training materials, UX support, troubleshooting tooling, etc. Compare that to adopting the most popular programming language: it not only brings all those costs down to almost zero, but also minimizes time to market.
JSON-schema – from the very beginning we chose to standardise on it and use it throughout the product to solve three problems at once: (a) specify the configuration schema, (b) perform consistent validation, and (c) drive the front-end display. This allowed us to ensure that everything is configurable via the UI with minimal involvement from frontend developers.
These are just two examples of key decisions that each addressed three or more problems we were either facing or would soon face. You’ll see this theme woven through all our principles below.
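To make the two decisions above more concrete, here is a minimal sketch of how they can fit together. It is not LogStream's actual code: the schema shape (a tiny hand-rolled subset of JSON Schema), the names `filterSchema`, `validate`, `toFormFields` and `compileFilter`, and the sample config are all assumptions made for illustration. A real product would use a full JSON Schema validator and would sandbox user expressions rather than evaluate them directly.

```typescript
// Hypothetical sketch; none of these names come from LogStream itself.
type LogEvent = Record<string, unknown>;

// A tiny, hand-rolled subset of JSON Schema (illustration only).
interface PropSchema {
  type: "string" | "number" | "boolean";
  title: string;
  minimum?: number;
}
interface ObjectSchema {
  type: "object";
  required: string[];
  properties: Record<string, PropSchema>;
}

// (a) One schema specifies the configuration of a made-up "filter" function.
const filterSchema: ObjectSchema = {
  type: "object",
  required: ["filter"],
  properties: {
    filter: { type: "string", title: "Filter expression (JavaScript)" },
    sampleRate: { type: "number", title: "Sample rate", minimum: 1 },
  },
};

// (b) The same schema drives validation.
function validate(schema: ObjectSchema, config: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const key of schema.required) {
    if (!(key in config)) errors.push(`${key}: is required`);
  }
  for (const key of Object.keys(schema.properties)) {
    const prop = schema.properties[key];
    const value = config[key];
    if (value === undefined) continue;
    if (typeof value !== prop.type) {
      errors.push(`${key}: expected ${prop.type}`);
    } else if (prop.minimum !== undefined && typeof value === "number" && value < prop.minimum) {
      errors.push(`${key}: must be >= ${prop.minimum}`);
    }
  }
  return errors;
}

// (c) The same schema also drives what the UI renders.
interface FormField {
  name: string;
  label: string;
  input: "text" | "number" | "checkbox";
  required: boolean;
}
function toFormFields(schema: ObjectSchema): FormField[] {
  const inputs = { string: "text", number: "number", boolean: "checkbox" } as const;
  return Object.keys(schema.properties).map((key) => ({
    name: key,
    label: schema.properties[key].title,
    input: inputs[schema.properties[key].type],
    required: schema.required.indexOf(key) !== -1,
  }));
}

// A user-supplied JS expression becomes a predicate over events.
// Assumption: the expression refers to the event as `event`.
// Caution: `new Function` runs arbitrary code; this sketch does no sandboxing.
function compileFilter(expression: string): (event: LogEvent) => boolean {
  const fn = new Function("event", "return (" + expression + ");");
  return (event: LogEvent) => Boolean(fn(event));
}

const config = { filter: "event.status >= 500", sampleRate: 1 };
const configErrors = validate(filterSchema, config); // empty when the config is valid
const keep = compileFilter(config.filter);
const kept = [{ status: 200 }, { status: 503 }].filter(keep);
console.log(configErrors, toFormFields(filterSchema), kept);
```

The point of the sketch is the leverage: adding a new setting means adding one entry to the schema, and validation and the form follow from it, while the filter itself needs no custom language, parser or documentation beyond what JavaScript already has.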
High Impact
One of our core values is: Customers First, Always – in engineering the following two questions are key to every new feature we work on:
What problem are we solving?
… and for whom?
Building sexy features on top of high leverage engineering decisions is, well, pointless if they provide value to no one. Everyone wants to work on the next shiny feature; at Cribl, however, we focus primarily on the impact of that feature on our customers. In enterprise software there are many mundane, boring, non-sexy features required for the product to be viable long term. Without them, all those cool features driven by the latest in unbounded machine learning and AI would never have a chance to be tried, let alone used. I’ll mention two examples:
Capture & Preview – users of streaming systems need to be able to introspect the stream and grab a sample of the live data. This allows them to ensure that data is in the right shape and headed to the right system. If any changes need to be made, we want users to get instant feedback on them, but making changes directly in the production stream is undesirable. This is where preview shines: it lets users feed the sample data through the data processing pipelines they’re working on, which gives them immediate feedback, reduces frustration and increases confidence in their work.
RBAC – this is an enterprise feature (read: non-sexy) that we’ve been working on for our next release. Access control is foundational to making LogStream a self-service system. RBAC empowers admins to grant teams access to their data streams and allow them to shape, filter or route them as they see fit, so the admin no longer has to make and own every change to data processing.
In the above examples you’ll see again multiple problems addressed concurrently, with the goal of improving the experience of working with streaming data.
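As a rough illustration of the preview idea described above, the following sketch runs a captured sample through a pipeline and returns before/after pairs without touching the captured events. Everything here is hypothetical: the `PipelineFn` type, the `preview` function and the two example functions are assumptions for illustration, not LogStream's API, and the deep copy assumes events are plain JSON data.

```typescript
// Hypothetical sketch; these types and names are not LogStream's API.
type LogEvent = Record<string, unknown>;
// A pipeline function returns the (possibly changed) event, or null to drop it.
type PipelineFn = (event: LogEvent) => LogEvent | null;

interface PreviewRow {
  before: LogEvent;
  after: LogEvent | null; // null means the pipeline dropped the event
}

// Run captured sample events through a pipeline, leaving the sample untouched.
function preview(sample: LogEvent[], pipeline: PipelineFn[]): PreviewRow[] {
  return sample.map((original) => {
    // Deep-copy via JSON so pipeline functions cannot mutate the captured event
    // (assumes events contain only JSON-serializable values).
    let current: LogEvent | null = JSON.parse(JSON.stringify(original));
    for (const fn of pipeline) {
      if (current === null) break; // dropped earlier in the pipeline
      current = fn(current);
    }
    return { before: original, after: current };
  });
}

// Two made-up pipeline functions: drop debug events, then mask the user field.
const dropDebug: PipelineFn = (e) => (e.level === "debug" ? null : e);
const maskUser: PipelineFn = (e) => ({ ...e, user: "***" });

const capturedSample: LogEvent[] = [
  { level: "debug", user: "alice" },
  { level: "error", user: "bob" },
];
const rows = preview(capturedSample, [dropDebug, maskUser]);
console.log(rows);
```

Because the pipeline only ever sees copies, a user can iterate on it as often as they like and compare `before` with `after` for each sampled event, with no effect on the production stream.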
Simplicity
LogStream, once deployed, becomes a critical component in a customer’s infrastructure. As such, we need to ensure a high level of quality. Anyone who’s built distributed systems will tell you: they are hard to build, test and maintain. We have seen how the code base of such systems increases in complexity over time, as more corner cases are discovered and/or different failure scenarios are observed. I believe that once a code base reaches a certain level of complexity, that complexity will only keep increasing. The reason is simple: as more corner cases are observed, one-off fixes are added to avoid perturbing the rest of the system, and each one increases the overall complexity. By design, code is a communication medium between engineers and machines, but it also plays a very important role as a communication tool between engineers (think code reviews, new team members on-boarding, or your future self). Code that is simpler to read and reason about is also simpler to get feedback and help on, simpler to test and simpler to verify for operational correctness, resulting in higher overall quality. We see complexity as a highly undesirable destination and strive to come up with the simplest, most elegant and readable solution.
Ship it!
The only perfect piece of software is the one that never ships! Striking a balance between shipping fast and shipping high quality is key for us. By tightly scoping a new feature, shipping it, then improving it based on actual user feedback, we have managed to strike that balance. In our experience, new product features or components of a system need to be modified, rewritten or even superseded over a period of 12–18 months. The key reason is that the feature or component is new, so by definition the requirements are likely incomplete, and the testing environment or test cases may not capture all use cases. If the feature is released and gains traction, then missing functionality becomes apparent and new scaling requirements might surface, sometimes leading to major modifications or even a rewrite. On the other hand, if there is no traction, then the “rewrite” might simply be deprecation and removal. One of the lessons we’ve learned over the past few years is patience: sometimes you can deliver features faster than the customer base can adopt them, which can easily be misinterpreted as no traction. A concrete example is adding support for a syslog source and destination. It took ~9 months before that feature started gaining traction, during which time we had completely written it off as wasted time. Today syslog is one of our most popular sources and destinations. Give your user base a chance to adopt your work before deprecating it.
Continuous Learning
From the early days we decided to standardise the development of LogStream on TypeScript – the backend would execute on NodeJS while the frontend would be a React web application. That initial decision was based on a few technical merits which I won’t go into, but also on a strong belief that a common language would help build a stronger team where everyone would be able to contribute to the team’s overall knowledge. Over the past few years we’ve seen the team’s knowledge really expand, and seen people easily cross the “frontend/backend” barrier.
As engineers, we are constantly required to learn: new APIs, updates to the runtime, new business requirements, etc. Removing the cognitive load of switching between languages allows us to spend more time focusing on the problems that matter. Wait, why are we not picking “the best tool for the job” or the newest sexy language on the block (yes, yes, I’m talking about Rust)? The answer is simple: we just do not believe that such choices provide enough leverage to be worth the added complexity and friction that they introduce. To draw a parallel between building software and cooking: pro chefs choose to masterfully use very few general purpose tools (think knife) rather than a ton of special purpose tools (onion slicer + avocado slicer + xyz slicer).
Conclusion
As engineers at Cribl, we build and ship software that allows our users to extract value out of all their machine data. We do this by making high leverage decisions, focusing on simple and elegant solutions to important customer problems and continuously learning as a team.