Parallelizing with Playwright: A Scalable Win for Cribl.Cloud

May 31, 2024

Written by Usman Zahid

Categories: Engineering

Testing is an oft-forgotten component of robust, production-ready code. It is the moat that protects us from costly service interruptions and fortifies our customers’ trust in our product. Simply put, it’s in the critical path of damn good software. However, as we scale a cloud product to serve a rapidly growing user base, our test scenarios scale with it.

Of all the kinds of testing, end-to-end (E2E) testing most closely mirrors the end-user experience. As developers, we aim to emulate the product navigation and user flows a real user would follow for each feature. But these flows can, over time, become cumbersome and slow, particularly in a non-parallel testing environment. In the case of Cribl.Cloud, we ran into the scaling limitations of our Codecept testing platform: excruciatingly long test execution times and flaky failures. By moving to Playwright as our E2E testing platform, we not only streamlined our tests for a better development experience but also, thanks to test parallelization, cut our execution time by nearly 85% compared to our previous E2E framework.

What’s the Problem?

To understand why Cribl.Cloud needed a new testing platform, we should first look at our previous solution, Codecept, and at why an organization would invest in replacing a working (albeit cumbersome) testing platform.

1.) Slow Testing Platform

Codecept has advantages that make sense for an engineering organization early in its development lifecycle: an abstraction layer that wraps the Playwright drivers, which makes tests easy to write without implementing cumbersome helper functions; readable syntax, which makes test writing simpler than on many other test platforms; and a strong community for support, which matters when developing with third-party libraries. Simply put, tests can be written quickly. Choosing Codecept may have been the right decision at the time.

However, as Cribl.Cloud matured, these advantages turned into disadvantages, largely because of bloated drivers and abstractions in Codecept that our configuration did not need. The result was longer test execution times, which is particularly painful in a CI/CD pipeline where minutes can be crucial for deploying code in critical scenarios like incident response and management. Over seven quarters, as testing scenarios scaled, runtimes increased from around 10 minutes to 50 minutes for a single deployment of end-to-end tests! Couple that with a nearly 50% flaky failure rate and a recently tripled team, and you can start to see the case for investing in a viable alternative that lets us deploy efficiently.

2.) Shared State Due to a Lack of Parallelization

Who loves flaky tests and smashing the re-run CI/CD button into oblivion to get a PR over the line? It’s universally understood that developers must accept a threshold of failures that are inherent to any testing platform. However, in the case of our Codecept implementation, we experienced a failure rate much higher than this threshold, and a root cause analysis pointed to the issue:

A single-threaded test runner using a single set of user credentials for a single organization across multiple test execution runs created an inconsistent, non-atomic, and non-idempotent testing environment.

This is quite possibly the cardinal sin of testing: tests can influence one another through shared state or leftover data, so subsequent tests do not start from a clean slate. That undermines the ability of each test to operate independently and without side effects.

How does someone who needs to get a pull request merged bypass this? Just keep smashing the re-run button; eventually, your pull request wins, and you get to merge. However, behind you is a trail of destruction from failed tests of other merge candidates, forcing a long queue for pull requests hoping to be next in line to merge. Repeat this enough times with a bias towards delivering code quickly, and trust in testing erodes, tests are skipped, and quality is compromised.

Remember earlier how we mentioned end-to-end tests sometimes took nearly 50 minutes? We compound this problem as we scale. This was a massive scalability problem. This was an inefficiency problem. This was a costly problem.

3.) Improved Login Experience

You might be wondering why a single set of credentials was used across all tests and across multiple test runs. The reason is that creating a logged-in user experience in a web application is actually quite complex to implement. In Codecept, leveraging before() hooks ensured that logged-in states were maintained. However, as you scale this solution beyond half a dozen tests, the time overhead starts to add serious baggage and friction to test execution. We needed the test user’s login experience to be abstracted away from the tests and baked into the testing platform itself, not only for testing efficiency but also to create an easy testing environment for newly onboarded engineers in a quickly scaling team.

Be Careful of False Prophets, Unless It’s Playwright

Admittedly, when we ventured to identify and solve this ambiguous set of problems, we did not know how much better the successor would be, or if it would be a success at all. When deciding between different platforms, there wasn’t a consensus—except for the understanding that our current solution with Codecept needed replacement. Candidates like Selenium and Cypress were considered, but in early runtime benchmarking of testing platforms, one candidate stood out: Playwright by Microsoft.

Why Playwright? Well, it just makes sense. By using Codecept, we were already running an abstracted and heavier version of Playwright, so logically, we knew Playwright couldn’t be slower than our current solution. Couple that with Playwright’s parallelization properties and the aforementioned performance benchmarks, and we had a strong candidate. We decided to take an educated gamble on Playwright as our successor to Codecept.

So, how did we achieve this? We knew that we needed to migrate our existing tests onto the new platform without losing coverage, so we devised a phased transition plan: install and set up Playwright; migrate and refactor any missing helpers and functions needed to execute the tests; then test and benchmark our tests in CI/CD, staging, and production environments. After this transition, we deprecated and removed Codecept and all pipelines associated with it.

Okay So You Have the “Successor” – Get to the Good Part

In a perfect world, engineering projects are implemented with no friction and delivered to all stakeholders on a silver platter. In the stochastic, unpredictable real world, however, unforeseen challenges can rear their heads at many different points in the project lifecycle. The challenges we faced included package management issues, compute cost issues, and resource sanitation (another cost issue), amongst others.

Addressing the Slow Testing Platform through Parallelization

We significantly enhanced our testing process by transitioning to Playwright and harnessing its parallelization capabilities. Playwright’s fixtures and built-in support for parallel test execution allowed us to run multiple tests concurrently without a shared state, greatly reducing the overhead and bottlenecks associated with serial test execution. The adoption of parallel testing effectively addressed the scalability and efficiency issues that had challenged our previous framework, allowing us to scale our testing operations.
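Parallel execution in Playwright is controlled from its config file. The following is a minimal sketch of such a config, not our actual one: the `./e2e` test directory, the worker count, and the retry count are illustrative assumptions, while `defineConfig`, `fullyParallel`, `workers`, and `retries` are standard Playwright Test options.

```typescript
// playwright.config.ts (illustrative sketch, values are assumptions)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Hypothetical location of the E2E specs.
  testDir: './e2e',
  // Allow tests inside a single file to run in parallel, not just separate files.
  fullyParallel: true,
  // Cap the number of parallel worker processes in CI; locally, let Playwright choose.
  workers: process.env.CI ? 4 : undefined,
  // Retry once in CI so a single transient failure does not block a merge.
  retries: process.env.CI ? 1 : 0,
});
```

Each worker is a separate process, so the `workers` value is effectively the knob that trades runtime against the number of resources in use at once.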

Enhancing Test Reliability Through Parallelization

Playwright’s adoption marked a shift to a parallelized testing approach, enabling multiple tests to run concurrently without sharing state due to ephemeral organizations dedicated to each test runner. This change effectively eliminated the flakiness and inconsistencies previously experienced due to shared state, ensuring that each test could operate independently and reliably. Playwright’s robust parallelization capabilities greatly enhanced the efficiency and reliability of our testing process, supporting our rapidly scaling needs.

Streamlining the Login Experience with Integrated Fixtures

Playwright’s fixtures provided an innovative solution, allowing us to abstract and automate the login process across tests. Each test now starts from a clean state: a fixture creates an ephemeral user, authenticates it, and initializes it to a standard landing page before the test begins. This approach drastically reduced the setup complexity and time we previously encountered with Codecept, making the testing environment more conducive to rapid development and easier for new engineers joining the team to pick up.
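The setup, use, and teardown shape behind a fixture can be sketched without any framework. This is a simplified illustration rather than our implementation: `EphemeralSession`, `withEphemeralSession`, the `/home` landing page, and the in-memory `liveOrgs` set are hypothetical stand-ins for real provisioning and login calls.

```typescript
// Hypothetical stand-in for the set of organizations that currently exist.
const liveOrgs = new Set<string>();

interface EphemeralSession {
  orgId: string;
  userEmail: string;
  landingPage: string;
}

// Setup: create an org and user unique to this worker, "log in",
// and land on a standard page. All names here are illustrative.
function createSession(runId: string, workerIndex: number): EphemeralSession {
  const orgId = `e2e-${runId}-w${workerIndex}`;
  liveOrgs.add(orgId);
  return { orgId, userEmail: `${orgId}@example.test`, landingPage: "/home" };
}

// Teardown: remove everything the setup created.
function destroySession(session: EphemeralSession): void {
  liveOrgs.delete(session.orgId);
}

// Fixture-style wrapper: the test body only ever sees a ready session,
// and teardown runs even when the body throws.
function withEphemeralSession<T>(
  runId: string,
  workerIndex: number,
  body: (session: EphemeralSession) => T,
): T {
  const session = createSession(runId, workerIndex);
  try {
    return body(session);
  } finally {
    destroySession(session);
  }
}

// Two workers in the same run never share an organization.
const seen = [0, 1].map((w) =>
  withEphemeralSession("run42", w, (s) => s.orgId),
);
console.log(seen); // one distinct org id per worker
console.log(liveOrgs.size); // 0, because teardown ran for both
```

In Playwright itself, the same shape appears as the code before and after the `await use(...)` call in a fixture registered through `test.extend`.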

…However Perfect Parallelization Doesn’t Exist

Let’s go back to that perfect world, where we have unlimited resources and compute to execute whatever we want without consequences. There, everything can run perfectly in parallel, and a test suite runs only as long as its slowest test. We naively tried to do this, and we took the 50-minute runtime mentioned earlier to as low as 2 minutes! However, what worked locally turned out to be costly: we used ephemeral organizations that, if not cleaned up properly, incurred significant AWS cloud compute costs, both operationally and financially. We definitely FAFO’d…

Testing with AWS Organizations Can Cost A Lot Operationally

Because we are a cloud team whose tests use real users assigned to real AWS organizations in staging and production environments, we hit problems that we didn’t see when running tests locally against LocalStack.

1.) Organizations are free in LocalStack, as it’s an AWS emulator. In staging and production environments they cost money, like real money.

When developing locally, we had a lot of fun choosing how many orgs to run in parallel by adjusting the Playwright config file. Initially we tried running every test at once, which got us to that 2-minute runtime. However, we found a constraint very early on: our test runs were quicker than the Terraform provisioning lifecycle for our organizations. Why does this matter? When your organization provisioning lifecycle is roughly 10 minutes and your tests take 50 minutes, there is no risk of needing organizations faster than they can be provisioned. Take that runtime to under 5 minutes, and you will start to see how this can become an issue.

2.) Using more AWS organizations for testing than are available can create availability issues for actual users trying to sign up and provision an org in staging and production environments.

Since we were deploying and destroying so many orgs at once, we started to see that users couldn’t get pre-provisioned organizations, and this created availability issues for our platform. This was our first resourcing issue, and we had to protect our resources by implementing checks that stop test runs from depleting our available organizations beyond critical levels.
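Both constraints above come down to simple arithmetic. The sketch below is a back-of-the-envelope model, not our production check: it assumes test runs start back to back, each run consumes a fixed number of organizations, and replacements take a fixed time to provision. The function names, the 4 orgs per run, and the pool sizes are hypothetical.

```typescript
// How many orgs must already be provisioned to keep back-to-back runs fed,
// given that a replacement org takes `provisionMinutes` to become ready.
function orgsNeededInPool(
  provisionMinutes: number,
  runMinutes: number,
  orgsPerRun: number,
): number {
  // Number of runs that start before the first replacement batch is ready.
  const runsPerProvisionWindow = Math.ceil(provisionMinutes / runMinutes);
  return runsPerProvisionWindow * orgsPerRun;
}

interface PoolStatus {
  available: number; // pre-provisioned orgs ready to hand out
  reservedForUsers: number; // floor kept back for real sign-ups
}

// Guard: only start a test run if it leaves the user reserve untouched.
function canStartTestRun(pool: PoolStatus, orgsNeeded: number): boolean {
  return pool.available - orgsNeeded >= pool.reservedForUsers;
}

const slowRuns = orgsNeededInPool(10, 50, 4); // 50-minute runs
const fastRuns = orgsNeededInPool(10, 2, 4); // 2-minute runs
console.log(slowRuns); // 4
console.log(fastRuns); // 20

const pool: PoolStatus = { available: 30, reservedForUsers: 15 };
console.log(canStartTestRun(pool, slowRuns)); // true
console.log(canStartTestRun(pool, fastRuns)); // false
```

Under these assumptions, the same 10-minute provisioning time that was invisible with 50-minute runs needs a pool five times larger once runs drop to 2 minutes, which is exactly when a reserve check starts to matter.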

Failing to Clean Up AWS Organizations Can Cost A Lot Financially

After deploying Playwright in staging and production, we noticed a troubling trend: our AWS cloud spend was climbing every day. A quick investigation revealed that over 750 AWS cloud organizations had been spun up in the last 48 hours in staging and production. Normally, such rapid growth would be cause for celebration, perhaps suggesting significant expansion. However, we discovered that these organizations were associated with the Cribl test automation user.

After shutting down the tests, a root cause analysis pointed to process isolation on the Ubuntu Bitbucket runners in our CI environment. Playwright’s deploy and cleanup steps were triggered by the parent process, while the test runners were forked child processes. The list of provisioned organizations was stored on an in-memory data volume, which is not shared across that process boundary. As a consequence, the teardown process (a parent process) lacked the context it needed to make the cleanup API calls that tear down AWS organizations, since that information had been written to isolated memory by the child processes, and every attempted cleanup failed for 3 days. We eventually resolved the issue by writing organizations and users to an external temporary volume on the runner. This allowed the parent process to perform cleanup independently of the subprocess spawn hierarchy on CI, saving the day and a significant amount of money.
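The idea behind the fix can be sketched as a small file-backed registry: any process that provisions an organization appends its ID to a file, and the teardown process reads the file back. This is a minimal illustration using Node’s standard `fs`, `os`, and `path` modules, not our actual code. The `E2E_ORG_REGISTRY` environment variable, the file name, and the function names are hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Every process must resolve the same path, so it comes from the environment
// (hypothetical variable name) or falls back to a fixed file in the temp dir.
const registryPath =
  process.env.E2E_ORG_REGISTRY ?? path.join(os.tmpdir(), "e2e-orgs.ndjson");

// Called by whichever process provisions an org: append one JSON line.
function recordOrg(orgId: string, file: string = registryPath): void {
  fs.appendFileSync(file, JSON.stringify({ orgId }) + "\n");
}

// Called by the teardown process: read back every recorded org id.
function readRecordedOrgs(file: string = registryPath): string[] {
  if (!fs.existsSync(file)) return [];
  return fs
    .readFileSync(file, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => (JSON.parse(line) as { orgId: string }).orgId);
}

// Demo: start from an empty registry, as a parent process would before a run.
fs.rmSync(registryPath, { force: true });
recordOrg("org-a");
recordOrg("org-b");
console.log(readRecordedOrgs()); // both ids, read back from disk
```

Because the list lives on disk rather than in one process’s memory, it no longer matters which process in the hierarchy did the provisioning and which one does the cleanup.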

The Happy Parallelization Medium

What we discovered was that perfect parallelization unfortunately doesn’t exist. Mathematically, there’s an asymptotic upper limit to the performance gained from parallelization, and the optimum balance between performance and resource cost lies somewhere between the two extremes. By using only three to four test runners in parallel, we achieved an 85 percent performance boost, reducing our test times from 50 minutes to under 5 minutes. Beyond that point the returns diminish relative to the operational costs, and stopping there proved to be a satisfactory trade-off for our use case.
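The asymptotic limit can be illustrated with an Amdahl’s-law-style estimate, in which a fixed serial portion is added to a portion that divides evenly across workers. The numbers below are purely illustrative (1 minute of serial work, 12 minutes of parallelizable work) and are not our measured timings. `estimatedMinutes` is a hypothetical helper.

```typescript
// Amdahl-style estimate: a fixed serial part plus a part that divides across workers.
function estimatedMinutes(
  serial: number,
  parallelizable: number,
  workers: number,
): number {
  return serial + parallelizable / workers;
}

// Illustrative numbers only: 1 minute serial, 12 minutes parallelizable.
for (const workers of [1, 2, 4, 12]) {
  console.log(workers, estimatedMinutes(1, 12, workers));
}
// workers -> minutes: 1 -> 13, 2 -> 7, 4 -> 4, 12 -> 2
```

In this toy model, going from 1 to 4 workers saves 9 minutes, while tripling the workers again from 4 to 12 saves only 2 more, and no number of workers gets below the 1-minute serial floor.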

In Closing

The transition to Playwright at Cribl.Cloud marked a significant evolution in our approach to software testing, dramatically enhancing the efficiency and reliability of our processes. By implementing Playwright’s parallelization capabilities and integrated fixtures, we mitigated the scalability and efficiency challenges posed by our previous framework and revolutionized how we manage and execute tests. This strategic upgrade reduced our test execution times from 50 minutes to under 5 minutes in CI, and to as low as 2 minutes in fully parallel local runs, underscoring our ability to adapt and innovate rapidly.

However, this transition also brought to light new challenges, particularly in the management and financial implications of using AWS Organizations extensively for testing. The issues of resource depletion and high operational costs were significant but were addressed through innovative solutions like external volume storage for cleanup processes, ensuring the sustainability and cost-effectiveness of our testing environment.

As we refine our testing strategies and infrastructure, the lessons learned from deploying Playwright reinforce the importance of continuous evaluation and adaptation in technology choices. They also highlight our commitment to maintaining a robust, efficient, and economically feasible testing framework that supports our growth and the reliability of our products.

Written by

Usman Zahid
