Maintaining operational visibility and controlling costs are critical in demanding cloud environments. At scale, native solutions like AWS CloudWatch can become costly, particularly when handling large volumes of metrics across multiple organizations. This is where a custom CloudWatch Exporter proves valuable, offering a more tailored and cost-effective approach to cloud monitoring. Essentially, the CloudWatch Exporter gathers metrics from AWS CloudWatch and exposes them in a format compatible with external monitoring systems like Prometheus, so you can centralize all your metrics in one place.
Why a Custom CloudWatch Exporter?
The need for a custom CloudWatch exporter arises from the significant costs associated with AWS CloudWatch in large-scale operations. In the CloudWatch setup shown below, metric streams push more than 1,000 metrics per minute, per tenant, from sources such as EC2, Load Balancer, ASG, EFS, S3, and SNS into a pipeline that includes Kinesis, S3, Cribl Stream, and Grafana for visualization. Across all organizations, this adds up to hundreds of thousands of dollars per month.
The issue stems from the inefficiency of pushing excessive metrics that are eventually filtered out downstream. The default setup incurs unnecessary data transfer costs and results in redundant metrics collection, which adds to the storage and operational overhead.
AWS native way to extract CloudWatch metrics to a centralized observability system
The Cost Challenge
The breakdown of costs illustrates the burden:
CloudWatch Metric Stream: Charged at $0.003 per 1,000 metrics. With 8.2 million metrics generated monthly, this comes to around $25 per tenant, per month.
S3 Costs: Storing the metrics delivered through Kinesis causes significant spikes in S3 usage, costing a few tens of thousands of dollars per month.
SQS for Notifications: Adds about $3,500 per month to trigger further actions, such as converting metrics to Prometheus format.
Together, these services cost a few hundred thousand dollars per month.
Lambda-Based CloudWatch Exporter
By using AWS Lambda to scrape CloudWatch metrics, the team significantly reduced costs and simplified the monitoring pipeline. Lambda runs on demand, so costs are incurred only when the function is triggered to scrape and process the necessary metrics.
The Lambda-based exporter scrapes only a few critical metrics, such as HealthyHostCount and UnHealthyHostCount in the AWS/NetworkELB namespace. This drastically reduces the number of metrics being processed and eliminates the need for complex, costly services like Kinesis and S3.
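As a rough illustration of scraping only the selected metrics, the sketch below builds a CloudWatch GetMetricData request for the two metrics named above. It is not the team's actual code: the names `CRITICAL_METRICS`, `build_queries`, and `scrape` are hypothetical, the "Maximum" statistic and the five-minute window are assumptions, and the `cloudwatch` argument is assumed to be a boto3 CloudWatch client (`boto3.client("cloudwatch")`).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical allow-list: only the metrics the exporter cares about.
CRITICAL_METRICS = [
    ("AWS/NetworkELB", "HealthyHostCount"),
    ("AWS/NetworkELB", "UnHealthyHostCount"),
]


def build_queries(dimensions, period=60, stat="Maximum"):
    """Build the MetricDataQueries list for a GetMetricData call."""
    queries = []
    for i, (namespace, name) in enumerate(CRITICAL_METRICS):
        queries.append({
            "Id": f"m{i}",  # query Ids must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": name,
                    "Dimensions": dimensions,
                },
                "Period": period,  # seconds
                "Stat": stat,      # assumption: pick the statistic you need
            },
            "ReturnData": True,
        })
    return queries


def scrape(cloudwatch, dimensions, window_minutes=5):
    """Fetch recent datapoints for the allow-listed metrics only."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=window_minutes)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=build_queries(dimensions),
        StartTime=start,
        EndTime=end,
    )
    return response["MetricDataResults"]
```

Here `dimensions` would be a list such as `[{"Name": "LoadBalancer", "Value": "..."}]`, with the value taken from the tenant's own load balancer. Because the request names each metric explicitly, nothing outside the allow-list is ever fetched or paid for.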
How the Lambda Approach Works
On-Demand Metric Scraping: Lambda scrapes CloudWatch metrics on a defined schedule (e.g., every minute). Instead of collecting all possible metrics, it scrapes only the most critical metrics per tenant.
Push to Cribl Stream: Once the metrics are scraped, Lambda pushes them to SQS, from which Cribl Stream consumes them. Stream enriches the data, converts it to Prometheus format, and sends it to a Prometheus destination. Grafana then queries Prometheus for real-time monitoring and visualization.
Simple Automation and Scalability: Lambda functions can be deployed and scaled automatically using an Infrastructure-as-Code tool like Terraform. This allows new tenants to be onboarded without manual setup and keeps the approach manageable as the number of tenants grows.
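The push step described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the exporter's real implementation: `to_records` and `push_to_sqs` are hypothetical helpers, the record layout (tenant, metric, timestamp, value) is invented for illustration, and `sqs` is assumed to be a boto3 SQS client (`boto3.client("sqs")`). The input is assumed to have the shape of the `MetricDataResults` list returned by CloudWatch GetMetricData.

```python
import json


def to_records(tenant, results):
    """Flatten GetMetricData-style results into one record per datapoint."""
    records = []
    for series in results:
        for ts, value in zip(series["Timestamps"], series["Values"]):
            records.append({
                "tenant": tenant,            # hypothetical field names
                "metric": series["Label"],
                "timestamp": ts.isoformat(),
                "value": value,
            })
    return records


def push_to_sqs(sqs, queue_url, records, batch_size=10):
    """Send records to SQS in batches and return how many were submitted.

    SQS accepts at most 10 entries per send_message_batch call.
    This sketch does not retry entries reported under "Failed".
    """
    submitted = 0
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        entries = [
            {"Id": str(start + i), "MessageBody": json.dumps(record)}
            for i, record in enumerate(chunk)
        ]
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        submitted += len(entries)
    return submitted
```

Inside a Lambda handler, the scheduled invocation would scrape the metrics, call `to_records` with the tenant identifier, and hand the result to `push_to_sqs`. Keeping the messages as plain JSON leaves the conversion to Prometheus format to Cribl Stream, as described above.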
Architecture for custom CloudWatch exporter
Cost Savings with Lambda
One of the most significant benefits of using Lambda is cost efficiency. AWS charges $0.01 per 1,000 CloudWatch API requests. Since Lambda scrapes only 10 metrics per minute per tenant, the total number of metrics being processed drops significantly. Here’s a cost estimate:
Metrics Scraped per Tenant: 10 metrics per minute, equating to 432,000 metrics per 30-day month (10 × 60 × 24 × 30).
Cost per Tenant: At $0.01 per 1,000 requests, the cost comes to $4.32 per tenant, per month.
Total Cost: Across all our tenants, the cost comes down to approximately tens of thousands of dollars per month.
Additionally, the cost of hosting the Lambda function itself is minimal, since it stays within the free tier limit in each tenant account.
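The per-tenant estimate above can be reproduced with a few lines of arithmetic. The sketch assumes a 30-day month and uses the $0.01 per 1,000 rate quoted above; `monthly_api_cost` is a hypothetical helper name.

```python
MINUTES_PER_MONTH = 60 * 24 * 30   # 43,200, assuming a 30-day month
PRICE_PER_1000 = 0.01              # USD per 1,000 requests, as quoted above


def monthly_api_cost(metrics_per_minute):
    """Return (metrics requested per month, estimated cost in USD)."""
    requested = metrics_per_minute * MINUTES_PER_MONTH
    return requested, requested / 1000 * PRICE_PER_1000


requested, cost = monthly_api_cost(10)
print(requested, round(cost, 2))  # 432000 4.32
```

Changing the argument shows how the bill scales: the cost grows linearly with the number of metrics scraped per minute, which is why keeping the list of metrics short matters.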
Why Lambda Is a Cost-Effective Solution
Pay-As-You-Go: Unlike long-running services like EC2, Lambda incurs charges only when it executes. This dramatically reduces costs, especially in setups where metrics scraping isn’t required constantly.
Eliminating Redundant Infrastructure: Lambda removes the need for additional services like Kinesis and S3, which previously contributed heavily to monthly costs. The Lambda function optimizes data flow and minimizes storage requirements by scraping only necessary metrics.
Scalable and Automated: Lambda scales automatically with demand, handling more tenants without requiring manual intervention. Automation tools like Terraform can manage new tenant onboarding seamlessly, ensuring that no additional cost is incurred until new tenants generate metrics.
Conclusion
The Lambda-based CloudWatch exporter is an efficient, cost-saving solution for managing cloud metrics in AWS. By scraping only the essential metrics and using AWS Lambda for on-demand execution, the team reduced CloudWatch-related expenses by over 85%. This approach offers a scalable, automated, cost-effective way to monitor AWS environments at scale. With the help of Cribl Stream, you can also enrich metric dimensions before publishing the metrics to Prometheus.