
Stop wasting tokens: Real-time LLM cost analytics inside your telemetry pipeline

Last edited: December 2, 2025

Monitoring AI interactions has become a hot topic, but actually doing that monitoring is messy. AI conversations hit every layer of the stack. The logs are scattered across application servers, API gateways, and model endpoints. Different formats across different systems, with no single source of truth.

None of the existing tools, from Anthropic’s console to Langfuse to OpenAI’s dashboard, fully solves the problem.

Meanwhile, you need answers to critical questions:

  • Which conversations are expensive?

  • Who has the highest token usage across my users?

  • Which models are performing the best?

  • Which prompts should we summarize, or store in a cache?

  • How do we audit responses for compliance and security?

Right now you might be running 2-3 different monitoring tools just to piece this together. But here’s the thing: your AI logs are already flowing through your observability stack. You’re just not processing them.

At CriblCon, I built a demo for our new Insights feature (It’s amazing. Check it out here). Insights gives you visibility into your entire Cribl deployment - data flow, freshness, bottlenecks. To showcase this, I needed to add latency to the pipeline.

The issue with Cribl Stream’s out-of-the-box functions is that they’re optimized for fast processing by design. I needed something slow.

Cribl Stream has a Code function that lets you process data with custom JavaScript. It’s powerful, but it comes with a warning I give every customer: if your code doesn’t perform well, it can cause delays across your entire telemetry pipeline. We don’t recommend it unless you absolutely need it. For a demo showcasing latency, though, it’s perfect. I wrote a deliberately bad function with a hardcoded delay; a rough sketch is below. Running it, Insights showed the bottleneck. Mission accomplished. But then I realized: what if I actually analyzed the AI conversation logs flowing through the pipeline? Real-time cost tracking, quality scoring, optimization recommendations, all in one function.
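This is roughly what that deliberately slow function body looked like. Treat it as a sketch rather than the exact demo code; the 200 ms figure and the delayed_ms field are illustrative. (In a Cribl Stream Code function, __e refers to the current event.)

Code example
// Deliberately bad: busy-wait so every event takes ~200 ms to process,
// which shows up as an obvious bottleneck in Insights.
const start = Date.now();
while (Date.now() - start < 200) {
  // burn CPU on purpose; real functions should never block like this
}
// Tag the event so the added delay is visible downstream.
__e['delayed_ms'] = Date.now() - start;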

I built it (me and Claude). People loved it. Let me show you what it does.

From Raw Logs to Instant Intelligence:

Here is a realistic conversation from our demo environment. A user named Fernando interacts with an LLM about expense reimbursement. This is what happens when that conversation flows through the pipeline.

Before: What you get from your application logs:

Code example
{ "_raw": "{\"status\":200,\"timestamp\":\"2025-11-21T02:06:26Z\", \"model\":\"mistral:7b\", \"email\":\"fgarcia@criblcoffee.io\", \"assistant\":{ \"message\":\"I can help you with your expense_reimbursement request. Can you provide me with some basic information to get started?\" }, \"user\":{ \"message\":\"Hi, I need help with my expense_reimbursement. My name is fernando garcia and my email is fgarcia@criblcoffee.io\" }, \"request_time\":1, \"streaming\":true}", "host": "54.183.251.206", "_time": 1763690786 }

What you can see:

  • A chat happened.

  • Which model processed the request, how long it took, and the user's email address.

What you can't see:

  • How much did this interaction cost?

  • Was the user satisfied? Is this a good use of this model? For example, let's say we were using GPT-5.1 for this query, but it's a simple support question. Could GPT-4.1-mini handle it at a fraction of the cost? Without per-conversation enrichment, you're guessing.

  • Should we be caching these queries?

Answering those questions means we need more data.

After: Enrichment in real time through a single pipeline

Code example
{ "_conversation_summary": "SUPPORT | mistral:7b | neutral sentiment | partial_success outcome | 0.0000 | 49 tokens | Grade: C", "_cost_analysis": { "cost_usd": { "total": 0.000049, "per_token": 0.000001 }, "projections": { "hourly": 0.18, "daily": 4.23, "monthly": 127.01 } }, "_response_quality": { "score": 40, "grade": "C", "quality_factors": { "addressed_question": 10, "appropriate_length": 20, "personalization": 10, "actionable_response": 0, "empathy_shown": 0 } }, "_conversation_intelligence": { "detected_intent": "support", "intent_confidence": 25, "topic_keywords": ["expense_reimbursement", "help", "fernando"] }, "_optimization_opportunities": [{ "type": "caching_opportunity", "recommendation": "Common query that could be cached", "cache_hit_potential": "high" }], "_processing_metadata": { "processing_time_ms": 0 } }

This new, enriched payload gives us more to work with. Let’s break it down.

Insights:

  • Cost intelligence:

    • This conversation cost $0.000049. The same query on GPT-4: $0.0045, roughly 90x more expensive. At the current rate, you’re spending $127/month on basic support queries. The pipeline tracks costs across 16 models in real time and projects monthly spend. If 15% of your users are burning 80% of your budget, you’ll see it, and you can route those users to cheaper models or implement rate limiting. No surprises on the AWS bill.

  • Quality problem:

    • The function immediately flagged the conversation as subpar. The user provided their name and email, but the AI responded with “Can you provide me with some basic information to get started?” - completely ignoring what Fernando just said. Relevance score: 21%. No actionable response. No empathy. This conversation will take 2-3 more exchanges to resolve, which means more tokens burned and an even worse user experience. The function catches these in real-time so you can route them to quality review.

  • Optimization opportunity:

    • The function flagged this query with high cache potential. “Expense reimbursement” queries are common enough that caching makes sense. Caching stores results so you don’t recompute them: a query comes in, the system checks whether it has seen something similar before, and if so it serves the saved response instantly instead of calling the model (see the sketch after this list). LLM inference is slow and expensive; caching lets you pay once and serve the answer again and again. Run the numbers: if you’re processing 1,000 of these per day, you’re spending $0.049/day with 1,000 ms response times. Implement semantic caching at an 80% hit rate and you drop to roughly $0.01/day with 50 ms responses on cache hits. That’s an 80% cost reduction and 20x faster responses; same answers, better experience.
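To make that concrete, here’s a minimal exact-match cache keyed on a normalized prompt. The normalize and hashPrompt helpers and the in-memory Map are assumptions for illustration, not part of the pack; a production setup would more likely use semantic similarity and an external store with TTLs.

Code example
const crypto = require('crypto');

// Illustrative in-memory cache; a real deployment would use Redis or similar.
const responseCache = new Map();

// Normalize so trivially different phrasings hash to the same key.
function normalize(prompt) {
  return prompt.toLowerCase().replace(/\s+/g, ' ').trim();
}

function hashPrompt(prompt) {
  return crypto.createHash('sha256').update(normalize(prompt)).digest('hex');
}

// Return a cached answer when we've seen this prompt before; otherwise call
// the model and store the result for next time.
async function answerWithCache(prompt, callModel) {
  const key = hashPrompt(prompt);
  if (responseCache.has(key)) {
    return { answer: responseCache.get(key), cached: true }; // ~50 ms path
  }
  const answer = await callModel(prompt); // ~1,000 ms path
  responseCache.set(key, answer);
  return { answer, cached: false };
}

At an 80% hit rate only one query in five reaches the model, which is where the drop from $0.049/day to roughly $0.01/day comes from.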

How does it work?

This function analyzes every conversation for:

  • Cost & Efficiency

    • Real-time cost across 16 models (OpenAI, Anthropic, Llama, Mistral)

    • Monthly projections, token efficiency, waste detection

  • Quality Monitoring

    • Response grading (A-D) based on relevance, length, and personalization (see the sketch after this list)

    • Performance ratings, response tracking

  • User Intelligence

    • Sentiment analysis, urgency detection, confusion indicators

    • Multi-language detection and intent classification

  • Optimization

    • Model downgrade opportunities (“Use GPT-3.5 instead, save 90%”)

    • Caching recommendations

    • Real-time alerts on high costs, negative sentiment, slow responses.
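For the quality piece, here’s a minimal sketch of the kind of grading logic this implies: a few keyword and length heuristics summed into a score, then mapped to a letter grade. The specific checks, point values, and thresholds are my own illustrative assumptions and won’t exactly reproduce the enriched event shown earlier.

Code example
// Heuristic response grading: pattern checks summed into a score, then a grade.
// Weights and thresholds here are illustrative, not the pack's exact logic.
function gradeResponse(userMsg, assistantMsg) {
  const name = (userMsg.match(/my name is (\w+)/i) || [])[1];
  const quality_factors = {
    // Does the reply mention the topic the user asked about?
    addressed_question: /expense|reimburs|invoice|refund/i.test(assistantMsg) ? 30 : 0,
    // Not a one-liner, not a wall of text.
    appropriate_length: assistantMsg.length >= 40 && assistantMsg.length <= 800 ? 20 : 0,
    // Does it reuse details the user already provided (e.g. their name)?
    personalization: name && assistantMsg.toLowerCase().includes(name.toLowerCase()) ? 20 : 0,
    // Does it tell the user what happens next?
    actionable_response: /next step|i('| wi)ll|please (attach|submit|send)/i.test(assistantMsg) ? 20 : 0,
    // Any acknowledgement of the user's situation?
    empathy_shown: /sorry|i understand|thanks for/i.test(assistantMsg) ? 10 : 0,
  };
  const score = Object.values(quality_factors).reduce((a, b) => a + b, 0);
  const grade = score >= 80 ? 'A' : score >= 60 ? 'B' : score >= 40 ? 'C' : 'D';
  return { score, grade, quality_factors };
}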

Here’s the pricing table at the core of it:

Code example
const MODEL_COSTS = {
  // OpenAI Models (Current as of Nov 2025)
  'gpt-5.1': { input: 0.00125, output: 0.01 },
  'gpt-5-mini': { input: 0.00025, output: 0.002 },
  'gpt-5-nano': { input: 0.00005, output: 0.0004 },
  'gpt-4.1': { input: 0.002, output: 0.008 },
  'gpt-4.1-mini': { input: 0.0004, output: 0.0016 },
  'gpt-4o': { input: 0.0025, output: 0.01 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
  'gpt-4': { input: 0.03, output: 0.06 },
  'gpt-4-turbo': { input: 0.01, output: 0.03 },
  'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
  'gpt-3.5-turbo-16k': { input: 0.003, output: 0.004 },

  // Anthropic Models (Updated to Claude 4/4.5 - Nov 2025)
  'claude-opus-4.1': { input: 0.015, output: 0.075 },
  'claude-sonnet-4.5': { input: 0.003, output: 0.015 },
  'claude-sonnet-4': { input: 0.003, output: 0.015 },
  'claude-haiku-4.5': { input: 0.0008, output: 0.004 },
  'claude-haiku-3.5': { input: 0.0008, output: 0.004 },

  // Legacy Claude 3 models (deprecated but may still be in use)
  'claude-3-opus': { input: 0.015, output: 0.075 },
  'claude-3-sonnet': { input: 0.003, output: 0.015 },
  'claude-3-haiku': { input: 0.00025, output: 0.00125 }, // Old pricing
  'claude-2.1': { input: 0.008, output: 0.024 },

  // Open Source Models (infrastructure cost estimates)
  'llama2:7b': { input: 0.0001, output: 0.0001 },
  'llama2:13b': { input: 0.0002, output: 0.0002 },
  'llama2:70b': { input: 0.0007, output: 0.0007 },
  'mixtral-8x7b': { input: 0.0003, output: 0.0003 },

  // Default fallback
  'default': { input: 0.001, output: 0.001 }
};

No external API calls. No ML models. Just smart pattern matching and math.
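The pattern matching is keyword-driven. As a rough illustration of how fields like detected_intent, intent_confidence, and topic_keywords could be derived (the keyword lists and confidence formula below are assumptions, not the pack’s exact rules):

Code example
// Keyword-based intent detection: no ML, just pattern matching.
const INTENT_KEYWORDS = {
  support: ['help', 'issue', 'problem', 'reimbursement', 'refund'],
  sales: ['pricing', 'quote', 'buy', 'upgrade'],
  general: ['question', 'info', 'how do i'],
};

function detectIntent(userMsg) {
  const text = userMsg.toLowerCase();
  let best = { detected_intent: 'unknown', intent_confidence: 0, topic_keywords: [] };
  for (const [intent, keywords] of Object.entries(INTENT_KEYWORDS)) {
    const hits = keywords.filter((k) => text.includes(k));
    const confidence = Math.round((hits.length / keywords.length) * 100);
    if (confidence > best.intent_confidence) {
      best = { detected_intent: intent, intent_confidence: confidence, topic_keywords: hits };
    }
  }
  return best;
}

// "Hi, I need help with my expense_reimbursement..." scores as 'support'
// because it hits 'help' and 'reimbursement'.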

Token estimation uses statistical analysis. English words average about 1.3 tokens, and punctuation adds roughly 0.3 tokens per mark. Multiply the estimated tokens by your model’s per-1,000-token rates from the table above and you get cost per conversation. The function then projects spend per hour, day, and month from the request rate. For this example the request took 1 second and cost $0.000049; at 3,600 requests/hour that’s $0.18/hour, $4.23/day, and $127/month. Now you know exactly where your spend is headed.
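Here’s that math as a standalone sketch. It uses the MODEL_COSTS table above, treating the rates as USD per 1,000 tokens; the function names and rounding details are illustrative rather than the pack’s exact code.

Code example
// Estimate tokens from text, then compute cost and spend projections.
function estimateTokens(text) {
  const words = (text.match(/\S+/g) || []).length;
  const punctuation = (text.match(/[.,!?;:]/g) || []).length;
  return Math.round(words * 1.3 + punctuation * 0.3);
}

function analyzeCost(userMsg, assistantMsg, model, requestTimeSec) {
  const pricing = MODEL_COSTS[model] || MODEL_COSTS['default'];
  const inputTokens = estimateTokens(userMsg);
  const outputTokens = estimateTokens(assistantMsg);
  const total =
    (inputTokens / 1000) * pricing.input + (outputTokens / 1000) * pricing.output;

  // Project spend assuming this request rate is sustained back to back.
  const requestsPerHour = 3600 / Math.max(requestTimeSec, 1);
  const hourly = total * requestsPerHour;
  return {
    cost_usd: { total, per_token: total / (inputTokens + outputTokens) },
    projections: { hourly, daily: hourly * 24, monthly: hourly * 24 * 30 },
  };
}

// The example event: ~49 tokens at the default $0.001/1K rate is about
// $0.000049, which at 3,600 requests/hour projects to $0.18/hour, $4.23/day,
// and roughly $127/month.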

Every conversation gets this enrichment, automatically, in real-time, with no additional tools.

What you need to get started:

  • A Cribl Stream tenant (start a free trial if needed)

  • AI conversation logs with “assistant.message” and “user.message” fields

  • 20 minutes to deploy. Really only 5; the other 15 is a buffer for setting up the playground.

  • Get it now

The pack repo includes a complete pipeline with the code function, and sample data to test with. Install it in your Cribl Stream environment and start analyzing conversations now.

Coming soon: Official Cribl Packs release with more modular components and easier customization options.

Want to contribute?

  • Fork the repo and submit a PR. Once the pack is merged into the Cribl Packs organization, I’ll review contributions there. We’re especially interested in new model pricing data, optimization patterns, and novel dashboard ideas.

Questions? Find me on the Cribl Community Slack @aaron or open an issue on GitHub.

Cribl, the Data Engine for IT and Security, empowers organizations to transform their data strategy. Customers use Cribl’s suite of products to collect, process, route, and analyze all IT and security data, delivering the flexibility, choice, and control required to adapt to their ever-changing needs.

We offer free training, certifications, and a free tier across our products. Our community Slack features Cribl engineers, partners, and customers who can answer your questions as you get started and continue to build and evolve. We also offer a variety of hands-on Sandboxes for those interested in how companies globally leverage our products for their data challenges.
