How to build an AI telemetry architecture that serves five teams without duplicate collection

The technical case for collecting AI telemetry once, governing it correctly, and making all of it queryable from one place — regardless of where it lands.

Most AI observability problems aren't instrumentation problems. The OpenTelemetry GenAI SIG has done substantial work: semantic conventions now cover 14 LLM providers, 7 vector databases, and 8 major frameworks. DCGM exposes over 100 hardware metrics per GPU per collection interval. vLLM, Triton, and TensorRT-LLM all publish Prometheus-compatible inference metrics. The signals exist.

The problem is what happens after collection. Five teams need different things from the same underlying events — at different fidelity, different retention timelines, and different cost tiers. Without an architecture that governs how that telemetry flows, you end up with one of two outcomes: everything going to one expensive destination that only one team actually owns, or five separate collection passes running redundantly across the same infrastructure.

Neither gives you complete AI observability. Both are expensive.

This post describes the architecture that does work: an infrastructure layer that collects once, normalizes across heterogeneous schemas, applies policy in-flight, and routes per consumer — with Cribl Search as the unified investigation surface where the complete picture is queryable regardless of where data ultimately lands.

The schema problem: OTel GenAI and GPU metrics aren't aligned yet

Before you can route AI telemetry correctly, you have to deal with the fact that it arrives in inconsistent shapes.

On the LLM side, the OpenTelemetry GenAI semantic conventions define the gen_ai.* namespace: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and span events for prompts and completions. The coverage is real — LangChain, LlamaIndex, LangGraph, CrewAI, and LiteLLM all have instrumentation libraries built against the spec. Bedrock, Azure OpenAI, Anthropic, and Vertex are all covered.

The catch is that the spec is still in Development, not Stable. Attribute names have changed between versions. Event structures have shifted. Providers implement subsets differently. A team that instruments directly against the current spec will see that instrumentation break when the spec evolves, which it will, repeatedly, over the next 12+ months.

On the GPU side, there's no unified standard at all. DCGM exposes metrics in its own namespace. vLLM publishes Prometheus metrics under vllm:*. Triton uses its own naming conventions. TensorRT-LLM is different again. A 1,000-GPU cluster running mixed workloads across these inference frameworks can emit telemetry in four incompatible schemas at the same time.

The infrastructure layer has to normalize across all of this before anything reaches a downstream consumer — mapping divergent attribute names to a consistent internal schema, handling spec version mismatches, and insulating consumers from upstream volatility. When the OTel GenAI spec changes, the normalization rule updates once. No downstream consumer breaks.
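As a rough illustration of what "the normalization rule updates once" can mean, here is a minimal Python sketch of an alias table applied at ingest. The internal names on the right-hand side (gpu.utilization_pct, inference.kv_cache_usage, inference.requests_running) are hypothetical and exist only for this example. The gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens entries reflect earlier GenAI convention names as best understood here; check them against the spec versions you actually receive. This is not how Cribl Stream expresses mappings, only a sketch of the idea.

```python
from typing import Any

# Hypothetical alias table: source attribute name -> internal schema name.
ATTRIBUTE_ALIASES = {
    # Older GenAI convention names (assumed) mapped to the current ones.
    "gen_ai.usage.prompt_tokens": "gen_ai.usage.input_tokens",
    "gen_ai.usage.completion_tokens": "gen_ai.usage.output_tokens",
    # GPU-side names mapped to an illustrative internal namespace.
    "DCGM_FI_DEV_GPU_UTIL": "gpu.utilization_pct",
    "vllm:gpu_cache_usage_perc": "inference.kv_cache_usage",
    "vllm:num_requests_running": "inference.requests_running",
}


def normalize(attributes: dict[str, Any]) -> dict[str, Any]:
    """Return a copy with known aliases renamed; unknown keys pass through."""
    normalized: dict[str, Any] = {}
    for key, value in attributes.items():
        canonical = ATTRIBUTE_ALIASES.get(key, key)
        # If an alias and its canonical name are both present,
        # the value sent under the canonical name wins.
        if key != canonical and canonical in normalized:
            continue
        normalized[canonical] = value
    return normalized
```

When an upstream name changes, only the alias table needs a new entry; every consumer keeps reading the same internal name.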

Ingestion: one collection pass across every source

A complete AI telemetry architecture ingests from four source types:

LLM application spans arrive via OTLP. Cribl Stream receives them natively, extracts the gen_ai.* attributes, and begins normalization. For providers or frameworks with non-standard instrumentation, field-level mappings handle the translation at ingest so downstream consumers see a consistent schema regardless of which provider generated the span.

GPU infrastructure metrics require two paths depending on deployment. For cloud-hosted GPU clusters, Stream scrapes Prometheus-compatible endpoints on vLLM and Triton directly — pulling vllm:num_requests_running, vllm:gpu_cache_usage_perc, and inference throughput metrics alongside DCGM hardware metrics where those endpoints are exposed. For on-premises GPU clusters and regulated environments, Cribl Edge deploys at the node level, pulling DCGM metrics locally and forwarding normalized data outbound — enforcing redaction and enrichment at the source before anything leaves the cluster perimeter.

Provider API telemetry for managed services (Bedrock, Azure OpenAI, Anthropic) that don't support OTLP directly is pulled via Cribl's API source connectors, which normalize billing and usage signals into the same gen_ai.*-aligned schema as instrumented spans.

Network egress data from CASB, DLP, NGFW, and proxy logs closes the shadow AI gap. OTel spans only see instrumented applications. Network egress sees all AI activity — including employees accessing personal AI accounts, unauthorized tools, and consumer services that completely bypass enterprise controls. Ingesting both and correlating them at the infrastructure layer is the only way to surface the full AI footprint.
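The egress-versus-span correlation described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the EgressRecord shape, the short AI_DOMAINS watchlist (chat.ai-tool.example is a placeholder), and the idea that "instrumented" can be reduced to a set of host names that emit gen_ai spans. A real pipeline would correlate on richer keys such as user identity and time.

```python
from dataclasses import dataclass

# Hypothetical watchlist of AI service domains seen in egress logs.
AI_DOMAINS = {"api.openai.com", "api.anthropic.com", "chat.ai-tool.example"}


@dataclass(frozen=True)
class EgressRecord:
    source_host: str
    destination_domain: str


def find_shadow_ai(egress, instrumented_hosts):
    """Return egress to AI domains from hosts that emit no gen_ai spans."""
    return [
        record
        for record in egress
        if record.destination_domain in AI_DOMAINS
        and record.source_host not in instrumented_hosts
    ]
```

Traffic from an instrumented application host is expected and filtered out; the same destination reached from an unmanaged laptop is what gets surfaced.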

One collection pass. No duplicate agents. No redundant scraping.

In-flight policy: what happens before data reaches any destination

Three operations happen in the infrastructure layer before any telemetry reaches a downstream consumer:

PII redaction. The content-capture switch in OTel GenAI instrumentation is binary, and that is a false choice: capture everything (compliance liability) or capture nothing (no forensic signal). In-flight redaction masks sensitive substrings (credentials, PHI, PII, customer records) while preserving the structural signal. The span arrives at every downstream consumer with the sensitive content already masked. No team receives raw prompt content that should never have left the collection layer.

Workload enrichment. GPU metrics arrive from the hardware layer without context. Which team's job was running on that GPU? Which model version? Which product feature? Enrichment at ingest attaches team, workload, model version, and feature tags to every GPU metric before fan-out. FinOps receives pre-attributed data — chargeback is a query against the enriched record, not a data engineering project to reconstruct attribution after the fact.

Cost metric derivation. Token counts in gen_ai.usage.input_tokens and gen_ai.usage.output_tokens are raw signal. Per-request cost metrics — derived from token counts multiplied by model-specific pricing — can be computed in the infrastructure layer at ingest and emitted as separate metrics. FinOps receives cost directly without needing to join billing APIs with telemetry after the fact.
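The three operations above can be sketched as small Python functions. This is a sketch under stated assumptions, not Cribl's implementation: the two regex patterns are deliberately simplistic, the GPU_WORKLOADS lookup and its keys are hypothetical, and the PRICE_PER_1K figures are made-up numbers rather than any provider's real pricing.

```python
import re

# Illustrative patterns only; real redaction needs a far broader rule set.
REDACTION_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical lookup: GPU UUID -> workload attribution tags.
GPU_WORKLOADS = {
    "GPU-aaaa": {"team": "search", "model_version": "v3", "feature": "autocomplete"},
}

# Hypothetical prices in USD per 1,000 tokens; not real provider pricing.
PRICE_PER_1K = {"example-model": {"input": 0.50, "output": 1.50}}


def redact(text):
    """Mask sensitive substrings and report which rule names fired."""
    fired = []
    for name, pattern in REDACTION_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            fired.append(name)
    return text, fired


def enrich(metric):
    """Attach workload tags to a GPU metric, keyed by its GPU UUID."""
    tags = GPU_WORKLOADS.get(metric.get("gpu_uuid"), {})
    return {**metric, **tags}


def request_cost(span):
    """Derive per-request cost from token counts, or None if the model is unpriced."""
    prices = PRICE_PER_1K.get(span.get("gen_ai.request.model"))
    if prices is None:
        return None
    input_tokens = span.get("gen_ai.usage.input_tokens", 0)
    output_tokens = span.get("gen_ai.usage.output_tokens", 0)
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1000
```

Returning the list of fired rule names alongside the masked text is one way to keep a security-relevant signal without keeping the sensitive content itself. For production billing, a decimal type would be a safer choice than floats.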

Fan-out: per-consumer routing at the right fidelity and cost

With normalized, enriched, redacted telemetry flowing through the infrastructure layer, fan-out routes data per consumer at the fidelity and cost tier each destination actually needs:

SRE / APM: Latency, TTFT, error rate, and GPU utilization anomalies — near-real-time, low-cardinality aggregates. High-volume raw trace data doesn't belong here; pre-aggregated signals do.

Security / SIEM: Security-relevant events (PII match flags, anomalous model access, already-masked prompts that triggered redaction rules, shadow AI egress correlations) routed at SIEM-appropriate fidelity. The SIEM gets what it needs for detection and alerting without absorbing the full volume of LLM traces it doesn't use.

FinOps: Pre-aggregated token and GPU cost metrics tagged by workload, model, team, and feature. Hourly or daily rollups rather than per-request detail. Cost-tiered storage since FinOps queries aggregate data, not individual traces.

Eval platforms (LangSmith, Arize, etc.): High-quality sampled traces — the subset most useful for model evaluation — rather than full-volume firehose. Eval platforms are priced per trace; sending everything is expensive and largely redundant for quality analysis.

Cribl Lake: Full-fidelity archive of everything — every prompt, completion, retrieval document, tool call, GPU metric, and network egress event — at object-storage economics. This is the layer where retention is measured in years, not days.

The same underlying events reach all five destinations in one collection pass. Only Cribl Lake holds the full-fidelity copy; no premium-priced tool stores data that only another team uses.
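The routing rules above can be read as a function from one normalized event to a list of destinations. The Python sketch below assumes a hypothetical event shape with a kind field and an optional security_flags list; the destination names are labels for this example only. Random sampling stands in for the "high-quality sampled traces" selection, which in practice would likely use quality or coverage criteria rather than a flat rate.

```python
import random


def route(event, eval_sample_rate=0.05, rng=random.random):
    """Return the destinations a normalized, redacted event should reach."""
    destinations = ["lake"]  # the full-fidelity archive receives everything
    kind = event.get("kind")
    if kind == "sre_aggregate":
        destinations.append("apm")  # pre-aggregated latency, TTFT, error rate
    elif kind == "cost_rollup":
        destinations.append("finops")  # hourly or daily cost rollups
    elif kind == "trace" and rng() < eval_sample_rate:
        destinations.append("eval")  # only a sampled subset of traces
    if event.get("security_flags"):
        destinations.append("siem")  # any flagged event, regardless of kind
    return destinations
```

For example, route({"kind": "trace", "security_flags": ["pii_match"]}) always includes "lake" and "siem", and includes "eval" only when the sample draw falls under the rate.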

Cribl Search: where the complete picture comes together

Routing telemetry to the right destinations solves the cost and governance problem. It doesn't solve the investigation problem.

When a security analyst needs to correlate an anomalous prompt with the GPU workload running at the same time, the data is in two different destinations. When an ML engineer needs to understand whether a quality regression preceded or followed a specific infrastructure change, the trace is in Lake and the GPU state is in the APM tool. When FinOps needs to understand which specific agent runs drove last month's cost spike, the cost metrics are in the FinOps tool and the traces that explain them are somewhere else.

The complete picture requires querying across destinations — not stitching exports together manually.
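To make the first case concrete: correlating a prompt with "the GPU workload running at the same time" amounts to a time-window join between two result sets that live in different stores. The Python sketch below shows only that join, over hypothetical in-memory records with a timestamp field and an assumed 30-second window. It says nothing about how any query engine implements it.

```python
from datetime import datetime, timedelta


def gpu_state_near(span_time, gpu_samples, window=timedelta(seconds=30)):
    """Return GPU samples whose timestamp is within +/- window of the span."""
    return [
        sample
        for sample in gpu_samples
        if abs(sample["timestamp"] - span_time) <= window
    ]
```

Doing this by hand means exporting both result sets first, which is the manual stitching the sentence above argues against.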

Cribl Search provides two engines that cover the full data landscape. The lakehouse engine ingests directly for near-real-time investigation — no additional routing required for data you want fast access to. The federated engine queries data in place across every destination where AI telemetry already lives: Cribl Lake, S3, Azure Blob, and the proprietary stores of tools like Datadog, Splunk Software, Elastic, and New Relic — without moving data, without rehydration, and without requiring those tools to export anything.

A security analyst investigating a shadow AI incident queries across the SIEM's data, the network egress record, and the Lake archive in a single search. An ML engineer debugging a quality regression pulls the full trace from Lake and correlates it with GPU state from the infrastructure metrics store. Copilot translates natural language questions to KQL — no query language expertise required.

The architecture described in this post — one collection pass, in-flight policy, per-consumer fan-out — is what makes the infrastructure layer work. Cribl Search is what makes the complete picture visible.

What this looks like in practice

The before state: four teams running separate collection passes on the same GPU cluster. DCGM scraped four times. Conflicting schemas in four tools. A spec update breaks all four consumers simultaneously. Shadow AI invisible to all of them.

The after state: one collection pass via Stream and Edge. DCGM scraped once, normalized, enriched with workload attribution, and fanned out to every consumer. OTel GenAI spans normalized across provider schema differences and spec versions. Shadow AI surfaced by correlating egress data with instrumented spans. Every consumer receives exactly what it needs at the cost tier appropriate to its use case. And Cribl Search gives every team the ability to investigate the complete picture, not just the slice that was routed to its tool.

That's the architecture. One collection pass. Every silo queryable. The complete picture, finally in one place.

Want to walk through what this architecture looks like for your AI stack? Talk to us.
