Blog image for How to Monitor NVIDIA GPU Metrics with Cribl Edge & Stream (Complete Tutorial)

How to Monitor NVIDIA GPU Metrics with Cribl Edge & Stream (Complete Tutorial)

Last edited: August 13, 2025

If you’re running AI, ML, or data-intensive workloads on GPUs, monitoring their performance is critical. Overheating, under-utilization, or memory bottlenecks can cost you thousands in cloud bills and potential downtime.

This guide walks you through collecting real-time GPU telemetry using nvidia-smi, sending it to Cribl Edge, routing it through Cribl Stream, and using Cribl Search to analyze the data—step by step.

GPU Monitoring

Why GPU Monitoring Matters

GPUs power AI inference, deep learning training, and high-performance data pipelines. Without visibility into:

  • GPU utilization

  • Temperature

  • Memory usage

  • Power draw

you risk inefficiencies, high costs, and unexpected failures.

Cribl Edge + Stream provides a flexible observability platform to capture and route this data easily.


Step 1: Spin Up an NVIDIA GPU Instance

For this tutorial:

  • AWS instance: g5.xlarge (with NVIDIA A10G 24 GB GPU)

  • AMI: Deep Learning Base OSS Nvidia Driver GPU AMI (Amazon Linux 2023)

    AMI ID: ami-00b530caaf8eee2c5 (as of July 2025)

After launching your instance:

  • Connect via SSH from your terminal OR use the AWS Console's EC2 Instance Connect (browser-based shell).


Step 2: Install Cribl Edge

In your Cribl Cloud org, navigate to Edge → Add a Node. Copy the provided installation command. It should look similar to this:

Code example
curl 'https://your-org.cribl.cloud/init/install-edge.sh?group=default_fleet&token=xXxXx&user=cribl&user_group=cribl&install_dir=%2Fopt%2Fcribl' | bash -

Run this in your GPU instance shell (either via SSH or AWS Console). Once installed, confirm that the node appears in Edge Leader → default_fleet.


Step 3: Capture GPU Metrics with Exec Source

In your Cribl Edge UI:

  • Navigate to Fleets → default_fleet → Collect.

  • Add a new Source of type Exec.

    nvidia-smi

Use this command in the Command field:

Code example
nvidia-smi \ --query-gpu=timestamp,index,name,uuid,driver_version,pci.bus_id,persistence_mode,power.draw,power.limit,temperature.gpu,fan.speed,utilization.gpu,utilization.memory,utilization.encoder,utilization.decoder,memory.total,memory.used,memory.free,clocks.sm,clocks.mem \ --format=csv,noheader,nounits | \ while IFS=, read -r ts idx name uuid drv pci persist pdraw plimit temp fan ugpu umem uenc udec mtot mused mfree csm cmem; do iso_ts=$(date -d "$ts" --utc +%Y-%m-%dT%H:%M:%SZ) echo "{ \"timestamp\":\"$iso_ts\", \"index\":$idx, \"name\":\"$name\", \"uuid\":\"$uuid\", \"driver_version\":\"$drv\", \"pci_bus_id\":\"$pci\", \"persistence_mode\":\"$persist\", \"power_draw\":$pdraw, \"power_limit\":$plimit, \"temperature_gpu\":$temp, \"fan_speed\":$fan, \"utilization_gpu\":$ugpu, \"utilization_memory\":$umem, \"utilization_encoder\":$uenc, \"utilization_decoder\":$udec, \"memory_total\":$mtot, \"memory_used\":$mused, \"memory_free\":$mfree, \"clocks_sm\":$csm, \"clocks_mem\":$cmem }" done

What This Command Does

  • Runs nvidia-smi to pull GPU stats.

  • Converts the timestamp to ISO 8601 UTC.

Outputs data as JSON for easy ingestion into Cribl.


Step 4: Send Data to Cribl Stream and Lake

  • Connect your Exec Source to an HTTP destination that sends logs to port 10020 on the Cribl Stream service.

    promote-gpu-fields
  • Route this data to Cribl Lake for historical storage and easy searching.

Quickconnect

Step 5: Analyze and Visualize Metrics

Go to Cribl Search in your Cloud org. You can query GPU metrics like:

  • temperature_gpu (for overheating trends)

  • power_draw vs power_limit

  • memory_used and memory_free

  • utilization_gpu for load analysis

gpu memory dashboard

From here, you can:

  • Build dashboards in your preferred tool.

  • Create alerts when temperatures exceed thresholds.


Final Thoughts

With Cribl Edge and Stream, you get:

  • ✅ Real-time GPU observability

  • ✅ Flexible routing & enrichment

  • ✅ No vendor lock-in

Ready to monitor and optimize your NVIDIA GPUs?
👉 Sign up for Cribl Cloud and start today!

Or download the GitHub gist Here!

Cribl, the Data Engine for IT and Security, empowers organizations to transform their data strategy. Customers use Cribl’s suite of products to collect, process, route, and analyze all IT and security data, delivering the flexibility, choice, and control required to adapt to their ever-changing needs.

We offer free training, certifications, and a free tier across our products. Our community Slack features Cribl engineers, partners, and customers who can answer your questions as you get started and continue to build and evolve. We also offer a variety of hands-on Sandboxes for those interested in how companies globally leverage our products for their data challenges.

More from the blog

get started

Choose how to get started

See

Cribl

See demos by use case, by yourself or with one of our team.

Try

Cribl

Get hands-on with a Sandbox or guided Cloud Trial.

Free

Cribl

Process up to 1TB/day, no license required.