Scaling the Metal: Building a GPU Observability Stack from Scratch

Most people take the easy route with managed Kubernetes and cloud-native monitoring. I didn’t. I built a full GPU observability stack on bare metal. No AWS, no GCP, no safety nets: just five nodes and a mountain of YAML.

The Architecture

The setup spans a 5-node bare-metal cluster (1 master, 4 workers). To simulate a high-performance environment, I’m running 16 simulated NVIDIA A10G GPUs.

The stack is built on four components:

Metric Collection (DCGM Exporter): The core engine. It interfaces directly with the NVIDIA drivers to expose hardware-level telemetry.

Storage & Querying (Prometheus): Scrapes the DCGM endpoints and stores time-series data.

Logging (Loki): Captures the "why" behind the "what," pulling logs from every pod to correlate errors with performance spikes.

Visualization (Grafana): The single pane of glass for real-time health.

Current Cluster Performance

After 12+ hours of continuous, stable metric collection, the numbers are in:

Metric                  Status
Allocated GPU Memory    92.1 GB
Avg. GPU Utilization    46.3%
Uptime (Metrics)        12h 14m
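
Figures like the ones above can be pulled from the Prometheus HTTP API. This is a minimal sketch, not necessarily how the numbers here were produced: it assumes Prometheus is reachable at a hypothetical PROM_URL (for example through a port-forward) and that the exporter publishes its usual DCGM_FI_DEV_GPU_UTIL (percent) and DCGM_FI_DEV_FB_USED (MiB) metrics. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run instant PromQL queries against the Prometheus HTTP API.
# PROM_URL is an assumption; point it at wherever Prometheus is exposed.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query <promql>: GET /api/v1/query and print the raw JSON response.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Average utilization (percent) across every GPU Prometheus knows about.
avg_gpu_util() {
  prom_query 'avg(DCGM_FI_DEV_GPU_UTIL)'
}

# Total framebuffer memory in use, converted from MiB to GiB.
total_gpu_mem_gib() {
  prom_query 'sum(DCGM_FI_DEV_FB_USED) / 1024'
}
```

Calling avg_gpu_util or total_gpu_mem_gib prints the JSON response; the value sits under data.result in that payload.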

Why Bare Metal?

Cloud services hide the complexity, but they also hide the bottleneck. By running this on-prem:

Low Latency: Metrics stay on the local network instead of traveling over the public internet.

Full Control: I control the driver versions, the kernel parameters, and the hardware interrupts.

Cost Efficiency: No "managed service" tax for high-bandwidth GPU telemetry.

Troubleshooting

Check pod status and reason

kubectl get pods -n monitoring
kubectl describe pod <pod-name> -n monitoring
# Look for the Events section at the bottom; it usually tells you why the pod failed

Check pod logs

kubectl logs <pod-name> -n monitoring
kubectl logs <pod-name> -n monitoring --previous
# --previous shows logs from the previous (crashed) container instance
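
With more than a handful of pods, it can help to filter the listing down to the ones that need attention. A small sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS, ...); unhealthy_pods is a hypothetical helper name, not a kubectl feature.

```shell
# unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and
# print name, status, and ready count for every pod that is neither
# Completed nor fully ready and Running.
unhealthy_pods() {
  awk '
    { split($2, ready, "/") }                      # READY column, e.g. "1/2"
    $3 == "Completed" { next }                     # finished jobs are fine
    $3 != "Running" || ready[1] != ready[2] { print $1, $3, $2 }
  '
}

# Usage (requires cluster access):
#   kubectl get pods -n monitoring --no-headers | unhealthy_pods
```

Each line it prints is a pod worth passing to kubectl describe pod and kubectl logs.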
Prometheus Not Scraping GPU Metrics

Check if the DCGM service is reachable

kubectl get svc -n gpu-operator | grep dcgm
kubectl port-forward svc/nvidia-dcgm-exporter -n gpu-operator 9400:9400 &
sleep 2   # give the port-forward a moment to start
curl -s http://localhost:9400/metrics | head -20
kill %1
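
Rather than eyeballing the first 20 lines, the exporter output can be summarized. A rough sketch, assuming the exporter emits DCGM_FI_DEV_GPU_UTIL samples with a UUID label (its usual layout); both function names are hypothetical. A port-forward to a Service reaches a single backing pod, so the result covers only that node's GPUs.

```shell
# gpu_count: read dcgm-exporter /metrics text on stdin and print how many
# distinct GPUs (by UUID label) report DCGM_FI_DEV_GPU_UTIL.
gpu_count() {
  awk '
    index($0, "DCGM_FI_DEV_GPU_UTIL{") == 1 {
      if (match($0, /UUID="[^"]*"/)) seen[substr($0, RSTART, RLENGTH)] = 1
    }
    END { n = 0; for (k in seen) n++; print n }
  '
}

# avg_util_from_metrics: print the mean of all DCGM_FI_DEV_GPU_UTIL samples
# (one decimal place); prints nothing if no samples are present.
avg_util_from_metrics() {
  awk '
    index($0, "DCGM_FI_DEV_GPU_UTIL{") == 1 { sum += $NF; n++ }
    END { if (n) printf "%.1f\n", sum / n }
  '
}

# Usage (with the port-forward from above still running):
#   curl -s http://localhost:9400/metrics | gpu_count
#   curl -s http://localhost:9400/metrics | avg_util_from_metrics
```

A count of 0 means the exporter is reachable but is not publishing GPU samples, which points at the exporter or driver rather than at Prometheus.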
Check scrape config was applied

kubectl get secret -n monitoring | grep prometheus
kubectl describe prometheuses -n monitoring

Check PrometheusRule was created

kubectl get prometheusrule -n monitoring

Check label matches

kubectl get prometheusrule gpu-alerts -n monitoring -o yaml | grep labels -A5
# Must have label: release: kube-prometheus-stack
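
For reference, here is a sketch of what a correctly labeled rule might look like. The rule name gpu-alerts, the namespace, and the release label come from the commands above; the alert name, threshold, duration, and severity are hypothetical placeholders, and the gpu and Hostname labels are assumed to be present on the exporter's metrics.

```shell
# gpu_alert_rule: print an example PrometheusRule manifest on stdout.
# Only the metadata mirrors this post; the alert itself is illustrative.
gpu_alert_rule() {
  cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the Prometheus rule selector
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUHighUtilization          # hypothetical alert name
          expr: DCGM_FI_DEV_GPU_UTIL > 95    # hypothetical threshold
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} above 95% for 10 minutes"
EOF
}

# Usage (requires cluster access):
#   gpu_alert_rule | kubectl apply -f -
```

If the release label is missing or does not match the selector on the Prometheus resource, the operator ignores the rule without reporting an error, which is why the label check above is worth running first.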
 

#Kubernetes  #GPUInfrastructure  #Prometheus  #Grafana  #DevOps  #MLOps  #Infrastructure #Observability  #SelvamaniS