Most people take the easy route with managed Kubernetes and cloud-native monitoring. I didn’t. I built a full GPU Observability Stack on bare metal. No AWS, no GCP, no safety nets—just five nodes and a mountain of YAML.
The Architecture : The setup spans a 5-node bare-metal cluster (1 master, 4 workers). To simulate a high-performance environment, I’m running 16x simulated NVIDIA A10G GPUs.
The stack is built on three pillars:Metric Collection (DCGM Exporter): The core engine. It interfaces directly with the NVIDIA drivers to expose hardware-level telemetry.
Storage & Querying (Prometheus): Scrapes the DCGM endpoints and stores time-series data.
Logging (Loki): Captures the "why" behind the "what," pulling logs from every pod to correlate errors with performance spikes.
Visualization (Grafana): The single pane of glass for real-time health.
Current Cluster Performance :
After 12+ hours of continuous, stable metric collection, the numbers are in:
MetricStatusAllocated : GPU Memory 92.1 GB
Avg. GPU Utilization 46.3%
Uptime (Metrics) 12h 14m
Why Bare Metal?
Cloud services hide the complexity, but they also hide the bottleneck. By running this on-prem:
Zero Latency: Metrics don't travel over the public internet.
Full Control: I control the driver versions, the kernel parameters, and the hardware interrupts.
Cost Efficiency: No "managed service" tax for high-bandwidth GPU telemetry.
Troubleshooting : Check pod status and reason
kubectl get pods -n monitoring kubectl describe podCheck pod logs-n monitoring # Look for: Events section at the bottom — it tells you exactly why it failed
kubectl logsPrometheus Not Scraping GPU Metrics .Check if DCGM service is reachable-n monitoring kubectl logs -n monitoring --previous # --previous shows logs from crashed container
kubectl get svc -n gpu-operator | grep dcgm kubectl port-forward svc/nvidia-dcgm-exporter -n gpu-operator 9400:9400 & curl http://localhost:9400/metrics | head -20 kill %1Check scrape config was applied
kubectl get secret -n monitoring | grep prometheus kubectl describe prometheuses -n monitoringCheck PrometheusRule was created
kubectl get prometheusrule -n monitoringCheck label matches
kubectl get prometheusrule gpu-alerts -n monitoring -o yaml | grep labels -A5
# --Must have label: release: kube-prometheus-stack
#Kubernetes #GPUInfrastructure #Prometheus #Grafana #DevOps #MLOps #Infrastructure #Observability #SelvamaniS
