Kubernetes Observability in 30 Minutes: Prometheus, Grafana, and Custom Alerts That Actually Tell You Something
You deployed your app to Kubernetes. Pods are running. Services are up. Everything looks green. But then someone asks: "How do you know it's actually healthy?"
That question hits different when your only answer is kubectl get pods. Running containers ≠ working application. Real observability means you can answer: Are requests succeeding? Is the database slow? Is the cache doing anything? When something breaks at 3 AM, what tells you before your users do?
This tutorial walks through setting up production-grade Kubernetes observability from scratch: Prometheus for metrics collection, Grafana for dashboards, and Alertmanager for notifications, all deployed via GitOps. By the end, you'll have a system that doesn't just show green bars but tells you meaningful stories about your application's health.
Why Observability Matters (And Why Dashboards Aren't Enough)
Most Kubernetes tutorials stop at deployment. They show you how to get containers running but skip the part where you actually understand what's happening inside them. CloudWatch shows node metrics. kubectl top shows resource usage. Neither tells you that your cache hit rate dropped to 20% or that database inserts are taking 3x longer than usual.
The difference between monitoring and observability is the difference between a smoke alarm and a dashboard that tells you which room is on fire, how fast it's spreading, and whether the sprinklers are working. You need:
- Application metrics: request rates, error rates, latencies, cache performance
- Infrastructure metrics: CPU, memory, disk, network per pod and node
- Correlation: the ability to see that latency spikes when the cache miss rate climbs
- Alerting: proactive notification when things go wrong, not just dashboards you remember to check
Prometheus + Grafana + Alertmanager gives you all four. Let's build it.
Prerequisites
- A running Kubernetes cluster (EKS, GKE, AKS, or minikube; any works)
- kubectl configured and pointing at your cluster
- ArgoCD installed (for GitOps deployment) or willingness to apply manifests directly
- A Node.js application to instrument (I'll show the pattern; adapt it to your stack)
- Basic familiarity with Kubernetes resources (Deployments, Services, ConfigMaps)
Step 1: Instrument Your Application
Before Prometheus can scrape anything, your app needs to emit metrics. For Node.js, the prom-client library makes this straightforward.
Install the dependency:
npm install prom-client
Create a metrics module (src/metrics.js):
const client = require('prom-client');
const register = new client.Registry();
// Default metrics (GC, event loop, memory, etc.)
client.collectDefaultMetrics({ register });
// HTTP request duration histogram
const httpRequestDuration = new client.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
registers: [register],
});
// Database query duration histogram
const dbQueryDuration = new client.Histogram({
name: 'db_query_duration_seconds',
help: 'Duration of database queries in seconds',
labelNames: ['operation', 'table'],
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
registers: [register],
});
// Cache operations counter
const cacheOperations = new client.Counter({
name: 'cache_operations_total',
help: 'Total cache operations',
labelNames: ['operation', 'result'],
registers: [register],
});
// Active database connections gauge
const dbConnectionsActive = new client.Gauge({
name: 'db_connections_active',
help: 'Number of active database connections',
registers: [register],
});
module.exports = {
register,
httpRequestDuration,
dbQueryDuration,
cacheOperations,
dbConnectionsActive,
};
Add HTTP middleware to track every request (src/middleware.js):
const { httpRequestDuration } = require('./metrics');
function metricsMiddleware(req, res, next) {
// Exclude the /metrics endpoint itself from tracking
if (req.path === '/metrics') return next();
const end = httpRequestDuration.startTimer();
res.on('finish', () => {
end({
method: req.method,
route: req.route?.path || req.path,
status_code: res.statusCode,
});
});
next();
}
module.exports = { metricsMiddleware };
Track database and cache operations wherever they occur:
const { dbQueryDuration, cacheOperations, dbConnectionsActive } = require('./metrics');
// Database wrapper
async function query(operation, table, fn) {
const end = dbQueryDuration.startTimer({ operation, table });
try {
const result = await fn();
return result;
} finally {
end();
}
}
// Cache tracking
async function cacheGet(key) {
const result = await redis.get(key);
cacheOperations.inc({
operation: 'get',
result: result !== null ? 'hit' : 'miss',
});
return result;
}
async function cacheSet(key, value) {
await redis.set(key, value);
cacheOperations.inc({ operation: 'set', result: 'success' });
}
Expose the metrics endpoint (src/server.js):
const { register } = require('./metrics');
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
Hit http://localhost:3000/metrics and you should see ~100 lines of Prometheus-format metrics: counters, histograms, gauges, all labeled and ready to scrape.
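For reference, the exposition format is plain text. The cache counter defined earlier would appear roughly like this (values illustrative):

```
# HELP cache_operations_total Total cache operations
# TYPE cache_operations_total counter
cache_operations_total{operation="get",result="hit"} 42
cache_operations_total{operation="get",result="miss"} 7
```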
Step 2: Deploy the Observability Stack
The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Alertmanager, Node Exporter, and kube-state-metrics: everything you need in one deploy.
If you're using ArgoCD, create an Application manifest (monitoring/argocd-app.yaml):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: monitoring
namespace: argocd
spec:
project: default
source:
repoURL: https://prometheus-community.github.io/helm-charts
chart: kube-prometheus-stack
targetRevision: "58.0.0"
helm:
values: |
grafana:
service:
type: LoadBalancer
adminPassword: admin
prometheus:
prometheusSpec:
retention: 7d
resources:
requests:
memory: 512Mi
destination:
server: https://kubernetes.default.svc
namespace: monitoring
syncPolicy:
automated:
prune: true
selfHeal: true
Or install directly with Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set grafana.service.type=LoadBalancer \
--set grafana.adminPassword=admin \
--set prometheus.prometheusSpec.retention=7d
Wait 2-3 minutes for all pods to come up:
kubectl get pods -n monitoring -w
You should see: prometheus-monitoring-0, monitoring-grafana-..., alertmanager-monitoring-0, and several exporters.
Step 3: Connect Prometheus to Your Application
This is where most people get stuck. The instinct is to use additionalScrapeConfigs in the Helm values. Don't. The correct approach is a ServiceMonitor, a CRD that the Prometheus Operator watches to discover scrape targets automatically.
First, make sure your application's Service has a named port:
apiVersion: v1
kind: Service
metadata:
name: gitops-api
namespace: three-tier
labels:
app: gitops-api
spec:
selector:
app: gitops-api
ports:
- name: http # This name matters!
port: 80
targetPort: 3000
Then create the ServiceMonitor (monitoring/servicemonitor.yaml):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: gitops-api
namespace: monitoring
labels:
release: monitoring # Must match your Helm release label
spec:
selector:
matchLabels:
app: gitops-api
namespaceSelector:
matchNames:
- three-tier
endpoints:
- port: http
path: /metrics
interval: 15s
The release: monitoring label is critical: the Prometheus Operator uses it to discover ServiceMonitors. If this label doesn't match your Helm release name, Prometheus silently ignores your ServiceMonitor and you'll spend hours debugging why targets aren't showing up.
Apply it:
kubectl apply -f monitoring/servicemonitor.yaml
Verify the target is discovered in Prometheus: open the Prometheus UI (kubectl port-forward svc/prometheus-operated 9090 -n monitoring) and navigate to Status β Targets. You should see your application endpoint with state "UP".
Step 4: Build Meaningful Dashboards
Skip the pretty-but-useless dashboards. Build ones that tell a story. Here's a layout that covers the three layers that matter:
Row 1 β HTTP Layer (Is the API serving traffic?)
{
"title": "Request Rate",
"type": "timeseries",
"targets": [
{ "expr": "sum(rate(http_request_duration_seconds_count[5m])) by (route)", "legendFormat": "{{route}}" }
]
}
Add panels for Error Rate (rate(...{status_code=~"5.."}) / rate(...) as percentage) and P95 Latency (histogram_quantile(0.95, rate(..._bucket[5m])) by route).
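Spelled out against the metric names from Step 1 (and consistent with the alert expressions used later), those two queries would look roughly like:

```promql
# Error rate as a percentage of all requests
100 * sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
    / sum(rate(http_request_duration_seconds_count[5m]))

# P95 latency per route
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))
```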
Row 2 β Data Layer (Is the backend keeping up?)
- Requests by Status (pie chart: 200/201/404/503 distribution)
- Cache Hit/Miss Ratio (pie chart: cache_operations_total{operation="get",result="hit"} vs. result="miss")
- DB Query Duration P95 (by operation type: inserts vs. selects)
- Active DB Connections (gauge, 0-10 scale, yellow at 6, red at 8)
Row 3 β Infrastructure (Do we need more resources?)
- DB Queries per Second (insert and select rates)
- Pod Memory Usage
- Pod CPU Usage
Save the dashboard JSON as a ConfigMap and deploy it via ArgoCD so it's version-controlled and reproducible:
apiVersion: v1
kind: ConfigMap
metadata:
name: app-dashboard
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
app-dashboard.json: |
{YOUR_DASHBOARD_JSON_HERE}
Grafana's sidecar automatically picks up ConfigMaps with the grafana_dashboard label and imports them.
Step 5: Set Up Alerts That Don't Cry Wolf
Seven custom alert rules across four categories are enough to catch real problems without paging you at 3 AM over a blip:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: app-alerts
namespace: monitoring
labels:
release: monitoring
spec:
groups:
- name: api-health
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
/ sum(rate(http_request_duration_seconds_count[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate above 5%"
description: "API error rate is {{ $value | humanizePercentage }}"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "P95 latency above 2s"
description: "Route {{ $labels.route }} P95 latency is {{ $value }}s"
- name: database
rules:
- alert: SlowDatabaseQueries
expr: |
histogram_quantile(0.95,
sum(rate(db_query_duration_seconds_bucket[5m])) by (le, operation)
) > 0.5
for: 10m
labels:
severity: warning
- alert: ConnectionPoolExhaustion
expr: db_connections_active > 8
for: 5m
labels:
severity: critical
annotations:
summary: "DB connection pool nearing exhaustion"
- name: cache
rules:
- alert: LowCacheHitRate
expr: |
sum(rate(cache_operations_total{operation="get",result="hit"}[5m]))
/ sum(rate(cache_operations_total{operation="get"}[5m])) < 0.5
for: 15m
labels:
severity: warning
annotations:
summary: "Cache hit rate below 50%"
- name: pods
rules:
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: critical
- alert: PodMemoryPressure
expr: |
container_memory_working_set_bytes
/ container_spec_memory_limit_bytes > 0.85
for: 10m
labels:
severity: warning
Key design principles for alerts:
- Always use for: clauses. A 30-second spike shouldn't wake you up. for: 5m means the condition must persist for five minutes before the alert fires.
- Make thresholds meaningful. A 5% error rate is objectively bad; a 2-second P95 latency means real users are suffering. Don't alert on 1% error rates or you'll burn out on noise.
- Group by severity. critical means page someone now; warning means investigate in the morning.
Step 6: Configure Alertmanager Routing
Alertmanager decides where alerts go. A basic config that routes critical alerts to Slack and warnings to email:
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-monitoring-config
namespace: monitoring
type: Opaque
stringData:
alertmanager.yaml: |
route:
receiver: slack
group_by: [alertname, severity]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: slack-urgent
repeat_interval: 1h
- match:
severity: warning
receiver: email
receivers:
- name: slack
slack_configs:
- api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
channel: '#monitoring'
- name: slack-urgent
slack_configs:
- api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
channel: '#incidents'
- name: email
email_configs:
- to: oncall@yourcompany.com
from: alertmanager@yourcompany.com
smarthost: smtp.yourcompany.com:587
Real-World Scenarios
Scenario 1: The Silent Degradation
Your cache hit rate slowly drops from 85% to 40% over two hours. No pods crash. No 500 errors. But database query latency triples because every request now hits Postgres instead of Redis. The LowCacheHitRate alert catches this 15 minutes in, long before users notice.
Scenario 2: The Connection Leak
A new deploy introduces a database connection leak. Active connections climb from 3 to 8 over 10 minutes. The ConnectionPoolExhaustion alert fires once connections exceed 8, and you roll back before the pool hits 10 and the app becomes unresponsive.
Scenario 3: The Noisy Neighbor
Another team deploys a memory-hungry job on the same node. Your pod's memory pressure crosses 85%. The PodMemoryPressure warning gives you time to request a node with more capacity or move the workload before OOMKill hits.
FAQ / Troubleshooting
Q: Prometheus shows my target as "DOWN" or missing entirely.
A: Check three things: (1) Does your Service have a named port (name: http, not just port: 80)? ServiceMonitors reference ports by name. (2) Does your ServiceMonitor have the release label matching your Helm release? (3) Is the namespace in namespaceSelector.matchNames correct?
Q: My ServiceMonitor exists but metrics aren't appearing in Grafana.
A: Go to Prometheus UI β Status β Targets. If the target isn't listed, the ServiceMonitor isn't being picked up. Check label selectors. If it's listed but showing errors, the /metrics endpoint might not be reachable from the cluster network.
Q: Alerts aren't firing.
A: Check Prometheus UI β Alerts. Are the rules loaded? Is the expression evaluating? Test your PromQL directly in the query bar. Common mistake: metric names with typos or label mismatches.
Q: Dashboard shows "No data."
A: Verify the data source is configured (Grafana β Settings β Data Sources β Prometheus). Check that the namespace in your query matches where the metrics are. Use {namespace="three-tier"} to scope queries.
Q: What resources does this stack need?
A: For a small cluster (< 20 nodes), budget: Prometheus 512Mi-1Gi RAM, Grafana 256Mi, Alertmanager 128Mi. The Helm chart defaults are reasonable for dev/test. Increase retention and memory for production.
Conclusion
Observability isn't a nice-to-have; it's the difference between guessing and knowing. The combination of application-level metrics (prom-client), infrastructure metrics (node exporter, kube-state-metrics), powerful querying (PromQL), visualization (Grafana), and proactive alerting (Alertmanager) gives you a complete picture of your system's health.
The whole stack deploys via GitOps: push to main, ArgoCD syncs. No manual dashboard creation. No ad-hoc alert rules. Everything version-controlled, reproducible, and auditable.
Start with the metrics that answer "is it working?" (request rate, error rate, latency). Add depth from there: cache performance, database health, resource pressure. Let the alerts do the watching so you don't have to.
Your future self, the one getting paged at 3 AM, will thank you.