A Five-Year Retrospective as a Site Reliability Engineer at Stripe: Kubernetes 1.32 and SLOs
I still remember my first day as a Site Reliability Engineer (SRE) at Stripe five years ago. The office (then still mostly in-person) was buzzing with talk of scaling payment rails for global Black Friday traffic, and the infrastructure team was midway through a migration from hand-tuned EC2 instances to early Kubernetes 1.18 clusters. Today, as we roll out Kubernetes 1.32 across our production fleet and refine SLO frameworks that now govern 98% of our user-facing services, it’s a fitting time to reflect on the lessons, wins, and hard edges of building reliable payments infrastructure at scale.
Year 1-2: From EC2 to Kubernetes 1.20 — Learning the Payments Baseline
My first two years were spent embedded with the Payments Acceptance team, responsible for the APIs that process millions of checkout requests every minute at peak. We were still running a hybrid stack: legacy EC2 workers handling long-tail payment methods, and early Kubernetes 1.18 clusters for core card processing. The learning curve was steep: Stripe's SRE org doesn't just maintain infrastructure, it partners with product teams to define reliability targets that balance user experience with engineering velocity.
One of my first projects was migrating our fraud detection workers to Kubernetes 1.20. We hit early snags with pod scheduling latency during traffic spikes, which forced us to dig into the kube-scheduler’s scoring plugins and tune our node affinity rules. That project taught me a core Stripe SRE tenet: infrastructure changes are only as good as the observability backing them. We shipped custom Prometheus exporters for payment method success rates alongside the migration, which became the foundation for our early SLO work.
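For readers curious what those exporters looked like in spirit, here's a minimal sketch in Python using prometheus_client. The metric and label names are illustrative, not our production schema, and the event loop is a stand-in for a real event stream:

```python
# Minimal sketch of a payment success-rate exporter; names are illustrative.
import random
import time

from prometheus_client import Counter, start_http_server

# One counter, labeled by payment method and outcome, lets PromQL compute
# success rate as a ratio of the "success" series to the total.
PAYMENT_ATTEMPTS = Counter(
    "payment_attempts_total",
    "Payment attempts by method and outcome",
    ["method", "outcome"],
)

def record_attempt(method: str, succeeded: bool) -> None:
    outcome = "success" if succeeded else "failure"
    PAYMENT_ATTEMPTS.labels(method=method, outcome=outcome).inc()

if __name__ == "__main__":
    start_http_server(9105)  # expose /metrics for Prometheus to scrape
    while True:  # stand-in loop; a real exporter consumes real payment events
        record_attempt(random.choice(["card", "ach_debit", "sepa_debit"]),
                       random.random() < 0.999)
        time.sleep(0.01)
```

The success-rate SLI then falls out as a PromQL ratio, something like `sum(rate(payment_attempts_total{outcome="success"}[5m])) / sum(rate(payment_attempts_total[5m]))`.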
Year 3-4: Standardizing SLOs Across 100+ Services
By year three, Stripe had grown to over 100 microservices supporting everything from Connect payouts to Terminal in-person payments. Each team was defining reliability metrics ad hoc: some tracked 99th percentile latency, others focused on error rates, and few tied metrics to user impact. My team was tasked with building a centralized SLO framework.
We landed on a three-tier SLO structure: global user-facing SLOs (e.g., 99.99% success rate for card charges), service-level SLOs (e.g., 99.9% of fraud checks complete in <200ms), and infrastructure SLOs (e.g., 99.95% of Kubernetes nodes are ready). Error budgets were tied to team sprint velocity: if a team burned more than 50% of their monthly error budget, feature launches were paused until reliability improved. The pushback was real at first — product teams hated the guardrails — but after a Q3 outage where SLO alerts caught a misconfigured ingress controller 12 minutes faster than our old monitoring, adoption shifted.
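To make the 50% rule concrete, the check boils down to arithmetic you can sketch in a few lines of Python. The threshold mirrors the launch-pause policy above; every other number here is illustrative:

```python
# Back-of-the-envelope error-budget check; the 50% threshold mirrors the
# launch-pause rule described above, all other numbers are illustrative.

def error_budget_status(slo: float, total_requests: int, failed_requests: int) -> dict:
    allowed_failures = (1.0 - slo) * total_requests  # the window's error budget
    burn = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": round(allowed_failures),
        "budget_burned": burn,           # fraction of the budget consumed
        "pause_launches": burn > 0.5,    # the 50% rule
    }

# A 99.99% charge-success SLO over a month with a billion requests leaves
# room for 100,000 failures; 62,000 failures burns 62% of the budget.
print(error_budget_status(slo=0.9999, total_requests=1_000_000_000,
                          failed_requests=62_000))
```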
Year 5: Kubernetes 1.32 and the Next Era of Reliability
This past year has been defined by our rollout of Kubernetes 1.32, the latest stable release, across all production clusters. 1.32 brought two game-changing capabilities for our workloads: beta support for Dynamic Resource Allocation (DRA) for our GPU-accelerated fraud models, and the now-stable Pod failure policy for batch Jobs (GA since 1.31), which cut our payout processing retry rate by 18%. We also leveraged 1.32's improved kubelet metrics to fine-tune our node autoscaling, reducing idle cluster capacity by 22% while maintaining SLO compliance.
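For context, the "idle capacity" figure we tracked is roughly the gap between what pods request and what nodes can allocate. Here's a hedged sketch of that calculation using the Kubernetes Python client; it is not our production tooling, and the quantity parsing covers only plain cores and millicores:

```python
# Hedged sketch: approximate idle CPU as 1 - (sum of pod CPU requests /
# sum of node allocatable CPU). Quantity parsing handles "8" and "7910m" only.
from kubernetes import client, config  # pip install kubernetes

def cpu_millicores(quantity: str) -> int:
    return int(quantity[:-1]) if quantity.endswith("m") else int(float(quantity) * 1000)

def idle_cpu_fraction() -> float:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    allocatable = sum(
        cpu_millicores(node.status.allocatable["cpu"]) for node in v1.list_node().items
    )
    requested = 0
    for pod in v1.list_pod_for_all_namespaces().items:
        if pod.status.phase in ("Succeeded", "Failed"):
            continue  # finished pods no longer hold capacity
        for container in pod.spec.containers:
            requests = (container.resources.requests or {}) if container.resources else {}
            if "cpu" in requests:
                requested += cpu_millicores(requests["cpu"])
    return 1.0 - requested / allocatable

if __name__ == "__main__":
    print(f"idle CPU fraction: {idle_cpu_fraction():.1%}")
```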
But the 1.32 migration wasn't without hiccups. A removed API that our metrics-server deployment still depended on broke our custom SLO dashboards for 45 minutes in staging before we caught it. That incident led us to build a pre-migration API compatibility checker that now runs against all cluster upgrades, tied directly to our SLO alerting pipeline.
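The checker is straightforward in spirit: render every manifest, then flag apiVersion/kind pairs that the target release removes. A simplified Python sketch follows; the removal table lists one real 1.32 removal (the flowcontrol v1beta3 API) and is deliberately not exhaustive:

```python
# Simplified pre-upgrade checker: scan rendered manifests for API versions
# the target release removes. The table below is illustrative, not exhaustive.
import sys

import yaml  # pip install pyyaml

REMOVED_APIS = {
    "1.32": {
        ("flowcontrol.apiserver.k8s.io/v1beta3", "FlowSchema"),
        ("flowcontrol.apiserver.k8s.io/v1beta3", "PriorityLevelConfiguration"),
    },
}

def find_removed(paths: list[str], target: str = "1.32") -> list[tuple[str, str, str]]:
    hits = []
    for path in paths:
        with open(path) as f:
            for doc in yaml.safe_load_all(f):  # manifests are often multi-document
                if not isinstance(doc, dict):
                    continue
                key = (doc.get("apiVersion"), doc.get("kind"))
                if key in REMOVED_APIS.get(target, set()):
                    hits.append((path, key[1], key[0]))
    return hits

if __name__ == "__main__":
    for path, kind, api in find_removed(sys.argv[1:]):
        print(f"{path}: {kind} still uses {api}, removed in 1.32")
```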
Key Lessons From 5 Years of Stripe SRE Work
- SLOs are only useful if they map to user pain: We stopped tracking "infrastructure uptime" early on in favor of "successful payment rate" — the only metric that actually matters to Stripe users.
- Kubernetes upgrades are reliability work, not just maintenance: Every 1.x upgrade has shipped features that let us cut costs, reduce toil, or improve SLO compliance. Treating upgrades as first-class engineering work pays off.
- SRE is a partnership, not a gatekeeper role: The most successful SLO implementations came from co-designing targets with product teams, not imposing them top-down.
What’s Next?
As I look ahead to year six, our team is focused on extending SLOs to our edge network and evaluating the ecosystem's early support for WebAssembly workloads (today delivered through container runtime shims rather than Kubernetes itself) for low-latency payment edge processing. Five years in, the core mission hasn't changed: keep the payments flowing, reliably, for millions of businesses worldwide. The tools evolve, but the focus on user impact stays the same.