In financial-grade environments, the technology roadmap is not a planning artifact — it is a risk contract. Every prioritization decision that defers a security migration, postpones a runtime upgrade, or accepts technical debt without formal registration is, in practice, an overdraft check against system resilience. This postmortem reconstructs how a sequence of apparently rational prioritization decisions — made under legitimate delivery pressure — culminated in a 4-hour partial outage in a payments processing pipeline, and what we changed structurally so the roadmap stops being the silent vector of incidents.
What Happened: The Anatomy of the Failure
The affected system was a payment event ingestion pipeline built on MSK (managed Kafka), processed by consumers running on EKS with Kubernetes 1.24, and persisting to DynamoDB with streams enabled for downstream analytics via Glue. The failure did not begin with a production alert — it began six months earlier, when an ADR proposing migration to Kubernetes 1.28 was deferred for two consecutive sprints to prioritize product features. Those two sprints became four, and the ADR was silently archived in the backlog.
When AWS announced end of extended support for EKS version 1.24, the platform team opened an urgent ticket. The ticket was triaged as P2 (important, but not urgent) because "the cluster is still running." What was not visible in the ticket was that the aws-load-balancer-controller add-on at the pinned version (v2.4.1) had a known incompatibility with the kernel version of the AMI that would be applied in the next mandatory EKS security patch. That patch was applied automatically during the scheduled maintenance window, and Kafka consumer pods began failing health checks with ECONNREFUSED on the internal ALB endpoint. The failure was silent: no alarm fired immediately, because the CloudWatch alarm evaluated consumer lag over a 5-minute period and was calibrated for sustained degradation, not short-duration connectivity failure.
The result: MSK consumer lag grew from ~200ms to 47 seconds in 18 minutes, triggering downstream circuit breakers and causing transaction rejections at the payment gateway.
Incident Timeline
T-180 days: ADR archived. ADR-047 (EKS 1.24 → 1.28 migration) is deferred under feature pressure and enters 'DEFERRED' state with no mandatory review date or designated owner. No associated risk register entry.
T-45 days: P2 ticket opened. Platform team opens EKS version migration ticket after AWS communication. Ticket is triaged as P2 without add-on compatibility analysis. The aws-load-balancer-controller v2.4.1 add-on remains pinned.
T-0: Security patch applied (02:15 UTC). EKS maintenance window applies updated AMI with new kernel. Kafka consumer pods restart and fail ALB internal health checks. The incompatible add-on cannot correctly register targets in the Target Group.
T+18 min: MSK lag reaches 47s. Consumer lag on topic payments.events.raw (partitions: 12, replication factor: 3) grows from ~200ms to 47s. CloudWatch Alarm does not fire: the threshold is configured at SumOfOffsetLag > 100000 with a 5-minute period, masking initial growth.
T+23 min: Downstream circuit breaker activated. The payment authorization service detects absence of event confirmations for 20s and activates circuit breaker (Hystrix-compatible, threshold: 50% failures in 10s). Transactions begin being rejected with HTTP 503.
T+31 min: Business alert fires. Transaction approval rate dashboard (SLO: 99.5% in 5-min window) fires alert for on-call. On-call engineer begins investigation; first suspect is MSK, not EKS.
T+58 min: Root cause identified. Log correlation via CloudWatch Logs Insights between pod restart events and ALB health check failures points to the incompatible add-on. Manual update of the add-on to v2.6.2 (compatible with new AMI) initiated.
T+4h 12min: Full recovery. After the add-on update, MSK lag drainage, and reprocessing of lost events via DLQ (SQS Dead Letter Queue with 14-day retention), the pipeline returns to nominal state. Lag returns to <500ms.
Root Cause: Governance Debt, Not Technical Failure
The root cause was not the security patch, nor the incompatible add-on, nor the poorly calibrated alarm threshold. The root cause was the absence of a mechanism that made the cost of deferring roadmap decisions visible and traceable. ADR-047 was deferred without a risk register entry, without an owner, without a review date, and without transitive dependency analysis (add-on version × AMI version × Kubernetes version). Each of these gaps individually is manageable; combined, they created an invisible failure window that only materialized when an external event (the automatic patch) removed the last safety margin. In financial-grade environments, this is not bad luck. It is the predictable consequence of treating roadmap governance as bureaucratic process rather than risk control.
The Structure That Failed: ADRs Without Teeth
Architecture Decision Records are frequently treated as retroactive documentation — a record of what was decided, not a prospective control instrument. In this case, the ADR process existed, but had three critical design flaws that rendered it ineffective as a governance mechanism.
First flaw: DEFERRED state without consequence. ADR-047 entered DEFERRED state without generating any risk artifact. A deferred ADR in a financial system should automatically create an entry in the operational risk register with severity proportional to potential impact, a mandatory owner, and a maximum review date. Without this, DEFERRED is functionally equivalent to REJECTED — the decision simply disappears.
Second flaw: absence of transitive dependency analysis. The ADR process did not require mapping dependencies between the proposed decision and the components that would be affected. A Kubernetes version migration should mandatorily include a compatibility matrix: EKS version × managed add-on versions × AMI versions × network configurations (VPC CNI, security groups for pods). This matrix exists in AWS documentation — the problem was that nobody consulted it systematically at the time of deferral.
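To make the idea concrete, here is a minimal sketch of a mechanical compatibility check, not our production tooling. The matrix contents, the kernel labels (kernel-old, kernel-new), and the names Target, COMPAT_MATRIX, and incompatible_addons are all illustrative assumptions; real entries would have to come from the AWS compatibility documentation and the add-on release notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """The platform state a change (upgrade or patch) would produce."""
    eks_version: str
    ami_kernel: str


# Hypothetical matrix: add-on -> pinned version -> (EKS version, AMI kernel)
# pairs known to work. Values are placeholders modeled on this incident,
# not data copied from AWS documentation.
COMPAT_MATRIX = {
    "aws-load-balancer-controller": {
        "v2.4.1": {("1.24", "kernel-old")},
        "v2.6.2": {
            ("1.24", "kernel-old"),
            ("1.24", "kernel-new"),
            ("1.28", "kernel-new"),
        },
    },
}


def incompatible_addons(pinned: dict[str, str], target: Target) -> list[str]:
    """Return pinned add-ons that are not known to be compatible with the target.

    Anything missing from the matrix is reported as incompatible, so an
    unmapped dependency forces a human review instead of passing silently.
    """
    problems = []
    for addon, version in sorted(pinned.items()):
        supported = COMPAT_MATRIX.get(addon, {}).get(version, set())
        if (target.eks_version, target.ami_kernel) not in supported:
            problems.append(f"{addon}@{version}")
    return problems
```

With the pinned v2.4.1 and a target of EKS 1.24 on the new kernel, this placeholder matrix flags the add-on. Treating unknown entries as incompatible is a deliberate design choice: it errs on the side of forcing the review that was skipped when ADR-047 was deferred.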
Third flaw: disconnection between ADRs and maintenance windows. The change management system had no integration with the deferred ADR register. When the EKS maintenance window was scheduled, there was no automatic check for pending ADRs that could be impacted. This integration — trivial to implement via Lambda + EventBridge + DynamoDB — would have surfaced the risk before the window.
Propagation Flow: From Deferred ADR to Production Incident
This diagram shows how a deferred roadmap decision without adequate governance cascades until it causes a production outage — and where controls should have intercepted the failure.
Governance layer (ADR and risk)
- ADR-047: EKS 1.24→1.28
- DEFERRED state: no owner, no date
- Risk register: missing entry
Platform layer (EKS and add-ons)
- EKS 1.24: end-of-support
- aws-lb-controller v2.4.1: pinned
- Maintenance window: AMI kernel patch
AWS data pipeline
- MSK Kafka: 12 partitions, RF 3
- Kafka consumer: EKS pods (failing health checks)
- SQS DLQ: 14-day retention
- DynamoDB: streams enabled
Observability and impact
- CloudWatch alarm: lag threshold, 5-minute period
- Circuit breaker: 50% failures in a 10s window
- Payment gateway: HTTP 503 rejections
Flows
- ADR-047 -> DEFERRED state: deferred without risk
- DEFERRED state -> Risk register: missing entry
- DEFERRED state -> aws-lb-controller: add-on not updated
- Maintenance window -> EKS 1.24: AMI patch applied
- EKS 1.24 -> aws-lb-controller: kernel incompatibility
- aws-lb-controller -> Kafka consumer: health check fails
- Kafka consumer -> MSK Kafka: lag grows to 47s
- Kafka consumer -> SQS DLQ: lost events
- MSK Kafka -> CloudWatch alarm: late alarm (5-minute period)
- MSK Kafka -> DynamoDB: stream interrupted
- MSK Kafka -> Circuit breaker: no confirmations for 20s
- Circuit breaker -> Payment gateway: rejects transactions
Remediation: Roadmap Governance as Risk Control
The immediate remediation was technical: updating the add-on to a compatible version, reprocessing via DLQ, and recalibrating the CloudWatch alarms to detect lag growth in 60-second windows with a threshold of SumOfOffsetLag > 5000 (based on the historical P99 of nominal lag). But the structural remediation was more important, and harder to sell internally.
Mandatory Risk Register for Deferred ADRs. We implemented a GitHub Actions automation that, upon detecting a PR changing an ADR state to DEFERRED, creates an issue in the risk register with four fields: severity (derived from the impact declared in the ADR), owner (mandatory, no default), review_by (maximum 30 days for infrastructure ADRs, 90 days for product ADRs), and affected_components (list of impacted AWS services). A PR that defers an ADR without these fields cannot be merged.
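The core of that gate is a small validation step. The sketch below shows one way it could look in Python, assuming the ADR metadata has already been parsed into a dictionary. The kind field, the function name validate_deferral, and the error messages are hypothetical, and the GitHub Actions wiring is omitted.

```python
from datetime import date, timedelta

# Maximum review horizon per ADR kind, taken from the policy described above.
MAX_REVIEW_DAYS = {"infrastructure": 30, "product": 90}
REQUIRED_FIELDS = ("severity", "owner", "review_by", "affected_components")


def validate_deferral(adr: dict, today: date) -> list[str]:
    """Return policy violations for a deferred ADR; an empty list means it may merge."""
    if adr.get("state") != "DEFERRED":
        return []  # the gate only applies to deferrals

    # Empty strings and empty lists count as missing: there is no default owner.
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if not adr.get(name)]

    kind = adr.get("kind")  # hypothetical field distinguishing ADR types
    if kind not in MAX_REVIEW_DAYS:
        errors.append(f"unknown ADR kind: {kind!r}")

    review_by = adr.get("review_by")
    if isinstance(review_by, date) and kind in MAX_REVIEW_DAYS:
        latest = today + timedelta(days=MAX_REVIEW_DAYS[kind])
        if review_by < today or review_by > latest:
            errors.append(f"review_by must fall between {today} and {latest}")
    return errors
```

A CI step would call this for every ADR touched by the PR and fail the check when the returned list is not empty.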
Compatibility Matrix as Change Management Gate. We created a Lambda triggered by EventBridge when an EKS maintenance window is scheduled via AWS Systems Manager Change Calendar. This Lambda queries the DynamoDB table that stores the deferred ADR register and affected components, and automatically opens a mandatory review ticket if there is an intersection. The operational cost is negligible: the Lambda executes in under 200ms and DynamoDB uses on-demand billing.
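The matching logic inside such a Lambda can stay very small. The following sketch assumes the deferred ADR records have already been read from DynamoDB and the components touched by the window are known. The record shape and the name adrs_needing_review are illustrative, and the EventBridge and ticketing integration is left out.

```python
def adrs_needing_review(deferred_adrs: list[dict], window_components: set[str]) -> list[dict]:
    """Return one review request per deferred ADR that overlaps the maintenance window.

    Each ADR record is assumed to carry 'adr_id' and 'affected_components';
    'owner' is optional so that ownerless ADRs still surface.
    """
    reviews = []
    for adr in deferred_adrs:
        # Set intersection: components named by both the ADR and the window.
        overlap = sorted(set(adr["affected_components"]) & window_components)
        if overlap:
            reviews.append(
                {"adr_id": adr["adr_id"], "owner": adr.get("owner"), "overlap": overlap}
            )
    return reviews
```

For a window touching EKS and the AMI, a deferred ADR-047 listing EKS among its affected components would produce a review request, which is exactly the signal that was missing before the patch.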
Layered Alarms for MSK. We replaced the single lag alarm with a three-layer strategy: (1) an alarm on ConsumerGroupLag with a P95 threshold in a 1-minute window for early detection; (2) a composite alarm correlating lag, consumer error rate, and EKS pod CPU; (3) a business alarm on transaction approval rate with an SLO burn rate alert (error budget consumption in 1h).
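For the third layer, burn rate is the observed failure ratio divided by the failure ratio the SLO allows. A minimal sketch, assuming the 99.5% approval SLO mentioned in the timeline; the function names are hypothetical, and the paging threshold is deliberately left as a parameter because the right value depends on the error budget window.

```python
def burn_rate(failed: int, total: int, slo: float = 0.995) -> float:
    """Error-budget burn rate: observed failure ratio over the allowed failure ratio.

    A value of 1.0 means the budget is being consumed exactly as fast as the
    SLO permits; higher values mean it will be exhausted early.
    """
    if total == 0:
        return 0.0  # no traffic in the window, nothing burned
    allowed_failure_ratio = 1.0 - slo
    return (failed / total) / allowed_failure_ratio


def should_page(failed: int, total: int, threshold: float, slo: float = 0.995) -> bool:
    """Page on-call when the burn rate in the observed window reaches the threshold."""
    return burn_rate(failed, total, slo) >= threshold
```

For example, 50 rejected transactions out of 1,000 against a 99.5% SLO is a burn rate of about 10: the error budget is being consumed roughly ten times faster than the SLO allows.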
The Real Problem: Invisibility of the Deferral Cost
What makes incidents originating in roadmap decisions particularly insidious is that the cost of deferral is deferred in time and distributed across teams — while the benefit of deferral (feature delivery speed) is immediate and concentrated. This asymmetry creates a structural incentive to defer, especially in organizations where engineering metrics are dominated by velocity and lead time, not platform health indicators.
In regulated financial environments, this problem has an additional dimension: unregistered technical debt is, in many compliance frameworks (SOX, BACEN Resolution 4.658, PCI-DSS), an operational risk that should be on the radar of the CISO and CRO, not just the engineering team. When ADR-047 was deferred without a risk register entry, it effectively left the corporate governance radar — and no compensating control was activated.
The solution is not additional bureaucracy — it is making the cost of deferral visible in the same system where the benefit is registered. We implemented a "debt dashboard" in Backstage (our developer experience portal) that shows, for each component, the number of deferred ADRs, the aggregate severity, and the average deferral time. This dashboard is reviewed in quarterly platform reviews with the CTO and product leads. When a critical infrastructure ADR appears with more than 60 days in DEFERRED state, it automatically enters the agenda of the next architecture meeting — not as a suggestion, but as a mandatory item.
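The aggregation behind such a dashboard is simple. This sketch is not the Backstage plugin itself; it assumes ADR records with hypothetical fields (state, kind, severity, deferred_on) and an arbitrary severity weighting, and it computes the three numbers plus the 60-day escalation rule for one component.

```python
from datetime import date

# Hypothetical weights used to roll individual severities into one number.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3, "critical": 5}


def debt_summary(adrs: list[dict], today: date, escalate_after_days: int = 60) -> dict:
    """Summarize deferred ADRs for one component."""
    deferred = [a for a in adrs if a["state"] == "DEFERRED"]
    ages = [(today - a["deferred_on"]).days for a in deferred]

    # Critical infrastructure ADRs deferred for too long become mandatory agenda items.
    escalate = sorted(
        a["adr_id"]
        for a, age in zip(deferred, ages)
        if a["kind"] == "infrastructure"
        and a["severity"] == "critical"
        and age > escalate_after_days
    )
    return {
        "deferred_count": len(deferred),
        "aggregate_severity": sum(SEVERITY_WEIGHT[a["severity"]] for a in deferred),
        "average_deferral_days": sum(ages) / len(ages) if ages else 0.0,
        "escalate_to_architecture_meeting": escalate,
    }
```

Keeping the escalation rule in the same function as the metrics means the dashboard and the meeting agenda cannot drift apart.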
The positive side effect: product teams began actively negotiating the timing of infrastructure ADRs, because they now understand that the cost of deferral has an expiration date — and that they will be in the room when that bill arrives.
Well-Architected Assessment: Impacted Pillars
- security: SEC 10 (Respond to security events). The automatic security patch is the correct behavior; deferring security patches to avoid incompatibilities would be an unacceptable risk trade-off in a financial environment. The problem was the absence of pre-patch compatibility testing in a staging environment with configuration identical to production, including pinned add-on versions. We implemented a pre-maintenance validation pipeline that runs eksctl utils nodegroup-health and validates the compatibility of all managed add-ons against the target version before approving the window.
- reliability: Failure in REL 6 (Monitor workload resources). The MSK lag alarm was calibrated to detect sustained degradation, not abrupt failure. The fix requires layered alarms with 1-minute windows and correlation between infrastructure metrics (EKS pod restarts, ALB target health) and business metrics (lag, approval rate). Additionally, REL 10 (Manage changes in your workload) was violated: the EKS maintenance window had no rollback runbook for add-on incompatibilities, and the change management process did not include verification of pending ADRs.
Anti-Patterns That Contributed to the Incident
- ADR as retroactive documentation: Treating ADRs as historical records instead of prospective control instruments with owner, deadline, and consequence of non-execution.
- Add-on versions pinned without update policy: Pinning aws-load-balancer-controller to a specific version without an automated compatibility verification process against new EKS and AMI versions.
- Single lag alarm without correlation: Monitoring MSK with a single absolute lag alarm in a long window, without correlation with upstream infrastructure metrics (pod health, ALB target registration).
- Staging not mirroring production: Staging environment with different add-on versions than production, making pre-maintenance tests ineffective for detecting incompatibilities.
- Technical debt invisible to business stakeholders: Absence of a mechanism making the accumulated cost of deferred ADRs visible to product owners and leadership, creating incentive asymmetry.
Quantified Incident Impact
- 4h 12min: Total incident duration. From AMI patch application to full payment pipeline recovery.
- 40%: Cost of proactive migration vs. incident cost. The two platform sprints needed for the migration would have cost roughly 40% of the incident's total cost (engineer-hours plus business impact).
- 47s: MSK lag peak on the payments.events.raw topic. Growth from ~200ms to 47s in 18 minutes, below the original alarm threshold for 13 minutes.
Lessons for Architects: The Roadmap as Attack Surface
Architects in financial-grade environments need to internalize that the technology roadmap is an attack surface — not just for external adversaries, but for the entropy of the system itself. Every prioritization decision that is not recorded with its dependencies, risks, and review conditions is a latent vulnerability waiting for the right external event to materialize.
The most counterintuitive lesson from this incident is that the problem was not technical in nature — it was incentive modeling. The engineering team knew the migration was necessary. The platform team had opened the ticket. The risk was implicitly acknowledged. What was missing was a mechanism that made the cost of deferral concrete and shared between the parties that benefited from deferral (product team, with more features delivered) and the parties that absorbed the risk (platform team, infrastructure on-call).
This has direct implications for how architects should structure the ADR process. A well-designed ADR for high-criticality environments should include: (1) transitive dependency analysis with explicit reference to adjacent component versions; (2) validity conditions — events that invalidate the decision and require immediate review (e.g., "this ADR is invalidated if the EKS version enters end-of-support"); (3) estimated cost of deferral in engineer-hours per sprint of delay; (4) owner with explicit accountability, not just nominal responsibility.
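As an illustration, items (1) to (4) map naturally onto a small record type. The sketch below is one possible shape, not our actual template; the class name, field names, and event names are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AdrRecord:
    """Hypothetical ADR metadata for high-criticality environments."""
    adr_id: str
    owner: str                              # (4) a named, accountable owner
    dependencies: dict[str, str]            # (1) adjacent component -> version assumed
    invalidated_by: set[str] = field(default_factory=set)  # (2) events forcing review
    deferral_cost_hours_per_sprint: float = 0.0            # (3) estimated cost of delay

    def is_invalidated(self, observed_events: set[str]) -> bool:
        """True when any observed event matches a declared validity condition."""
        return bool(self.invalidated_by & observed_events)

    def accrued_deferral_cost(self, sprints_deferred: int) -> float:
        """Engineer-hours accumulated so far, assuming a constant cost per sprint."""
        return self.deferral_cost_hours_per_sprint * sprints_deferred
```

An ADR-047 record declaring an event such as "eks-version-end-of-support" in invalidated_by would have been flagged for immediate review the day AWS made its announcement, instead of waiting in the backlog. The linear cost model is a simplification; in practice the cost of deferral tends to grow as dependencies drift.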
Finally: blameless postmortem does not mean postmortem without structural consequence. The difference is that consequences are systemic — process changes, automations, controls — not personal. This incident resulted in three process changes, two new automations, and a complete revision of our ADR template. No engineer was individually held accountable. The system was improved.
Architect's Note: What I Would Do Differently
If I could go back to the moment ADR-047 was marked as DEFERRED, I would have insisted on a single change: the mandatory creation of a risk register entry with severity derived from the impact declared in the ADR and a maximum review date of 30 days, non-negotiable. The hard-won lesson I have learned over 16 years in financial-grade environments is that unregistered technical debt is not debt; it is hidden risk, and hidden risk in payment systems has a cost that inevitably surfaces at the worst possible moment. The automation that correlates deferred ADRs with maintenance windows cost less than one week of engineering work; the incident it would have prevented cost more than 18 engineer-hours and measurable revenue impact. That is the clearest ROI I have ever calculated in platform architecture.
Verdict: Govern the Roadmap Like You Govern Infrastructure
This incident was not caused by a technical failure; it was caused by a governance failure that materialized technically. The technical remediation (add-on update, alarm recalibration, DLQ reprocessing) was necessary but insufficient without the structural changes to the ADR process and the visibility of technical debt to business stakeholders.
The concrete recommendation is this: treat every deferred ADR as a registered operational risk, with owner, severity, review date, and estimated cost of deferral. Automate the correlation between deferred ADRs and infrastructure change events (maintenance windows, security patches, runtime updates). Make the cost of technical debt visible in the same system where delivery velocity is measured. And calibrate your observability alarms to detect abrupt failures in short windows — not just sustained degradation in long windows.
In financial-grade environments, the roadmap is not a planning artifact. It is a risk contract. Govern it as such.
References and Further Reading
- AWS EKS Best Practices Guide — Reliability
- AWS EKS Add-ons Compatibility Matrix
- Amazon MSK — Monitoring Consumer Lag
- AWS Well-Architected Framework — Reliability Pillar
- AWS Systems Manager Change Calendar
- Architecture Decision Records (ADR) — Michael Nygard
- Google SRE Book — Postmortem Culture
- Backstage — Software Catalog for Developer Experience
Originally published at fernando.moretes.com. By Fernando F. Azevedo — Senior Solutions Architect.



