One Org or Many? The Postmortem Nobody Wants to Write

Sometime in 2023, a mid-sized financial institution I worked with consolidated all its business units under a single AWS Organization to simplify cost governance. Eighteen months later, a Service Control Policy mistake applied at the root node silenced the production payments pipeline for 47 minutes. The postmortem revealed something most architects know intuitively but rarely document rigorously: Organizations topology is not a FinOps decision — it is a blast radius decision.

What Happened: Context and Organizational Pressure

The pressure came from above. The CFO wanted consolidated cost visibility, the CISO wanted a single policy enforcement point, and the platform team — already stretched thin — wanted to reduce the number of landing zone automation pipelines. The obvious solution seemed to be: one Organization, multiple OUs, hierarchical SCPs. The reasoning was defensible on a whiteboard.

The problem started when the security team needed to block access to certain AWS regions for LGPD/GDPR compliance. The SCP was drafted with an aws:RequestedRegion condition denying all regions outside sa-east-1 and us-east-1. The test was run against a sandbox OU. Approval was granted. The deploy went to the root node — not the production OU, the root.
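A minimal sketch of the kind of region-restriction SCP described above, with a rough check of which calls it would block. The exempted global-service actions and the helper names are illustrative assumptions, not the policy used in the incident, and `would_deny` models only this one statement shape rather than full IAM evaluation.

```python
import fnmatch

ALLOWED_REGIONS = ["sa-east-1", "us-east-1"]  # regions named in the article
# Assumed exemption list for services that are not region-scoped (illustrative only).
EXEMPT_ACTIONS = ["iam:*", "organizations:*", "sts:*", "support:*"]


def build_region_deny_scp(allowed_regions, exempt_actions):
    """Return an SCP document that denies requests outside allowed_regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(allowed_regions)
                    }
                },
            }
        ],
    }


def would_deny(scp, action, region):
    """Approximate whether the SCP above denies `action` requested in `region`."""
    for stmt in scp["Statement"]:
        if stmt["Effect"] != "Deny":
            continue
        exempt = any(
            fnmatch.fnmatchcase(action.lower(), pattern.lower())
            for pattern in stmt.get("NotAction", [])
        )
        allowed = stmt["Condition"]["StringNotEquals"]["aws:RequestedRegion"]
        if not exempt and region not in allowed:
            return True
    return False
```

Running `would_deny` for `dynamodb:GetItem` in `us-east-2` against this document reproduces the failure mode of the incident: the call is denied regardless of what the caller's IAM role allows.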

What nobody had explicitly mapped: the payments account used us-east-2 for a DynamoDB Global Tables endpoint serving as a low-latency read replica. The SCP blocked dynamodb:GetItem and dynamodb:Query calls originating from Lambda functions in that account to that region. The Python SDK (boto3) with exponential retry masked the error for roughly 8 minutes before the circuit breaker in the payment authorization service tripped. The alert arrived via a CloudWatch Alarm with a 5xx error threshold on API Gateway — but the runbook pointed to the authorization service, not the data layer.

Incident Timeline

T+00:00 — SCP applied at root node — Security engineer runs aws organizations attach-policy pointing to Root ID instead of the sandbox OU ID. No scope validation in the IaC pipeline (Terraform) blocked the operation — the policy already existed, only the target changed.
T+00:08 — First silent SDK errors — boto3 with max_attempts=5 and exponential backoff begins absorbing AccessDeniedException from DynamoDB calls in us-east-2. The 15s Lambda timeout has not yet been reached. No active alarm.
T+00:11 — Circuit breaker trips in authorization service — The payment authorization service, deployed on EKS with Resilience4j, opens the circuit after 10 consecutive failures in 30s. Transactions begin returning HTTP 503 to API Gateway.
T+00:14 — CloudWatch Alarm fires — Alarm PaymentAPI-5xxRate > 1% sends notification to SNS → PagerDuty. On-call engineer receives the alert. Initial runbook points to the EKS authorization service.
T+00:22 — Initial misdiagnosis — Engineer checks EKS pods, Resilience4j logs, CPU/memory metrics. All normal. Escalates to tech lead. No CloudTrail correlation yet.
T+00:31 — CloudTrail and Organizations correlation — Tech lead queries CloudTrail Lake with an Athena query filtering errorCode = AccessDeniedException in the last 60 minutes. Identifies pattern in DynamoDB calls from us-east-2. Cross-references Organizations API and finds the root-level attach-policy.
T+00:38 — SCP detached from root node — Senior security engineer runs aws organizations detach-policy. Change propagation takes approximately 4 minutes across all affected accounts.
T+00:47 — Service restored — Circuit breaker closes after successful health checks. Error rate returns to < 0.1%. Incident closed. Total duration: 47 minutes of partial degradation, 36 minutes of complete unavailability for new payments.
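The correlation step at T+00:31 can be approximated offline as a grouping of access-denied records by account, region, and service. This is a sketch under assumptions: the input is a list of CloudTrail-style records already exported as dictionaries (using the standard `errorCode`, `recipientAccountId`, `awsRegion`, and `eventSource` fields), and the function name is hypothetical.

```python
from collections import Counter

# Error codes treated as authorization failures; services differ in which one they emit.
ACCESS_DENIED_CODES = {"AccessDenied", "AccessDeniedException"}


def summarize_access_denied(records):
    """Count access-denied records per (account, region, service), most frequent first."""
    counts = Counter()
    for record in records:
        if record.get("errorCode") in ACCESS_DENIED_CODES:
            key = (
                record.get("recipientAccountId"),
                record.get("awsRegion"),
                record.get("eventSource"),
            )
            counts[key] += 1
    return counts.most_common()
```

A sudden cluster on a single region and service, as in this incident, points at a policy change rather than an application fault.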

Root Cause: Unrestricted Blast Radius by Design: The root cause was not the human error of targeting the root node — human errors are inevitable. The root cause was architectural: a single AWS Organization with no isolation boundary between regulated workloads (payments, PCI-DSS) and operational workloads (security, tooling). Any SCP applied at the Root affects all accounts simultaneously, with no staging, no canary, no automatic rollback. The design turned a routine security policy change into a change with organization-level blast radius. In financial-grade environments, this is unacceptable.

Topology: Single-Org Blast Radius vs. Multi-Org Isolation

The diagram compares the pattern that caused the incident (left) with the remediated architecture (right). Red edges indicate unrestricted SCP propagation; green edges indicate isolation boundary with controlled promotion.

🔴 Problematic Pattern: Single Org

Root Management Account (security)
SCP: DenyRegion applied at Root (security)
OU: Security Tooling Accounts (security)
OU: Payments PCI-DSS Accounts (compute)
DynamoDB us-east-2 replica (data)

🟢 Remediated Pattern: Multi-Org with Isolation

Org: Operations Security & Tooling (security)
Org: Payments PCI-DSS Boundary (compute)
SCP: DenyRegion scoped to Ops Org (security)
SCP: PCI Controls independent lifecycle (security)
RAM / PrivateLink cross-org sharing (network)
CloudTrail Lake centralized (delegated) (data)

Flows

scp-bad -> root: attached at Root
root -> ou-sec: unrestricted inheritance
root -> ou-pay: full blast radius
ou-pay -> ddb-replica: blocked by SCP
scp-ops -> org-ops: isolated scope
scp-pay -> org-pay: independent lifecycle
org-ops -> ram-share: controlled sharing
org-pay -> ram-share: access via PrivateLink
org-ops -> ct-lake: delegated logs
org-pay -> ct-lake: delegated logs

Why the Single-Org vs. Multi-Org Decision Is Fundamentally an Isolation Decision

There is a dominant narrative that multiple Organizations increase operational complexity, and it is partially true. But it obscures what is actually being traded. In a single Organization, the root node and top-level OUs are change surfaces where a policy edit propagates almost instantly, with no native progressive rollout or rollback mechanism. AWS Organizations has no concept of a "canary SCP deploy". When you apply an SCP at the Root, it takes effect within seconds to a few minutes across every member account under that root, whether that is 100 accounts or 2,000.

In financial-grade environments with multiple regulatory regimes — PCI-DSS for payments, SOC 2 for data operations, BACEN 4.893 for cyber resilience in Brazil — the temptation to use a single Organization with specialized OUs is understandable. The operational reality is that PCI-DSS compliance controls require network and policy isolation that is cleaner with a real Organization boundary, not just an OU boundary.

The Organization boundary provides: (1) SCP policies with completely independent lifecycles; (2) separate Management Account credentials, reducing the blast radius of high-privilege credential compromise; (3) cost visibility that can still be unified by aggregating each Organization's Cost and Usage Reports, although consolidated billing itself remains per Organization; (4) CloudTrail Lake with a delegated administrator that can aggregate logs from multiple Organizations into a single S3 data store with Athena, maintaining centralized visibility without policy coupling.

The real cost of multiple Organizations is duplication of landing zone automation — and that cost is addressable with an Account Vending Machine based on Control Tower customizations and shared IaC pipelines via cross-account CodePipeline.

Technical Remediation: What We Changed After the Incident

The remediation was not simply "move payments to a new Organization". That would have taken weeks and required re-onboarding dozens of accounts. The remediation was layered, with immediate impact first and structural refactoring afterward.

Immediate (week 1): We implemented a guardrail SCP at the root node that explicitly denies organizations:AttachPolicy for any target that is the Root ID (r-xxxx) or any tier-1 OU containing production workloads. The condition uses aws:ResourceTag combined with an Environment=Production tag on OUs, kept consistent through an Organizations tag policy. Because SCPs do not restrict principals in the Management Account itself, this guardrail only constrains delegated administration paths, so access to the Management Account still has to be limited through IAM. This does not resolve the structural problem, but adds a protection layer against the specific error that occurred. The Terraform pipeline was updated with a precondition that validates the target type before any attach-policy.
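The target-type precondition can be sketched in a few lines. This is an illustration in Python rather than the Terraform precondition the team wrote; the function names, the `protected_ou_ids` parameter, and the `approved_override` flag are hypothetical. The ID formats follow the patterns documented for the Organizations API (roots start with `r-`, OUs with `ou-`, accounts are 12 digits).

```python
import re

ROOT_RE = re.compile(r"^r-[0-9a-z]{4,32}$")
OU_RE = re.compile(r"^ou-[0-9a-z]{4,32}-[0-9a-z]{8,32}$")
ACCOUNT_RE = re.compile(r"^\d{12}$")


def classify_target(target_id):
    """Return 'root', 'ou', or 'account' for an Organizations target ID."""
    if ROOT_RE.match(target_id):
        return "root"
    if OU_RE.match(target_id):
        return "ou"
    if ACCOUNT_RE.match(target_id):
        return "account"
    raise ValueError(f"unrecognized target ID format: {target_id!r}")


def validate_attach_target(target_id, protected_ou_ids=frozenset(), approved_override=False):
    """Reject attachments to the root or to protected OUs unless explicitly approved."""
    kind = classify_target(target_id)
    if approved_override:
        return kind
    if kind == "root":
        raise ValueError("attaching an SCP to the organization root requires explicit approval")
    if kind == "ou" and target_id in protected_ou_ids:
        raise ValueError(f"{target_id} is a protected production OU")
    return kind
```

A check like this would have caught the incident's trigger, because the only thing that changed in the pipeline run was the target ID.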

Medium term (months 2-3): We created a second AWS Organization for PCI-DSS workloads. The Management Account of the new org uses MFA with a hardware token and access restricted to two senior engineers. SCPs in the new org are managed by a separate pipeline with mandatory two-reviewer approval via pull request. CloudTrail Lake was configured with a cross-organization Event Data Store using the organizationEnabled + delegated administrator feature, aggregating events from both orgs into a single Athena repository.

Observability: We added an EventBridge rule in each org's Management Account that captures organizations.amazonaws.com events of type AttachPolicy and DetachPolicy and publishes to an SNS topic with subscriptions to the security Slack channel and PagerDuty. MTTD for this type of change dropped from ~31 minutes (the time it took to correlate during the incident) to under 2 minutes in post-implementation validation tests.
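A rule of this kind can be expressed as the following event pattern. It is a sketch, not the rule the team deployed: it assumes the events arrive as CloudTrail API-call events, and the `matches` helper is a hypothetical, simplified matcher that supports only the exact-value subset of EventBridge pattern syntax used here.

```python
import json

ORG_POLICY_EVENT_PATTERN = {
    "source": ["aws.organizations"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["organizations.amazonaws.com"],
        "eventName": ["AttachPolicy", "DetachPolicy"],
    },
}


def matches(pattern, event):
    """Return True if `event` satisfies `pattern` (nested dicts and exact-value lists only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Serialized form, as it would be supplied when creating the rule.
PATTERN_JSON = json.dumps(ORG_POLICY_EVENT_PATTERN)
```

Since Organizations is a global service whose API activity is, as far as I know, recorded in us-east-1, the rule most likely needs to live in that region to see these events.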

Single-Org vs. Multi-Org: Real Trade-offs

Dimension	Single Organization	Multiple Organizations
SCP blast radius	Root affects 100% of accounts instantly	Isolated by Organization boundary; changes are independent
Operational complexity	Lower: single landing zone pipeline, single Control Tower	Higher: multiple pipelines, multiple Control Tower enrollments
Cost visibility	Native via Consolidated Billing	Requires cross-account CUR + Athena or AWS Cost Explorer linked accounts
Regulatory isolation (PCI, SOC2)	Possible via OUs, but policy boundary is logical, not physical	Physical boundary between orgs; auditors accept more readily
Management Account compromise	One compromised account = potential access to entire organization	Blast radius limited to specific org; other orgs unaffected
SCP propagation latency	Seconds to a few minutes across all accounts	Same behavior within each org; orgs are independent

The Real Problem with SCPs: No Staging, No Automatic Rollback

One of the most important findings from the postmortem was that the team had treated SCPs like ordinary infrastructure code — with the same deployment pipeline as a security group or IAM role. This is a mental model error.

SCPs are access control policies with immediate propagation and no native progressive rollback mechanism. There is no aws organizations deploy-policy --canary 10%. When you attach-policy to an OU with 200 accounts, all 200 accounts are affected simultaneously. AWS Organizations has no concept of deployment rings for policies.
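Since the service offers no rings, a pipeline has to impose them itself by ordering the attach operations. The sketch below shows one way to derive that order; the environment labels, their ranking, and the function name are assumptions for illustration, and the waiting time between rings is left to the pipeline.

```python
# Assumed rollout order: lowest-risk environments first.
RING_ORDER = {"sandbox": 0, "non-production": 1, "production": 2}


def plan_rings(targets):
    """Group (target_id, environment) pairs into rings in rollout order.

    Returns a list of lists of target IDs; each inner list is one ring.
    """
    rings = {}
    for target_id, environment in targets:
        if environment not in RING_ORDER:
            raise ValueError(f"unknown environment: {environment!r}")
        rings.setdefault(RING_ORDER[environment], []).append(target_id)
    return [sorted(rings[rank]) for rank in sorted(rings)]
```

Each ring would be attached only after the previous one has run without new access-denied errors for an agreed observation period.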

The practical implication is that the SCP change process must be treated like a production database-level change — with a maintenance window, dual approval, and a tested rollback plan. The rollback plan for an SCP is simple: detach-policy. But if you do not know what policy was in place before, or if the change was composed of multiple operations, rollback may not be trivial.

What we implemented: an immutable state registry of SCPs per OU/Root, stored in S3 with versioning enabled and Object Lock in COMPLIANCE mode for 90 days. Before any attach-policy, the pipeline saves the current state. Automated rollback is a Lambda that reads the previous state from S3 and executes the inverse operations. Automated rollback time in tests was 45 seconds — compared to the 7 minutes it took to identify and manually execute during the actual incident.
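The core of such a rollback is a diff between the saved attachment state and the current one. The following is a sketch of that planning step only, with a hypothetical function name and policy IDs; reading the snapshot from S3 and calling the Organizations API are left out.

```python
def plan_rollback(previous_policy_ids, current_policy_ids, target_id):
    """Return the operations that restore `target_id` to its previous SCP attachments.

    Attaches are listed before detaches because Organizations does not allow a
    target to be left with no SCP attached. A fuller version would also need to
    respect the per-target attachment quota.
    """
    previous = set(previous_policy_ids)
    current = set(current_policy_ids)
    to_attach = sorted(previous - current)
    to_detach = sorted(current - previous)
    return (
        [("attach", policy_id, target_id) for policy_id in to_attach]
        + [("detach", policy_id, target_id) for policy_id in to_detach]
    )
```

For the incident itself the plan would have been a single detach of the region-deny policy from the root.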

A frequently overlooked detail: SCPs with explicit Deny take precedence over any Allow in identity policies, including IAM Role policies with AdministratorAccess. This means that not even the root user of a member account can execute actions blocked by an SCP, unless the SCP explicitly exempts it through a condition (for example, aws:PrincipalArn matching arn:aws:iam::*:root; note that the aws:PrincipalType value for the root user is Account, not Root). In our incident, this is what made the situation so severe: there was no escape hatch in the payments account.
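One way to build such an escape hatch is to add a principal exemption to the Deny statement, so the deny applies only when the caller is not a designated break-glass principal. This is a sketch assuming the exemption is expressed with `ArnNotLike` on `aws:PrincipalArn`; the role name `BreakGlassAdmin` and the helper name are hypothetical.

```python
import copy


def add_principal_exemption(statement, exempt_arn_patterns):
    """Return a copy of a Deny statement that skips the listed principal ARN patterns.

    Condition operators in one statement are ANDed, so the deny applies only
    when every existing condition matches and the caller is not exempted.
    """
    exempted = copy.deepcopy(statement)
    condition = exempted.setdefault("Condition", {})
    arn_not_like = condition.setdefault("ArnNotLike", {})
    arn_not_like["aws:PrincipalArn"] = sorted(exempt_arn_patterns)
    return exempted


REGION_DENY = {
    "Sid": "DenyOutsideAllowedRegions",
    "Effect": "Deny",
    "Action": "*",
    "Resource": "*",
    "Condition": {
        "StringNotEquals": {"aws:RequestedRegion": ["sa-east-1", "us-east-1"]}
    },
}

# Hypothetical break-glass role present in every member account.
REGION_DENY_WITH_ESCAPE = add_principal_exemption(
    REGION_DENY, ["arn:aws:iam::*:role/BreakGlassAdmin"]
)
```

The exempted principal then becomes a high-value credential in its own right, so it needs the same access restrictions as the Management Account.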

FinOps in Multi-Org: The Argument That Overcomes Resistance

The most common argument against multiple Organizations is the loss of consolidated cost visibility. That argument was valid in 2018. In 2024, it is a solved problem — with some important caveats.

AWS Cost and Usage Report (CUR 2.0) can be configured for delivery to a centralized S3 bucket in a dedicated billing account, even across multiple Organizations, using a cross-account S3 bucket policy pattern with s3:PutObject allowed for the billingreports.amazonaws.com service principal from multiple Management Account IDs. Athena + AWS Glue Crawler over this data produces a unified cost view that the CFO can consume via QuickSight with row-level security per business unit.

What is not natively solved: Reserved Instances and Savings Plans are not shared across Organizations. This is a real cost. In our analysis, the payments account used approximately $18k/month in Compute Savings Plans that, upon moving to a new Organization, could no longer be shared with tooling accounts in the original org. The solution was to consolidate Savings Plans in the new payments org and use On-Demand for tooling workloads with more variable usage — the cost delta was approximately $1.2k/month, which was accepted as the cost of regulatory isolation.

A pattern I recommend: use AWS Cost Categories with rules based on CostCenter and BusinessUnit tags applied via tag policies in both Organizations. This allows financial reporting to be agnostic to Organizations topology — the CFO sees by cost center, not by org.
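The effect of tag-based reporting can be shown with a small aggregation over billing line items from both Organizations. This is a sketch with assumed field names (`org`, `cost_center`, `unblended_cost`) and a hypothetical function; real CUR columns are named differently and would be queried through Athena rather than in memory.

```python
from collections import defaultdict
from decimal import Decimal


def cost_by_cost_center(line_items):
    """Sum cost per cost-center tag, ignoring which Organization the item came from."""
    totals = defaultdict(Decimal)
    for item in line_items:
        center = item.get("cost_center") or "untagged"
        totals[center] += Decimal(str(item["unblended_cost"]))
    return dict(totals)
```

Items without the tag land in an `untagged` bucket, which is usually the first number worth driving down once tag policies are in place.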

AWS Well-Architected: Affected Pillars

security: SCPs must have a change management lifecycle equivalent to production database changes. Add an explicit escape hatch to critical SCPs through a condition that exempts a tightly controlled principal (for example, aws:PrincipalArn matching the account root user or a break-glass role). Implement EventBridge + SNS for immediate detection of policy attach/detach in the Management Account. Consider Organization boundary as physical isolation for PCI-DSS and BACEN 4.893 regimes.
reliability: Organizations design must minimize the blast radius of operational changes. Circuit breakers in downstream services (EKS/Resilience4j, Lambda with Dead Letter Queue) are necessary but insufficient — they mask the error without resolving the cause. Add specific health checks for AccessDeniedException with low-latency alarms (< 5 minutes MTTD). Implement automated SCP rollback with immutable state in S3 Object Lock.

Anti-Patterns That Led to the Incident

Treating SCPs as ordinary infrastructure in the CI/CD pipeline without differentiated approval by target level (Root, production OU, sandbox OU)
Using a single Organization to consolidate cost governance without evaluating the blast radius of security policies on regulated workloads
Configuring 5xx alarms on API Gateway as the only detection signal without specific alarms for AccessDeniedException in CloudTrail
Assuming that circuit breakers in downstream services substitute for architectural isolation — they are complementary, not equivalent
Failing to map cross-region dependencies of accounts before applying SCPs with aws:RequestedRegion conditions
Relying on OUs as regulatory isolation boundaries for PCI-DSS auditors without explicitly documenting that the boundary is logical, not physical
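The cross-region dependency mapping mentioned in the list above can be reduced to a pre-flight comparison between the regions each account actually uses and the regions the SCP would allow. A minimal sketch, assuming the observed regions have already been collected (for example from CloudTrail); the function name and input shape are hypothetical.

```python
def find_region_conflicts(observed_regions_by_account, allowed_regions):
    """Return {account_id: [regions in use that the SCP would block]}."""
    allowed = set(allowed_regions)
    conflicts = {}
    for account_id, regions in observed_regions_by_account.items():
        blocked = sorted(set(regions) - allowed)
        if blocked:
            conflicts[account_id] = blocked
    return conflicts
```

An empty result is a precondition for attaching the policy; any entry names an account whose owners need to be consulted first.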

Architect's Note: After this incident, I started recommending a simple rule: if you have workloads with distinct regulatory regimes (PCI-DSS, SOC 2, BACEN) or with availability SLOs above 99.9%, they belong in separate Organizations — not separate OUs. The additional operational cost of multiple Organizations is real, but it is a predictable and manageable engineering cost; the cost of a 47-minute incident on a payments pipeline is not. The hardest lesson was realizing we treated SCPs as code when we should have treated them like production database schema changes: with staging, dual approval, tested rollback, and a maintenance window. That is not bureaucracy — it is reliability engineering applied to the control plane.

Verdict: When to Use One or Multiple Organizations

Use a single AWS Organization when: all your workloads share the same regulatory regime, the same availability SLO, and the platform team has capacity to implement rigorous guardrails in the SCP change pipeline. Use multiple Organizations when: you have distinct regulatory regimes (especially PCI-DSS or BACEN 4.893), SLOs above 99.9% on critical workloads, or when auditors require evidence of physical policy isolation. The cost of non-shared Savings Plans is quantifiable and generally lower than the cost of a single incident caused by unrestricted blast radius. The decision is not about operational simplicity — it is about where you accept that inevitable human error has consequences.

References

Originally published at fernando.moretes.com. By Fernando F. Azevedo — Senior Solutions Architect.
