Why API Breaking Changes Still Reach Production Even With CI/CD

A few years ago I watched a "tiny" API change take down checkout for about forty minutes. The change was a one-liner. The pull request had two approvals. CI was green across the board. And it still broke production, because the thing that actually mattered was never tested.

If you run microservices at any real scale, you have lived some version of this. Let's talk about why it keeps happening even with a mature pipeline, and what the teams who don't keep getting paged do differently.

The Problem

Here's the change that caused the outage. A payments service had a response that looked like this:

{
  "status": "ok",
  "transaction_id": "txn_8842",
  "amount_cents": 4200
}

Someone renamed amount_cents to amount and switched it to a decimal, because "cents is confusing." Cleaner field, better docs. The producing service's tests were updated to match, everything passed, it shipped.

The problem: three downstream services still read amount_cents. One of them was the order service, which now received undefined, multiplied it by a quantity, and wrote NaN into the database. The failures didn't even surface in the payments service. They surfaced two hops away, in a service the original author had never opened.

This is the core issue. A breaking change is not defined by the service that makes it. It's defined by the consumers who depend on it. And the producer's CI pipeline has no idea those consumers exist.

Why Existing Approaches Fail

The natural reaction is "we need more tests." But look at what each layer actually checks.

Unit tests verify the code does what the author intended. The author intended to rename the field. The unit tests were updated to expect amount. They passed because they were testing the new, broken behavior. Green unit tests told us nothing.

Integration tests verify the service works with its own dependencies — its database, its cache, the APIs it calls. They almost never spin up the services that call it. The payments service had no reason to boot the order service in its pipeline, so the incompatibility was invisible.

End-to-end tests can catch this in theory. In practice they're slow, flaky, and incomplete. Nobody has an E2E test for every consumer's every field access. The order service's amount_cents read wasn't in any E2E path that ran on the payments PR. E2E suites also tend to test happy paths through the UI, not the specific data contracts between internal services.

Schema validation in CI feels like the answer, and it's closer. But most teams validate that their OpenAPI spec is well-formed, not that it's compatible with the previous version. A spec that renames a field is still a perfectly valid spec. Linting passes. The document is correct. It's just incompatible.

The gap is structural, not a matter of test coverage. Every one of these layers checks a service against its own expectations. None of them check a service against what its consumers actually depend on. That's the missing layer.

A Better Approach

You need two things the pipeline above doesn't have: a machine-readable contract, and a check that compares the new contract to what consumers rely on — running before the change merges.

There are two ways to get there, and they're complementary.

1. Diff the contract against its own previous version. If you publish an OpenAPI spec, you can compare the PR's spec to the one currently in production and classify the differences. Removing a field, renaming it, tightening a type, adding a required request parameter — these are breaking. Adding an optional field is not. This catches the obvious regressions cheaply and needs zero coordination with other teams.

2. Diff the contract against what consumers actually use. This is consumer-driven contract testing. Each consumer publishes the subset of the API it depends on. The producer's pipeline checks every change against the union of those expectations. If something still reads amount_cents, removing it fails the build — on the producer's PR, before merge.
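
The second approach can be sketched in a few lines. This is a minimal illustration, not how Pact or any particular tool stores contracts: the `CONSUMER_CONTRACTS` table, the `ledger-service` name, and the `broken_consumers` helper are all hypothetical, and each contract is reduced to the set of response fields a consumer reads.

```python
from typing import Dict, Set

# Hypothetical contracts: each consumer declares the response fields it reads.
CONSUMER_CONTRACTS: Dict[str, Set[str]] = {
    "order-service": {"transaction_id", "amount_cents"},
    "ledger-service": {"transaction_id", "status"},
}


def broken_consumers(
    producer_fields: Set[str], contracts: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Map each affected consumer to the fields it needs that are now missing."""
    broken: Dict[str, Set[str]] = {}
    for consumer, needed in contracts.items():
        missing = needed - producer_fields
        if missing:
            broken[consumer] = missing
    return broken


old_fields = {"status", "transaction_id", "amount_cents"}
new_fields = {"status", "transaction_id", "amount"}

print(broken_consumers(old_fields, CONSUMER_CONTRACTS))  # {}
print(broken_consumers(new_fields, CONSUMER_CONTRACTS))  # {'order-service': {'amount_cents'}}
```

A non-empty result is what would fail the producer's build, and it names the consumer to talk to.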

Here's how that reshapes the flow:

flowchart TD
    A[Producer opens PR] --> B[Generate OpenAPI spec from code]
    B --> C{Diff vs production spec}
    C -->|Backward compatible| D{Check consumer contracts}
    C -->|Breaking change| F[Fail build + report removed fields]
    D -->|All satisfied| E[Merge allowed]
    D -->|Consumer dependency broken| F
    F --> G[Producer notified pre-merge]
    G --> H[Coordinate version or fix]

The tradeoff worth naming: approach 1 is nearly free, but it only compares the spec with its own previous version. It can't tell you whether anyone actually reads the field you removed. Approach 2 catches real-world breakage but requires consumers to publish and maintain their contracts, which is organizational work, not just technical. Most teams I've worked with start with the diff (immediate value, no buy-in needed) and layer in consumer contracts for the high-blast-radius services first.

Example

Start with the spec diff, because it pays off on day one. Given the production spec and the PR's spec, a tool like oasdiff classifies every change. The output for our rename looks roughly like this:

$ oasdiff breaking production.yaml pr.yaml

1 breaking changes:

error, in components/schemas/Transaction
  property 'amount_cents' removed from response of
  GET /transactions/{id} (200)

That's the whole outage, surfaced in one command. The interesting part is the OpenAPI definition driving it — note that removing amount_cents and adding amount are two separate changes, and only one of them is dangerous:

# production.yaml
components:
  schemas:
    Transaction:
      type: object
      required: [status, transaction_id, amount_cents]
      properties:
        status: { type: string }
        transaction_id: { type: string }
        amount_cents: { type: integer }

# pr.yaml  — amount_cents gone, amount added
components:
  schemas:
    Transaction:
      type: object
      required: [status, transaction_id, amount]
      properties:
        status: { type: string }
        transaction_id: { type: string }
        amount: { type: number }
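
To make the "two separate changes" point concrete, here is a rough sketch of the comparison over the two schemas above, written as plain Python dicts. It is not oasdiff's algorithm; `diff_response_schema` is a hypothetical helper that only looks at property names and types in a response body.

```python
from typing import Any, Dict, List, Tuple

Schema = Dict[str, Any]

production: Schema = {
    "required": ["status", "transaction_id", "amount_cents"],
    "properties": {
        "status": {"type": "string"},
        "transaction_id": {"type": "string"},
        "amount_cents": {"type": "integer"},
    },
}

pr: Schema = {
    "required": ["status", "transaction_id", "amount"],
    "properties": {
        "status": {"type": "string"},
        "transaction_id": {"type": "string"},
        "amount": {"type": "number"},
    },
}


def diff_response_schema(old: Schema, new: Schema) -> List[Tuple[str, str]]:
    """Classify property-level changes in a response schema."""
    old_props, new_props = old["properties"], new["properties"]
    changes: List[Tuple[str, str]] = []
    # A consumer may still read a removed property, so removal is breaking.
    for name in sorted(old_props.keys() - new_props.keys()):
        changes.append(("breaking", f"property '{name}' removed"))
    # Existing consumers can ignore a property they never asked for.
    for name in sorted(new_props.keys() - old_props.keys()):
        changes.append(("non-breaking", f"property '{name}' added"))
    # A changed type can break parsing on the consumer side.
    for name in sorted(old_props.keys() & new_props.keys()):
        if old_props[name]["type"] != new_props[name]["type"]:
            changes.append(("breaking", f"property '{name}' changed type"))
    return changes


for severity, message in diff_response_schema(production, pr):
    print(f"{severity}: {message}")
# breaking: property 'amount_cents' removed
# non-breaking: property 'amount' added
```

The rename shows up as one removal and one addition, and only the removal should block the merge.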

Now wire it into CI so it gates the merge instead of living in someone's terminal:

# .github/workflows/api-compatibility.yml
name: API Compatibility
on: pull_request

jobs:
  contract-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Fetch the spec currently in production
        run: |
          curl -sf https://api.internal/openapi.yaml -o production.yaml

      - name: Generate the spec from this PR
        run: ./gradlew generateOpenApiSpec   # or your generator

      - name: Fail on breaking changes
        run: |
          docker run --rm -v "$PWD:/specs" tufin/oasdiff \
            breaking /specs/production.yaml /specs/build/openapi.yaml \
            --fail-on ERR

--fail-on ERR is the line that matters (oasdiff spells the severity levels ERR, WARN, and INFO). Without it the diff is a report nobody reads; with it, the rename never merges. Run it on every PR, not nightly: the entire point is to catch the change while the author still has context, not eight hours later.

For the consumer-driven side, the mechanics differ by tool (Pact is the common one), but the principle is identical: the producer's pipeline downloads the consumers' recorded expectations and verifies the new build still satisfies them. Same gate, broader truth.

Lessons Learned

A few things I only learned by getting them wrong, mostly while building internal governance tooling for a service fleet that had outgrown anyone's ability to reason about it by hand.

Generate the spec from code, never the reverse. Hand-maintained OpenAPI files drift from the implementation within weeks, and then your compatibility check is comparing two fictions. If the spec comes out of the running code (annotations, reflection, whatever your stack offers), the diff is comparing reality to reality.
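
As a rough illustration of what "out of the running code" means, the sketch below derives a schema from a Python dataclass by reflection. Real stacks do this with their framework's generator; the `Transaction` class, the `OPENAPI_TYPES` mapping, and `schema_from_dataclass` are assumptions made up for this example and handle only primitive field types.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Any, Dict, get_type_hints

# Assumed mapping from Python primitives to OpenAPI type names.
OPENAPI_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


@dataclass
class Transaction:
    status: str
    transaction_id: str
    amount_cents: int


def schema_from_dataclass(cls: type) -> Dict[str, Any]:
    """Build an OpenAPI-style object schema from a dataclass's fields."""
    hints = get_type_hints(cls)
    properties = {f.name: {"type": OPENAPI_TYPES[hints[f.name]]} for f in fields(cls)}
    # Fields without a default are the ones a payload must always carry.
    required = [
        f.name
        for f in fields(cls)
        if f.default is MISSING and f.default_factory is MISSING
    ]
    return {"type": "object", "required": required, "properties": properties}


print(schema_from_dataclass(Transaction))
```

Because the schema is computed from the class the service actually serializes, renaming the field in code changes the spec in the same commit, and the diff has something real to compare.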

"Breaking" needs a precise, boring definition. Our first version flagged every change and developers learned to ignore it inside a week. Alert fatigue kills these tools faster than bugs do. Sit down and write the actual rules: removing a response field is breaking, adding an optional request field is not, changing a type is breaking, adding an enum value is breaking for responses but not requests. Encode that, and trust comes back.
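
One way to write those rules down is as a lookup table that both the checker and the humans can read. This is only a sketch of the rules listed above: the change-kind names, the `Direction` enum, and `is_breaking` are hypothetical, and raising on an unlisted change is a design choice of this example so that a gap in the rules gets written down instead of guessed.

```python
from enum import Enum
from typing import Dict, Tuple


class Direction(Enum):
    REQUEST = "request"
    RESPONSE = "response"


# (change kind, direction) -> does it break existing consumers?
RULES: Dict[Tuple[str, Direction], bool] = {
    ("field_removed", Direction.RESPONSE): True,
    ("optional_field_added", Direction.REQUEST): False,
    ("type_changed", Direction.REQUEST): True,
    ("type_changed", Direction.RESPONSE): True,
    # A consumer may not know how to handle a value it has never seen.
    ("enum_value_added", Direction.RESPONSE): True,
    # Old callers simply never send the new value.
    ("enum_value_added", Direction.REQUEST): False,
}


def is_breaking(kind: str, direction: Direction) -> bool:
    """Look up the written-down rule; refuse to guess when none exists."""
    try:
        return RULES[(kind, direction)]
    except KeyError:
        raise ValueError(f"no rule for {kind!r} in {direction.value}") from None


print(is_breaking("enum_value_added", Direction.RESPONSE))  # True
print(is_breaking("enum_value_added", Direction.REQUEST))   # False
```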

The org problem is harder than the code problem. The diff was a weekend. Getting teams to treat a red compatibility check as a real blocker — that took quarters. The check only works if "the consumer contract failed" carries the same weight as "the tests failed." Until then it's advisory, and advisory checks get clicked through at 6 PM on a Friday.

Version numbers are a promise, not a mechanism. Bumping to /v2 doesn't stop a v1 consumer from breaking; it just gives you somewhere to put the new shape. Something still has to enforce that v1 keeps its contract. The version is the label on the box, not the lock.

Conclusion

Breaking changes reach production because every standard testing layer validates a service against its own expectations, and a breaking change is by definition about someone else's expectations. CI being green means "this service is internally consistent" — which is exactly what you'd see right before the field you renamed takes down three downstream consumers.

Closing that gap doesn't take a rewrite. It takes one new gate: a contract, generated from code, diffed on every PR, configured to actually fail the build. Start with the cheap spec diff today; add consumer-driven contracts for your highest-blast-radius services as you go.

So here's what I'm curious about: how does your team draw the line on what counts as a "breaking" change — and is that definition written down anywhere, or does it live in the head of whoever reviews the PR?