Postmortem: Jaeger 1.49 Bug Caused Lost Traces for 1 Hour During Payment Outage
Published: October 12, 2023 | Status: Resolved
Executive Summary
On October 10, 2023, between 14:00 and 15:00 UTC, a bug in Jaeger 1.49's trace ingestion pipeline caused complete loss of distributed traces for our payment processing service. The outage coincided with a critical payment gateway failure, leaving engineering teams without observability data to debug the primary incident. The bug was identified and mitigated (by downgrading the Collector to 1.48.2) within 40 minutes of detection, with trace ingestion restored by 14:45 UTC. No permanent data loss occurred outside the 1-hour window.
Timeline of Events
- 14:00 UTC: Payment gateway provider reports elevated 5xx error rates for our merchant ID.
- 14:05 UTC: On-call engineers trigger incident response, attempt to pull Jaeger traces for payment service to identify failure patterns.
- 14:07 UTC: Engineers notice no new traces appearing in Jaeger UI for payment service, despite high request volume.
- 14:15 UTC: Initial investigation rules out Jaeger storage (Elasticsearch) issues: storage cluster is healthy, no write errors.
- 14:22 UTC: Team identifies Jaeger Collector pods running version 1.49 are rejecting 100% of the payment service's incoming trace batches with "400 Bad Request" errors.
- 14:30 UTC: Bug triage confirms regression in Jaeger 1.49's protobuf deserialization logic for batch trace payloads larger than 1MB.
- 14:35 UTC: Temporary fix deployed: downgrade Jaeger Collector to 1.48.2, a known-stable release.
- 14:45 UTC: Trace ingestion restored, all new traces visible in Jaeger UI.
- 15:00 UTC: Payment gateway issue resolved by provider, full service recovery confirmed.
Root Cause Analysis
Jaeger 1.49 introduced a change to the Collector's gRPC ingestion path to support a new OpenTelemetry trace format field (otlp.trace.batch.additional_attributes). The implementation included a flawed bounds check for protobuf-encoded batch payloads: when a trace batch exceeded 1MB, the deserializer miscalculated the remaining buffer length and panicked. The panic was caught by the gRPC server's default error handler and surfaced to all clients as a 400 Bad Request.
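For illustration only, the Go sketch below shows the general class of bounds-check error described above. It is not the actual Jaeger 1.49 deserializer; readBatch, chunkLimit, and the length-prefixed framing are invented for this example. The point is simply how a "remaining bytes" calculation made against the wrong buffer can pass its check and then panic once a payload crosses a size threshold.

```go
// Hypothetical sketch of the bug class described above; NOT the actual Jaeger
// 1.49 source. A batch reader copies the payload into a fixed 1 MiB working
// buffer but measures "remaining" bytes against the full payload, so
// well-formed payloads larger than 1 MiB pass the bounds check and then panic.
package main

import (
	"encoding/binary"
	"fmt"
)

const chunkLimit = 1 << 20 // 1 MiB, the threshold observed in the incident

// readBatch walks length-prefixed messages in payload and returns them.
func readBatch(payload []byte) ([][]byte, error) {
	// Copy the payload into a fixed-size working buffer, silently truncated
	// at chunkLimit; the truncation plus the mismatched check below is the bug.
	winLen := len(payload)
	if winLen > chunkLimit {
		winLen = chunkLimit
	}
	window := make([]byte, winLen)
	copy(window, payload)

	var msgs [][]byte
	offset := 0
	for offset+4 <= len(payload) {
		// Each message is prefixed with a 4-byte big-endian length.
		msgLen := int(binary.BigEndian.Uint32(payload[offset:]))
		offset += 4

		// BUGGY bounds check: "remaining" is measured against the full payload,
		// not the truncated window that the slice below actually reads from, so
		// the check passes and the slice panics once the offset crosses 1 MiB.
		remaining := len(payload) - offset
		// A correct reader would size the window to the whole payload (or bound
		// remaining by len(window)), so oversized batches decode or fail cleanly.
		if msgLen > remaining {
			return nil, fmt.Errorf("message of %d bytes exceeds remaining %d", msgLen, remaining)
		}
		msgs = append(msgs, window[offset:offset+msgLen]) // panics once offset+msgLen > len(window)
		offset += msgLen
	}
	return msgs, nil
}

func main() {
	// A well-formed ~1.2 MiB payload of 4 KiB messages: with the buggy check it
	// panics ("slice bounds out of range") as soon as the offset crosses 1 MiB;
	// payloads under 1 MiB decode without error.
	msg := make([]byte, 4096)
	var lenBuf [4]byte
	var payload []byte
	for len(payload) < 1_200_000 {
		binary.BigEndian.PutUint32(lenBuf[:], uint32(len(msg)))
		payload = append(payload, lenBuf[:]...)
		payload = append(payload, msg...)
	}
	msgs, err := readBatch(payload)
	fmt.Println(len(msgs), err)
}
```

Because the failure only appears past the 1MB mark, every oversized payment service batch hit it, while smaller services' batches decoded normally, matching what we observed in production.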
Our payment service generates large trace batches during peak traffic (average 1.2MB per batch) due to high span counts per payment transaction (including 12+ downstream service calls: fraud check, ledger, notification, etc.). This meant 100% of payment service trace batches triggered the bug, while lower-traffic services with smaller batches were unaffected.
The bug was not caught in staging because our staging payment service load tests used batch sizes capped at 800KB, below the 1MB threshold. Production traffic consistently exceeded the 1MB threshold during peak hours.
Impact Assessment
Total impact window: 1 hour (14:00 – 15:00 UTC). Key impacts:
- Observability Loss: 0% of payment service traces ingested during the window, leaving no distributed tracing data to debug the concurrent payment gateway outage.
- Incident Response Delay: 20-minute delay in identifying the root cause of the payment gateway issue, as engineers relied on application logs and metrics alone (no trace correlation for latency spikes).
- Customer Impact: 0.8% of payment transactions failed during the window (attributed to the gateway outage, not the Jaeger bug). No customer data was lost or corrupted.
- Compliance: No violation of PCI-DSS or SOC2 controls, as trace data is non-sensitive and retention policies allow for 1-hour gaps in non-critical observability data.
Remediation Steps
Immediate actions taken during the incident:
- Downgraded Jaeger Collector deployment from 1.49 to 1.48.2, which does not include the flawed deserializer change.
- Validated trace ingestion for all services and confirmed payment service traces were flowing.
- Coordinated with payment gateway provider to resolve their 5xx error issue.
Post-incident patches:
- Jaeger upstream was notified of the bug via GitHub issue #4892, with a reproducer and proposed fix for the bounds check logic.
- Jaeger 1.49.1 was released 3 days post-incident, including the fix for the protobuf deserialization bug.
- Our team upgraded to Jaeger 1.49.1 after 2 weeks of staging validation.
Prevention Measures
To prevent recurrence, the following changes were implemented:
- Staging Load Test Updates: Updated staging payment service load tests to generate production-like trace batch sizes (1.2MB+) to catch size-based regressions (a sketch of such a generator appears after this list).
- Jaeger Version Policy: Added a 2-week staging validation period for all Jaeger version upgrades, with explicit checks for trace ingestion across all high-traffic services.
- Observability Redundancy: Deployed a secondary trace collection pipeline using Grafana Tempo as a fallback, to eliminate single points of failure for critical service observability (also illustrated in the sketch after this list).
- Alerting Improvements: Added an alert for Jaeger Collector 4xx error rates >1%, with automatic paging to the on-call team, to reduce time to detection for ingestion issues.
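The Go sketch below illustrates the first and third measures together: a staging load generator that emits production-sized trace batches and exports them to both Jaeger and Tempo. It assumes the generator uses the OpenTelemetry Go SDK with OTLP over gRPC; the endpoint names, batch sizes, span names, and padding amounts are placeholders, not our actual configuration.

```go
// Sketch of a staging load generator with production-sized batches and a
// dual Jaeger/Tempo export path. Assumptions not stated in this postmortem:
// the generator uses the OpenTelemetry Go SDK, both backends accept OTLP over
// gRPC, and all endpoints and sizes below are illustrative placeholders.
package main

import (
	"context"
	"fmt"
	"log"
	"strings"
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newOTLPExporter returns an OTLP/gRPC span exporter for the given endpoint.
func newOTLPExporter(ctx context.Context, endpoint string) (sdktrace.SpanExporter, error) {
	return otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint),
		otlptracegrpc.WithInsecure(), // staging only; use TLS in production
	)
}

func main() {
	ctx := context.Background()

	// Primary pipeline (Jaeger Collector) and fallback pipeline (Grafana Tempo).
	// Registering two batch processors gives every span two independent export
	// paths, so an ingestion regression in one backend no longer blinds us.
	jaegerExp, err := newOTLPExporter(ctx, "jaeger-collector.observability:4317")
	if err != nil {
		log.Fatal(err)
	}
	tempoExp, err := newOTLPExporter(ctx, "tempo-distributor.observability:4317")
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(jaegerExp,
			sdktrace.WithMaxExportBatchSize(512), // large batches, to mirror production
			sdktrace.WithBatchTimeout(5*time.Second),
		),
		sdktrace.WithBatcher(tempoExp,
			sdktrace.WithMaxExportBatchSize(512),
			sdktrace.WithBatchTimeout(5*time.Second),
		),
	)
	defer tp.Shutdown(ctx)

	// Emit payment-shaped traces: a root span plus 12 downstream calls per
	// transaction, each padded with ~8KB of attribute data so exported batches
	// land well past the 1MB threshold that triggered the 1.49 regression.
	tracer := tp.Tracer("payment-loadtest")
	padding := strings.Repeat("x", 8*1024)
	for txn := 0; txn < 200; txn++ {
		txnCtx, root := tracer.Start(ctx, "payment.transaction")
		for call := 0; call < 12; call++ {
			_, span := tracer.Start(txnCtx, fmt.Sprintf("downstream-call-%02d", call))
			span.SetAttributes(attribute.String("loadtest.padding", padding))
			span.End()
		}
		root.End()
	}
	fmt.Println("load-test spans queued for export to both pipelines")
}
```

The load profile matters more than the exact SDK calls here: the staging gap existed because test batches stayed under 800KB, so the batch size and span padding are what the new tests pin down. An OpenTelemetry Collector fanning out to both backends would provide the same redundancy as the two in-process batchers shown.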
Conclusion
This incident highlighted the risk of relying on a single observability tool for critical service debugging, especially during concurrent outages. The Jaeger 1.49 bug was a regression that slipped past staging tests because our load parameters did not reflect production batch sizes, but a quick downgrade and upstream collaboration resolved the issue with minimal customer impact. The prevention measures implemented will harden our observability stack against similar regressions in the future.