How Uber Built Its Big Data System — From a Few TBs to 350 Petabytes with Sub-Hour Latency

TL;DR: Uber's data platform handles 6 trillion rows/day, 350 PB of storage, and 4 million analytical queries/week. They got there by rebuilding their entire architecture from scratch — not once, but three times. This post breaks down why, how, and what every data engineer can learn from it.

What is Uber's Big Data Platform?

Uber's Big Data Platform is a distributed data infrastructure that processes over 350 petabytes of data, ingests 6 trillion rows daily from hundreds of microservices, and serves 4 million analytical queries per week with sub-30-minute latency — built on HDFS, Apache Hudi, Apache Spark, and Presto.

Every ride you book on Uber leaves a digital footprint. Not just one row — but dozens: pickup location, dropoff, driver rating, ETA calculation, surge pricing events, payment processing, trip status updates. That's just one ride, for one user, in one city.

Multiply that by millions of daily rides across hundreds of cities, add Uber Eats, Uber Freight, and constantly launching new features — the result is 6 trillion rows ingested every day, totaling 350 petabytes of storage.

The central question: When a company grows at Uber's breakneck speed, how do you design a data system so it doesn't get crushed by its own data?

Uber's journey is a story of continuously recognizing the limits of old architectures and bravely re-architecting from scratch — not once, but three times.

Why Uber Needed Non-Standard Big Data

Uber's Unique Data Characteristics

Most companies' data is append-only: logs are written and never modified. But Uber's data is inherently mutable:

A driver rating might be updated the next day.
A fare might be adjusted days later due to a dispute.
A backfill might occur weeks later due to business rule changes.

This is the crux of why standard industry solutions were insufficient. While the rest of the industry was building append-only data lakes, Uber fundamentally needed something else.

Furthermore, Uber serves three user groups with completely different needs — simultaneously, on a single platform:

City Ops (thousands of users): Need simple, near-real-time SQL for daily operational decisions.
Data Scientists/Analysts (hundreds of users): Need full datasets for long-term modeling and forecasting.
Engineering Teams (hundreds of users): Need data for automated pipelines such as fraud detection and driver onboarding.

The Scale of the Problem (as of 2024–2025)

19,500 managed Hudi datasets
6 trillion rows ingested daily / 3 million new data files daily
350 petabytes across HDFS and Google Cloud Storage
350,000 commits daily / 70,000 daily table service operations
4 million analytical queries/week via Presto and Spark

Generation 1: "Warehouse-as-Lake" and Its Costs

Initial Architecture (Pre-2014)

Before 2014, Uber's data was scattered across various OLTP databases — MySQL, PostgreSQL. Any engineer needing data had to know exactly which database to query and write custom code to join across sources. Total volume was just a few terabytes.

This worked when Uber was small. But as the business began to scale exponentially, the lack of consistency became a disaster.

Architecture Pre-2014

Gen 1: Data Warehouse with Vertica

The first solution was building a centralized data warehouse. Uber chose Vertica for its columnar design — fast and scalable compared to other options at the time.

The architecture was simple: multiple ad hoc ETL jobs copied data from various sources (AWS S3, OLTP databases, service logs) into Vertica. SQL was standardized as the single interface.

The initial result was a massive success — for the first time, everyone had a global view. Hundreds of new users emerged. City operators started using SQL for decision-making. New ML and experimentation teams formed.

Uber: BigData Platform Gen 1

When Success Becomes a Burden

However, this popularity was also the beginning of the problem:

Schema drift was the first issue. Source data was primarily in JSON — a flexible format without strict schema enforcement. If a team quietly changed their data structure (e.g., renaming a column), downstream ETL pipelines would break without warning.

Duplicate ingestion complicated everything. Because ETL jobs were ad hoc, the same dataset could be ingested multiple times with different transformations. Multiple versions of the "truth" co-existed, and nobody knew which was correct.

Scaling costs became irrational. The warehouse wasn't horizontally scalable. As data grew, Uber had to delete old data to make room — an analytics system that couldn't retain history.

The warehouse became a single point of failure. Everything ran on Vertica. This was no longer a data warehouse; it was a data monolith.

Generation 2: Hadoop Data Lake and the Lesson of Incremental Processing

Re-architecting Around the Hadoop Ecosystem

The core decision of Gen 2 was radical: completely separate raw ingestion from data modeling.

The design principles were simple but crucial:

Raw data enters Hadoop once, with no transformation during ingestion (EL, not ETL).
All transformations occur inside Hadoop via horizontally scalable batch jobs.
Only processed data tables are pushed to Vertica — for city ops requiring fast SQL.

Uber: BigData Platform Gen 2

Technology Stack and Rationale

Uber built Gen 2 on four main technologies:

Technology	Role
Apache Parquet	Columnar storage format enabling efficient compression and faster querying
Apache Presto	Serves instant ad-hoc queries (results in seconds, not hours)
Apache Spark	Programs complex data pipelines, supporting both SQL and non-SQL
Apache Hive	"Workhorse" specializing in processing massive batch queries

The New Limit: The Snapshot-Based Ingestion Problem

Gen 2 brought an inherent limitation: snapshot-based ingestion.

Because HDFS and Parquet do not support in-place updates, every ingestion job had to recreate an entire snapshot of the dataset. The consequences:

Data latency reached up to 24 hours. For a real-time business like Uber, this was unacceptable.
Irrational waste of compute. A 100TB table with only 100GB changing daily still required scanning the entire 100TB — over 99.9% of compute resources wasted.
HDFS NameNode Bottleneck. Too many small files from ad-hoc jobs began overloading the NameNode.

By early 2017, with over 100 Petabytes stored and 100,000 vcores, the system had hit its scalability limits.

Apache Hudi is Born: When There is No Solution, Build It

Uber: BigData Platform Gen 3

A Problem with No Market Solution

Uber needed three things simultaneously:

Upsert operations on HDFS/Parquet — which is append-only storage.
Incremental reads — reading only changed data since the last checkpoint.
Sub-hour latency instead of 24 hours.

There was no technology on the market in early 2017 that could simultaneously meet all three requirements. Uber was forced to build it themselves.

How Hudi Works

Hadoop Upserts anD Incremental (Hudi) is a Spark library that creates an abstraction layer over HDFS and Parquet — adding upsert and incremental capabilities.

The central mechanism is Copy-on-Write (CoW): when a record needs an update, Hudi rewrites the entire Parquet file containing that record. Write costs increase, but reading remains extremely fast because there is no complex merge logic required.

Hudi provides two ways to read data:

Snapshot view: Returns a holistic view of the entire table at a point in time.
Incremental view: Returns only records inserted or updated since a specific checkpoint timestamp.

Result: Data latency dropped from 24 hours to under 1 hour.

Standardized Data Model: Two Types of Tables

Uber standardized raw data organization into two primary table types:

Changelog History Table: Stores the entire history of changelogs from upstream systems. Can be sparse — upstream systems may only send partial rows for changed columns.
Merged Snapshot Table: Provides a unified, up-to-date view — all columns present for a given key, regardless of update history complexity.

Marmaray: The Ingestion Platform of Gen 3

Hudi solved the storage layer problem. But what system would feed data into Hudi?

Uber built Marmaray — a generic data ingestion platform operating on a mini-batch model running every 10–15 minutes, consuming changelogs from Apache Kafka and applying changes to Hadoop via Hudi.

Uber established an unbreakable rule: absolutely no data transformation during ingestion. Marmaray is EL (Extract-Load), not ETL. Raw data must remain 100% intact upon entering Hadoop. Result: 30-minute raw data latency end-to-end.

Evolution: From Marmaray to Apache Flink (IngestionNext)

While Marmaray reduced latency to 30 minutes, it eventually revealed new limits. For highly time-sensitive workloads, 30 minutes is still too long. Uber launched IngestionNext — shifting from Marmaray (Spark) to Apache Flink for streaming-native ingestion.

Result: Data latency dropped from 30 minutes to under 15 minutes, while significantly cutting hardware resource consumption.

Comparison of the 3 Architectural Generations

	Gen 1 (Vertica)	Gen 2 (Hadoop)	Gen 3 (Hudi)
Scale	Few TBs	~100 PB	350 PB
Latency	Minutes	24 hours	<30 minutes
Upsert support	✅	❌	✅
Horizontal scale	❌	✅	✅
Incremental reads	❌	❌	✅

Hudi at Trillion-Record Scale: 3 Engineering Breakthroughs

Workload Classification at True Scale (2024)

Uber divides its 19,500 datasets into four workload classes:

Append-Only (11,200 tables): Largest volume, bulk insert. Key metric: ingestion speed.
Upsert-Heavy (4,400 tables): Mutable business states with high update frequency. Key metric: write throughput.
Derived (1,600 tables): ETL and ML pipelines using incremental reads. Key metric: correctness.
Realtime Flink-Native Streaming (500 tables): SLO for freshness under 15 minutes. Key metric: latency.

Breakthrough 1: Metadata Table (MDT) — Solving the File Listing Problem

The problem: With tens of thousands of datasets, each with thousands of partitions containing millions of files — even a simple file listing became a bottleneck. The HDFS NameNode was completely overwhelmed.

The solution: MDT is a key-value store managed directly by Hudi using HFile (SSTable-based) to store:

File listing metadata per partition
Column-level statistics (min/max per Parquet file) for query pruning
Bloom filters for fast record lookups during upserts

Algorithmic complexity dropped from O(n) linear scans to O(1) per lookup.

Production impact: Deployed across 90%+ of datasets. HDFS NameNode bottleneck completely eliminated.

Breakthrough 2: Record Index (RI) — Upserts on Trillion-Row Tables

The problem: To perform an upsert, the system must identify exactly which Parquet file contains the record to be updated. At hundreds of billions of rows, Bloom filters and file scans are insufficient.

Uber initially tried HBase as an external index — powerful but operationally complex, creating a potential single point of failure.

The solution: Record Index (RI) — an HFile-backed structure inside the MDT — no external servers needed. Maps each record key to a file group with O(1) complexity.

Real production numbers:

3,600 large tables use RI
Largest tables exceed 300 billion rows: indexes sharded into 10,000 HFiles
Lookup latency: 1–2ms per record key
Initialization of a 300B-row table: ~7 hours using ~4,000 executors
Roadmap: Scaling to 1.2 trillion row tables

Breakthrough 3: Multi-Data-Center Reliability

Requirements: Mission-critical datasets must remain highly available and consistent across multiple regions. A single-region outage must not bring down the entire analytics platform.

Architecture:

Primary dataset with a replicated secondary in a different geographical region
Hudi's commit timeline and atomic operations safely propagate all writes to the secondary region
Table availability service monitors regional health and automatically promotes the secondary to primary on disruption
Intelligent query routing: Presto and Spark queries automatically routed to the closest/healthiest region

Uber has also successfully migrated part of its data lake to Google Cloud Storage (GCS) with zero downtime.

Key Takeaways for Data Engineers

There is no perfect first architecture. Nobody chooses perfectly when scale changes exponentially. The goal is to choose something horizontally scalable enough to buy time for the next generation.
Understand the nature of your data first. Uber's entire Gen 2→3 migration was driven by one insight: their data is mutable, not append-only. The wrong assumption about data mutability led to 24-hour latency and massive compute waste.
EL before ETL during ingestion. Marmaray's "no transformation during ingestion" rule gave Uber the flexibility to re-run transformation steps without re-ingesting from source — a massive operational advantage.
When there's no solution, the best companies build their own. Apache Hudi is now used by 65% of the Fortune 500 — born entirely from Uber's specific engineering problem.

References

If you found this useful, I write about data engineering, AI, and large-scale systems on my Substack. Subscribe to get notified when I publish next.