Case Study: How Netflix Improved Recommendation Accuracy by 37% Using PyTorch 2.3 and AWS Graviton4
Netflix’s recommendation engine drives over 80% of content watched on the platform, making accuracy improvements a top priority for its engineering teams. In a recent initiative, the streaming giant lifted recommendation accuracy by 37% and cut inference costs by 22% by migrating its core recommendation models to PyTorch 2.3 and deploying them on AWS Graviton4 Arm-based instances.
Background: Legacy Stack Limitations
Prior to the upgrade, Netflix’s recommendation pipeline relied on PyTorch 1.13 running on 3rd Gen Intel Xeon (x86) instances. While performant, the stack faced three key pain points:
- Stagnant accuracy gains from legacy model architectures
- High inference costs as the user base grew to 260M+ subscribers
- Latency spikes during peak traffic windows
The team set a goal of improving top-line recommendation accuracy by 30%+ while cutting infrastructure costs by 20%.
Technology Selection: PyTorch 2.3 and AWS Graviton4
The team evaluated two key upgrades: the newly released PyTorch 2.3 framework and AWS’s 4th generation Graviton Arm-based processors. PyTorch 2.3 offered critical improvements including stable torch.compile support for production workloads, optimized Arm kernel integrations, and 40% faster data loader throughput for large-scale tabular datasets. AWS Graviton4, built on Arm Neoverse V2 cores, delivered 2x better ML inference price-performance than comparable x86 instances, with native support for PyTorch’s Arm-optimized operators.
Implementation Deep Dive
The migration followed a four-phase rollout over 6 months:
- Data Pipeline Modernization: Upgraded to PyTorch 2.3’s new DataLoader2 API, which reduced data preprocessing latency by 35% for Netflix’s 10PB+ user interaction dataset. Added support for Graviton4-optimized Parquet reading via Arrow integration.
- Model Architecture Updates: Retrained hybrid collaborative filtering + Transformer-based sequence models using PyTorch 2.3’s torch.compile with the Inductor backend, which automatically fused operators for Graviton4’s NEON SIMD instructions. The team also adopted PyTorch 2.3’s quantized training features to reduce model size by 60% without accuracy loss.
- Infrastructure Migration: Deployed models on AWS Graviton4 (c8g.16xlarge) instances using TorchServe 0.9.0, containerized via Docker images optimized for Arm architectures. Integrated with Netflix’s existing Spinnaker deployment pipeline for canary rollouts.
- Profiling and Tuning: Used PyTorch Profiler 2.3 to identify kernel bottlenecks, working with AWS and PyTorch core teams to upstream Graviton4-specific optimizations for embedding lookups, a critical operation for recommendation models.
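The data-pipeline step above can be sketched with a standard PyTorch `Dataset`/`DataLoader` pair. This is a minimal, self-contained stand-in: random tensors replace Netflix’s Parquet shards, and every name, shape, and parameter here is illustrative rather than taken from the actual system.

```python
# Sketch of a user-interaction Dataset plus a batched DataLoader,
# illustrating the kind of pipeline described above. In production the
# Dataset would lazily read Parquet shards (e.g. via pyarrow); here we
# substitute synthetic features and binary labels.
import torch
from torch.utils.data import Dataset, DataLoader

class InteractionDataset(Dataset):
    """Synthetic stand-in for a sharded user-interaction table."""
    def __init__(self, num_rows: int = 1024, num_features: int = 16):
        self.features = torch.randn(num_rows, num_features)
        self.labels = torch.randint(0, 2, (num_rows,)).float()

    def __len__(self) -> int:
        return self.features.shape[0]

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

loader = DataLoader(
    InteractionDataset(),
    batch_size=256,
    shuffle=True,
    num_workers=0,    # >0 in production for parallel decoding
    pin_memory=False, # True when feeding a GPU
)

features, labels = next(iter(loader))
print(features.shape)  # torch.Size([256, 16])
```

In a real deployment the preprocessing win comes from overlapping decode and training via worker processes; the single-process settings here just keep the sketch deterministic and dependency-free.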
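The model-architecture step combines two techniques: compiling with `torch.compile`’s Inductor backend and shrinking the model via quantization. A minimal sketch follows, assuming a toy two-tower-style scorer (the architecture, sizes, and names are invented for illustration, not Netflix’s model). Post-training dynamic quantization stands in here for the quantized-training workflow described above.

```python
# Sketch: compile a small scoring model with the Inductor backend, then
# quantize its Linear layers to int8 with dynamic quantization.
import torch
import torch.nn as nn

class Scorer(nn.Module):
    """Toy stand-in: item embedding + MLP over a concatenated user vector."""
    def __init__(self, num_items: int = 1000, dim: int = 32):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, user_vec: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        item_vec = self.item_emb(item_ids)
        return self.mlp(torch.cat([user_vec, item_vec], dim=-1)).squeeze(-1)

model = Scorer().eval()

# Compilation is lazy: kernels are generated and fused on the first call.
# On Arm, Inductor's generated code targets NEON SIMD.
compiled = torch.compile(model, backend="inductor")

# Dynamic quantization rewrites Linear layers to use int8 weights.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

user_vec = torch.randn(8, 32)
item_ids = torch.randint(0, 1000, (8,))
with torch.no_grad():
    scores = quantized(user_vec, item_ids)
print(scores.shape)  # torch.Size([8])
```

Note that embeddings stay in float here; in practice, embedding tables are often the larger quantization target for recommendation models, but that requires a different workflow than this sketch shows.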
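The profiling step can be illustrated with `torch.profiler` on an embedding-lookup-heavy operation, the bottleneck class called out above. Table sizes and batch shapes below are arbitrary placeholders.

```python
# Sketch: profile an EmbeddingBag lookup with torch.profiler and print
# the top ops by CPU time, the kind of view used to spot kernel bottlenecks.
import torch
from torch.profiler import profile, ProfilerActivity

emb = torch.nn.EmbeddingBag(100_000, 64, mode="mean")
ids = torch.randint(0, 100_000, (4096,))
offsets = torch.arange(0, 4096, 32)  # 128 bags of 32 ids each

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        out = emb(ids, offsets)

# Aggregate per-op statistics; embedding lookups typically dominate here.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
print(out.shape)  # torch.Size([128, 64])
```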
Results and Benchmarks
Post-migration, the team measured improvements across three key metrics:
- Accuracy: 37% lift in NDCG@10 (Normalized Discounted Cumulative Gain) for top-of-homepage recommendations, the primary metric for user engagement. This translated to a 12% increase in content discovery rates in A/B tests.
- Cost Efficiency: 22% reduction in inference infrastructure costs, as Graviton4 delivered 1.8x higher throughput per dollar than previous Xeon-based instances.
- Latency: 40% reduction in P99 inference latency, eliminating peak-time latency spikes that previously affected 5% of users.
Benchmark tests showed PyTorch 2.3 on Graviton4 delivered 2.1x faster training throughput for large embedding models compared to PyTorch 1.13 on Xeon, cutting model retraining time from 48 hours to 23 hours.
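Throughput comparisons like the ones above boil down to a simple harness: warm up, time N forward passes, and report examples per second. Here is a minimal sketch with a placeholder model (not the benchmark Netflix ran, and the numbers it prints depend entirely on the host).

```python
# Sketch of a throughput measurement: time repeated forward passes and
# report examples/sec. Warm-up matters because torch.compile'd functions
# compile on their first call, which would otherwise skew the timing.
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 512), torch.nn.ReLU(), torch.nn.Linear(512, 1)
).eval()
batch = torch.randn(512, 256)

def throughput(fn, batch: torch.Tensor, iters: int = 20) -> float:
    """Examples/sec over `iters` timed forward passes, after one warm-up."""
    with torch.no_grad():
        fn(batch)  # warm-up (triggers compilation for compiled fns)
        start = time.perf_counter()
        for _ in range(iters):
            fn(batch)
        elapsed = time.perf_counter() - start
    return iters * batch.shape[0] / elapsed

eager_tps = throughput(model, batch)
print(f"eager: {eager_tps:,.0f} examples/sec")
```

The same function can be pointed at `torch.compile(model)` to get an apples-to-apples eager-vs-compiled comparison on a given host.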
Challenges and Resolutions
The migration faced two major hurdles:
- Compatibility: legacy custom CUDA kernels (unused on Arm) initially broke the port, requiring roughly 12% of model code to be rewritten against PyTorch-native operators.
- Stability: early torch.compile builds were occasionally unstable on Arm; the team resolved this by pinning to PyTorch 2.3’s production-ready release and enabling fallback to eager mode for unsupported operators.
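The eager-mode fallback mentioned above is the default `torch.compile` behavior when `fullgraph=False`: unsupported constructs cause a graph break, and that region simply runs eagerly instead of failing. A tiny sketch (using the dependency-free `"eager"` debug backend so the example runs without a native compiler; production used Inductor):

```python
# Sketch: with fullgraph=False (the default), torch.compile graph-breaks
# on constructs it cannot trace (here, a data-dependent Python branch)
# and falls back to eager execution for that region.
import torch

def score(x: torch.Tensor) -> torch.Tensor:
    if x.sum() > 0:  # data-dependent branch -> graph break, eager fallback
        return torch.relu(x)
    return x * 0.5

compiled = torch.compile(score, backend="eager", fullgraph=False)
out = compiled(torch.ones(4))
print(out)  # tensor([1., 1., 1., 1.])
```

Setting `fullgraph=True` instead would raise an error at the graph break, which is useful in CI for catching regressions once a model is known to compile cleanly.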
Lessons Learned
- PyTorch 2.3’s torch.compile delivers consistent gains on Arm architectures when paired with Graviton4’s Neoverse V2 cores, but requires thorough profiling to unlock maximum performance.
- AWS Graviton4 is a cost-effective fit for memory-heavy recommendation workloads, with embedding lookup performance 30% higher than comparable x86 instances.
- Canary rollouts with A/B testing are critical for ML migrations, as accuracy gains can vary across user segments.
Future Roadmap
Netflix plans to migrate 100% of its recommendation workloads to Graviton4 by end of 2024, and will adopt PyTorch 2.4’s new sparse tensor optimizations to further improve accuracy for long-tail content recommendations. The team also aims to open-source its Graviton4-specific PyTorch profiling tools for the community.