Netflix

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

Netflix needed to manage the lifecycle of machine learning models across multiple domains and teams at scale, moving beyond their original single-domain personalization focus.

ml-systems microservices
5 min
Netflix

Scaling ArchUnit with Nebula ArchRules

Netflix needed a way to enforce consistent architectural patterns and build standards across tens of thousands of Java repositories in their polyrepo strategy.

microservices general
5 min
Netflix

Scaling Camera File Processing at Netflix

Netflix needed to build a scalable, flexible media file processing pipeline that could handle diverse camera formats, workflows, and production requirements while maintaining quick turnaround times for global content production.

microservices distributed-systems
5 min
Netflix

State of Routing in Model Serving

Netflix needed to design a domain-independent traffic routing system for their ML model serving infrastructure that could handle personalized experiences at scale across multiple domains while maintaining high availability.

microservices load-balancing
5 min
Netflix

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale

Netflix needed to build reliable operations infrastructure to support live streaming at massive scale, going from one show per month to nine shows per day with tens of millions of concurrent viewers.

microservices observability
5 min
Netflix

How Temporal Powers Reliable Cloud Operations at Netflix

Netflix needed reliable orchestration for business-critical cloud operations across teams like Open Connect CDN and Live reliability, but faced operational challenges as Temporal adoption grew from 2021 onward.

distributed-systems microservices
5 min
Netflix

MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix

Netflix needed scalable, deep machine-level understanding of every piece of content across an expanding catalog (including live events and podcasts) to power recommendations and discovery, but building separate models per content type and modality doesn't scale.

ml-systems microservices
5 min
Netflix

Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Netflix needed to spin up hundreds of containers in seconds to serve streaming traffic, but after modernizing their container runtime, they hit an unexpected performance bottleneck rooted in CPU architecture that impaired container scaling efficiency.

distributed-systems real-time-systems
5 min
Netflix

Optimizing Recommendation Systems with JDK’s Vector API

In Netflix's Ranker service, video serendipity scoring (computing how different a title is from a user's watch history) consumed ~7.5% of total CPU per node, a significant performance bottleneck at Netflix's scale.

ml-systems real-time-systems
5 min