Meta

Friend Bubbles: Enhancing Social Discovery on Facebook Reels

Facebook Reels needed a way to enhance social discovery by surfacing content that friends have interacted with, requiring real-time computation of relationship strength and ranking of friend-engaged content at massive scale.

ml-systems real-time-systems
5 min
Meta

Ranking Engineer Agent (REA): The Autonomous AI Agent Accelerating Meta’s Ads Ranking Innovation

Meta's ads ranking ML experimentation lifecycle required extensive manual intervention from engineers for hypothesis generation, training job launches, failure debugging, and result iteration, slowing down the pace of ranking model innovation.

ml-systems microservices
5 min
Meta

How Advanced Browsing Protection Works in Messenger

Messenger needed to protect user privacy when clicking links in chats while still detecting and warning users about malicious URLs, creating a tension between link safety scanning and end-to-end privacy.

security messaging-queues
5 min
Meta

Patch Me If You Can: AI Codemods for Secure-by-Default Android Apps

Updating security-related APIs across millions of lines of code and thousands of engineers is extremely difficult at scale, especially when a single class of mobile vulnerability can be replicated across hundreds of locations in an Android codebase.

security ml-systems
5 min
Meta

FFmpeg at Meta: Media Processing at Scale

Meta needed to handle massive-scale media processing (encoding, transcoding, filtering) across its family of apps, requiring efficient orchestration of complex audio/video pipelines using FFmpeg at an unprecedented scale.

storage-systems distributed-systems
5 min
Meta

Investing in Infrastructure: Meta’s Renewed Commitment to jemalloc

Meta's large-scale infrastructure relies on jemalloc for memory allocation, but the codebase had accumulated maintenance burden and needed modernization to keep pace with evolving hardware and workload demands.

storage-systems distributed-systems
5 min
Meta

RCCLX: Innovating GPU Communications on AMD Platforms

GPU-to-GPU communication performance on AMD platforms was insufficient for Meta's evolving AI model training workloads, and the standard RCCL library didn't meet the performance and flexibility requirements of their internal workloads.

distributed-systems ml-systems
5 min
Meta

Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

Connecting thousands of GPUs across multiple data centers and regions for gigawatt-scale AI training clusters requires seamlessly bridging different network fabrics, which creates massive networking and interconnect challenges.

distributed-systems ml-systems
5 min
Meta

The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It

Agentic (AI-driven) software development produces and ships code so fast that traditional testing frameworks cannot keep pace, leaving bugs uncaught as they land in rapidly evolving codebases.

ml-systems observability
5 min