AWS

Cyber resilience on AWS: A reference approach for recovery from ransomware and destructive events

How to design systems that can recover from ransomware and destructive cyberattacks when backups, credentials, and infrastructure components have been compromised.

security storage-systems
4 min
AWS

How ALS GeoAnalytics LITHOLENS ™ revolutionizes core logging through machine learning with Amazon EKS

ALS GeoAnalytics needed to scale machine learning model training and inference for core logging analysis while managing computational costs effectively.

distributed-systems ml-systems
3 min
Airbnb

Scaling Airbnb’s identity graph with a unified knowledge graph infrastructure

Airbnb needed to scale their identity graph infrastructure to efficiently resolve user identities and understand relationships between entities across their platform.

databases distributed-systems
5 min
Cloudflare

Announcing Claude Managed Agents on Cloudflare

Enabling developers to deploy and scale autonomous agent workflows globally while maintaining security isolation and control over access to private backend systems.

distributed-systems security
4 min
Google

An important update: Transitioning Gemini CLI to Antigravity CLI

Google needed to unify fragmented AI terminal tooling by consolidating the community-focused Gemini CLI into a more scalable, agent-first platform capable of handling complex multi-agent workflows.

api-design microservices
5 min
Google

Build Long-running AI agents that pause, resume, and never lose context with ADK

Building production-grade AI agents that can maintain context and state across long-running enterprise workflows spanning days or weeks without losing information during idle periods or server restarts.

api-design distributed-systems
5 min
Google

Empowering Service Providers and Hardware Partners with Gemini for Home

How can Google enable third-party service providers and hardware manufacturers to build intelligent smart home experiences without requiring deep AI/ML expertise or significant R&D investment?

api-design ml-systems
5 min
Google

MaxText Expands Post-Training Capabilities: Introducing SFT and RL on Single-Host TPUs

Enabling efficient post-training of large language models on single-host TPU configurations without requiring complex multi-host distributed setups.

ml-systems distributed-systems
5 min
Google

One Year of Innovation: Celebrating 100k Members in the Google Cloud x NVIDIA Developer Community

Developers needed accessible infrastructure, resources, and structured learning pathways to effectively build and optimize AI applications using GPUs and large language models at scale.

api-design ml-systems
5 min
Google

Speeding Up AI: Bringing Google Colossus to PyTorch via GCSFS and Rapid Bucket

AI training pipelines were bottlenecked by slow data I/O when accessing training datasets stored in Google Cloud, limiting throughput and increasing total training time.

storage-systems ml-systems
5 min
Google Cloud

Agent Factory Recap: How Gemma 4 Taught Itself Physics

How to deploy high-intelligence AI models with agentic capabilities to consumer hardware and mobile devices without requiring cloud infrastructure.

ml-systems distributed-systems
5 min
Google Cloud

Building Event-Driven Data Agents with BigQuery, Pub/Sub, and ADK

Enterprise systems need to react to events in real-time rather than relying on slow batch jobs or inefficient polling microservices that create dangerous delays in detecting critical issues like fraud or supply chain disruptions.

real-time-systems messaging-queues
5 min
Google Cloud

Cloud Engineer’s AI Toolkit: Sign up Now for a Developer Workshop Near You!

Organizations need to securely build, deploy, and govern autonomous AI agents at enterprise scale as the industry transitions from experimental LLMs to production agentic AI systems.

ml-systems security
5 min
Google Cloud

Five must-have guides to move agents into production with Gemini Enterprise Agent Platform

Deploying and managing AI agents at scale in production requires infrastructure for state management, security governance, and complex workflow orchestration that goes beyond demo implementations.

distributed-systems security
5 min
Google Cloud

How BASF manages thousands of supply chain decisions with AlphaEvolve’s agentic algorithms

BASF needed to manage and optimize thousands of interdependent supply chain decisions across 180 global production sites where weather and regulatory changes can cause cascading disruptions in a two-year production pipeline.

distributed-systems ml-systems
5 min
Google Cloud

Introducing Gemini Enterprise Agent Platform, powering the next wave of agents

Building safe, reliable, and autonomous agents that can act independently across multiple enterprise systems while maintaining security, governance, and reliability guardrails.

ml-systems security
5 min
Google Cloud

Migrating to Google Cloud’s Application Load Balancer: A practical guide

Migrating business-critical load balancer configurations from on-premises hardware solutions to Google Cloud while preserving existing traffic manipulation logic.

load-balancing distributed-systems
5 min
Google Cloud

Next '26 Hands-On: 10 Codelabs to Build Featured Tech

How to help developers transition from understanding AI concepts to building and maintaining production agentic systems in cloud environments.

observability microservices
5 min
Google Cloud

Next ‘26: Redefining security for the AI era with Google Cloud and Wiz

Organizations need to secure their AI systems and infrastructure against emerging AI-era threats while maintaining the ability to leverage AI's potential at scale.

security distributed-systems
5 min
Google Cloud

What’s new with the Cross-Cloud Network at Next ‘26

Enabling seamless connectivity, governance, and security across multi-agent AI systems and core applications distributed globally at planet scale.

distributed-systems microservices
5 min
AWS

Building hybrid multi-tenant architecture for stateful services on AWS

Building a multi-tenant architecture that isolates tenants without requiring separate AWS accounts while maintaining stateful service deployments.

load-balancing distributed-systems
5 min
AWS

Choosing between single or multiple organizations in AWS Organizations

Organizations must determine whether to operate under a single AWS organization or split into multiple organizations based on their operational, security, and scaling requirements.

security distributed-systems
4 min
Airbnb

Viaduct 1.0 and the future of Airbnb’s data mesh

Airbnb needed to transition Viaduct from an internal-only data mesh tool to a production-ready, community-driven platform with a stable public API.

api-design distributed-systems
5 min
Cloudflare

Browser Run: now running on Cloudflare Containers, it’s faster and more scalable

Browser Run needed higher usage limits, better performance, and improved reliability while increasing development velocity for their browser automation service.

distributed-systems load-balancing
3 min
Cloudflare

Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

A partitioning change to a petabyte-scale ClickHouse cluster caused billing pipeline jobs to stall without obvious error signals in standard metrics.

databases observability
4 min
Meta

Labyrinth 1.1: Making End-to-End Encrypted Backups Even More Reliable

Ensuring end-to-end encrypted messages and conversation history survive device loss, device switches, and extended offline periods without compromising encryption guarantees.

storage-systems security
5 min
Meta

Migrating Data Ingestion Systems at Meta Scale

Meta needed to migrate their legacy data ingestion system to a new architecture while maintaining reliability and consistency for real-time social graph snapshots at massive scale.

distributed-systems storage-systems
5 min
Meta

Reel Friends: Building Social Discovery that Scales to Billions

Building a social discovery system that efficiently surfaces Reels watched and reacted to by friends while scaling to billions of users.

caching distributed-systems
5 min
Stripe

Five vertical SaaS insights from Sessions 2026

Vertical SaaS platforms needed to expand their service offerings beyond pure software to include integrated payments, financial services, and agentic commerce capabilities to build more defensible and durable businesses.

api-design distributed-systems
3 min
Airbnb

Monitoring reliably at scale

Designing monitoring and observability systems that remain functional and reliable even when the core infrastructure they monitor is failing or degraded.

observability distributed-systems
5 min
Cloudflare

How Cloudflare responded to the “Copy Fail” Linux vulnerability

Rapidly detect, investigate, and mitigate a critical Linux kernel privilege escalation vulnerability across a global edge computing fleet without impacting customers.

security distributed-systems
4 min
Cloudflare

When DNSSEC goes wrong: how we responded to the .de TLD outage

When DENIC published invalid DNSSEC signatures for the .de TLD, DNS resolvers like 1.1.1.1 faced a critical decision: reject all .de domain queries due to signature validation failures or serve potentially stale cached responses to maintain availability.

caching distributed-systems
4 min
Netflix

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

Netflix needed to manage the lifecycle of machine learning models across multiple domains and teams at scale, moving beyond their original single-domain personalization focus.

ml-systems microservices
5 min
Netflix

Powering Multimodal Intelligence for Video Search

Netflix needed to efficiently extract and surface key moments from hundreds or thousands of hours of raw video footage for editorial teams to accelerate the creative content production process.

ml-systems search
5 min
Netflix

Scaling Camera File Processing at Netflix

Netflix needed to build a scalable, flexible media file processing pipeline that could handle diverse camera formats, workflows, and production requirements while maintaining quick turnaround times for global content production.

microservices distributed-systems
5 min
Netflix

Smarter Live Streaming at Scale: Rolling Out VBR for All Netflix Live Events

Netflix needed to optimize bandwidth utilization and video quality for live streaming events at global scale by moving from constant bitrate to variable bitrate encoding.

real-time-systems distributed-systems
5 min
Netflix

State of Routing in Model Serving

Netflix needed to design a domain-independent traffic routing system for their ML model serving infrastructure that could handle personalized experiences at scale across multiple domains while maintaining high availability.

microservices load-balancing
5 min
Netflix

Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale

Query performance degradation at massive scale (10+ trillion rows, 15M events/second) where repeated identical queries were consuming excessive resources and impacting latency.

caching databases
5 min
Netflix

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale

Netflix needed to build reliable operations infrastructure to support live streaming at massive scale, going from one show per month to nine shows per day with tens of millions of concurrent viewers.

microservices observability
5 min
Spotify

Background Coding Agents: Supercharging Downstream Consumer Dataset Migrations (Honk, Part 4)

Spotify needed to migrate thousands of downstream datasets when source datasets changed structure, without manually updating each consumer application.

data-pipelines microservices
4 min
Spotify

Inside the Archive: The Tech Behind Your 2025 Wrapped Highlights

How to identify and surface the most interesting and meaningful listening moments from a year's worth of user streaming data to create personalized narrative highlights for Wrapped.

data-pipelines ml-systems
4 min
Spotify

Our Multi-Agent Architecture for Smarter Advertising

Spotify needed to optimize ad targeting and delivery at scale by coordinating multiple specialized systems to make smarter advertising decisions rather than relying on monolithic ad selection logic.

microservices ml-systems
4 min
Stripe

10 things we learned building for the first generation of agentic commerce

Building reliable payment and commerce systems that can handle autonomous AI agents as buyers, which introduce new failure modes and consistency requirements not present in traditional e-commerce.

api-design distributed-systems
4 min
Stripe

Analyzing first-party fraud trends: Account, free trial, and refund abuse

Detecting and preventing first-party fraud at scale across a payment network where legitimate users abuse policies through multiple accounts, free trial cycling, and refund exploitation.

ml-systems security
4 min
Stripe

Everything we announced at Sessions 2026

How to make payment infrastructure more programmable while maintaining reliability across a global distributed network and enabling new use cases like AI economic infrastructure.

api-design distributed-systems
3 min
Stripe

Giving agents the ability to pay

Enable autonomous agents to programmatically access payment instruments and execute transactions without requiring human intervention or direct card/account access.

api-design security
4 min
Stripe

Introducing the Machine Payments Protocol

Enable autonomous agents and machines to initiate and complete payments programmatically over the internet without requiring human intermediation.

api-design distributed-systems
4 min
AWS

Deloitte optimizes EKS environment provisioning and achieves 89% faster testing environments using Amazon EKS and vCluster

Deloitte needed to significantly reduce the time required to provision and spin up testing environments for their Kubernetes workloads.

distributed-systems microservices
3 min
Airbnb

Skipper: Building Airbnb’s embedded workflow engine

How to build a durable workflow execution engine that can recover from failures mid-process without losing state or duplicating work.

distributed-systems databases
5 min
Cloudflare

Code Orange: Fail Small is complete. The result is a stronger Cloudflare network

Cloudflare needed to make their global edge infrastructure more resilient to configuration changes and prevent widespread outages caused by unsafe deployments.

distributed-systems observability
4 min
Cloudflare

Introducing Dynamic Workflows: durable execution that follows the tenant

Enable multi-tenant platforms to execute millions of unique, durable workflows without incurring significant idle infrastructure costs.

distributed-systems microservices
4 min
Cloudflare

Post-quantum encryption for Cloudflare IPsec is generally available

Protecting IPsec communications from future quantum computing threats while maintaining current interoperability with existing infrastructure.

security distributed-systems
3 min
Cloudflare

Shutdowns, power outages, and conflict: a review of Q1 2026 Internet disruptions

How to measure, analyze, and publicly report on Internet disruptions caused by geopolitical events, infrastructure attacks, and power outages in real-time across global networks.

observability distributed-systems
4 min
Meta

How Meta Is Strengthening End-to-End Encrypted Backups

How to enable end-to-end encrypted backups for messaging applications while ensuring recovery codes remain inaccessible to Meta, cloud providers, and other third parties.

security storage-systems
5 min
AWS

PACIFIC enables multi-tenant, sovereign product carbon footprint exchange on the Catena-X data space using AWS

Enable multiple independent organizations to securely exchange Product Carbon Footprint (PCF) data within a shared data space while maintaining data sovereignty and tenant isolation.

microservices security
4 min
Airbnb

Building a fault-tolerant metrics storage system at Airbnb

Building a metrics storage system capable of ingesting 50 million samples per second while reliably storing 2.5 petabytes of time series data at scale.

observability storage-systems
5 min
Cloudflare

Building the agentic cloud: everything we launched during Agents Week 2026

How to enable developers to build and deploy AI agents at scale across a distributed edge computing network while maintaining security and providing necessary infrastructure tools.

distributed-systems security
4 min
Cloudflare

Moving past bots vs. humans

Traditional bot detection mechanisms are becoming ineffective as AI assistants and privacy proxies blur the distinction between legitimate users and automated abuse.

security api-design
4 min
Cloudflare

Agents Week: network performance update

Cloudflare needed to improve request handling performance across its global network to maintain competitive advantage over other CDNs.

distributed-systems load-balancing
4 min
Cloudflare

Artifacts: versioned storage that speaks Git

Providing agents, developers, and automations with scalable, Git-compatible versioned storage that can handle tens of millions of repositories without forcing them to manage infrastructure.

storage-systems api-design
4 min
Cloudflare

Building the foundation for running extra-large language models

How to efficiently run inference for extra-large language models on edge infrastructure while maintaining low latency and high throughput across distributed Cloudflare servers.

ml-systems distributed-systems
4 min
Cloudflare

Cloudflare Email Service: now in public beta. Ready for your agents

Enabling AI agents to send, receive, and process email natively as a multi-channel communication medium without requiring developers to build custom email infrastructure.

api-design microservices
4 min
Cloudflare

Introducing Flagship: feature flags built for the age of AI

Third-party feature flag services introduce unacceptable latency for applications requiring sub-millisecond flag evaluation at global scale.

caching distributed-systems
4 min
Cloudflare

Project Think: building the next generation of AI agents on Cloudflare

Building a scalable platform for deploying AI agents at the edge that can think, act, and persist state across distributed Cloudflare infrastructure.

distributed-systems ml-systems
3 min
Cloudflare

Rearchitecting the Workflows control plane for the agentic era

Cloudflare Workflows needed to support higher concurrency and creation rate limits to enable durable background agents at scale.

distributed-systems rate-limiting
4 min
Cloudflare

Unweight: how we compressed an LLM 22% without sacrificing quality

GPU memory bandwidth constraints were limiting LLM inference efficiency across Cloudflare's distributed edge network, requiring optimization to deliver faster and cheaper inference.

ml-systems distributed-systems
4 min
Meta

Capacity Efficiency at Meta: How Unified AI Agents Optimize Performance at Hyperscale

Meta needed to automatically identify and remediate performance inefficiencies across their massive infrastructure to reduce power consumption and free up engineering capacity.

observability distributed-systems
5 min
Meta

Post-Quantum Cryptography Migration at Meta: Framework, Lessons, and Takeaways

Meta needed to migrate its infrastructure and systems to post-quantum cryptography standards before quantum computers could break existing encryption schemes.

security distributed-systems
5 min
AWS

Build a multi-tenant configuration system with tagged storage patterns

Building a scalable multi-tenant configuration service that maintains strict tenant isolation while supporting real-time updates without cache staleness or downtime.

caching storage-systems
5 min
AWS

Unlock efficient model deployment: Simplified Inference Operator setup on Amazon SageMaker HyperPod

Simplifying the deployment and scheduling of machine learning inference workloads across multiple instances and instance types on Amazon SageMaker HyperPod.

ml-systems distributed-systems
4 min
Airbnb

Building a high-volume metrics pipeline with OpenTelemetry and vmagent

Migrating a large-scale metrics pipeline from StatsD to OpenTelemetry while handling production traffic volumes without losing data or blocking dependent systems.

observability distributed-systems
5 min
Cloudflare

500 Tbps of capacity: 16 years of scaling our global network

How to scale a global content delivery and DDoS mitigation network to handle massive throughput (500 Tbps) while maintaining capacity to protect against record-breaking attacks.

load-balancing distributed-systems
3 min
Cloudflare

Cloudflare targets 2029 for full post-quantum security

Cloudflare needed to prepare its global infrastructure and services for the threat of quantum computing attacks on current cryptographic standards before 2029.

security distributed-systems
4 min
Cloudflare

Welcome to Agents Week

How to enable AI agents to operate effectively at the edge of the internet with the security, performance, and reliability characteristics of Cloudflare's existing infrastructure.

distributed-systems security
4 min
Meta

Escaping the Fork: How Meta Modernized WebRTC Across 50+ Use Cases

Meta needed to modernize WebRTC across 50+ use cases while maintaining synchronization with upstream open-source development, avoiding the drift that typically occurs when large projects fork internally.

distributed-systems real-time-systems
5 min
Meta

How Meta Used AI to Map Tribal Knowledge in Large-Scale Data Pipelines

AI coding assistants were ineffective at making useful edits in large-scale data pipelines because they lacked sufficient understanding of complex, multi-repository codebases spanning multiple languages and thousands of files.

distributed-systems ml-systems
5 min
Meta

Trust But Canary: Configuration Safety at Scale

Safely deploying configuration changes at scale while minimizing the risk of widespread failures caused by faulty configurations.

observability distributed-systems
5 min
AWS

Automate safety monitoring with computer vision and generative AI

Detecting safety hazards in real-time across hundreds of distributed operational sites using video feeds while maintaining low latency and managing the computational complexity of processing multiple camera streams.

real-time-systems distributed-systems
5 min
AWS

How Aigen transformed agricultural robotics for sustainable farming with Amazon SageMaker AI

Aigen needed to scale machine learning pipelines across hundreds of distributed edge solar robots while managing data labeling and model training challenges in agricultural robotics.

ml-systems distributed-systems
5 min
AWS

How Generali Malaysia optimizes operations with Amazon EKS

Generali Malaysia needed to optimize Kubernetes operations on AWS while reducing operational overhead, managing costs, and improving security posture.

distributed-systems security
4 min
AWS

Streamlining access to powerful disaster recovery capabilities of AWS

Organizations need a streamlined way to protect and recover entire AWS workloads across multiple layers (data, compute, infrastructure, networking, and configuration) in the event of a disaster.

storage-systems security
5 min
Cloudflare

Introducing EmDash — the spiritual successor to WordPress that solves plugin security

WordPress plugins pose significant security risks because they run with unrestricted access to the entire system, requiring a safer plugin architecture that isolates untrusted code.

security microservices
4 min
Cloudflare

Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers

Magic Transit customers needed the ability to define and enforce custom DDoS mitigation logic for proprietary and non-standard UDP protocols without being limited to Cloudflare's pre-built detection rules.

security distributed-systems
4 min
Cloudflare

Our ongoing commitment to privacy for the 1.1.1.1 public DNS resolver

How to design a public DNS resolver that prioritizes user privacy while maintaining performance and trustworthiness at scale.

security distributed-systems
4 min
Cloudflare

Why we're rethinking cache for the AI era

CDN cache systems were designed for human traffic patterns but struggle with the distinct access patterns of AI bot traffic, which now represents over 10 billion requests per week and threatens cache efficiency.

caching distributed-systems
4 min
Dropbox

Improving storage efficiency in Magic Pocket, our immutable blob store

Dropbox needed to improve storage efficiency and resilience in Magic Pocket, their immutable blob store, when handling variable and changing workloads.

storage-systems observability
3 min
Meta

KernelEvolve: How Meta’s Ranking Engineer Agent Optimizes AI Infrastructure

Meta needed to automatically optimize low-level infrastructure and kernel-level parameters for AI ranking models to improve performance without manual tuning.

ml-systems distributed-systems
5 min
LinkedIn

AI Helping Build Better AI: How Agents Accelerate Model Experi...

Training and evaluating AI models is resource-intensive, requiring significant human effort to generate quality training data and assess model outputs.

ml-systems distributed-systems
3 min
LinkedIn

Reimagining LinkedIn’s search tech stack

LinkedIn's legacy search infrastructure couldn't scale to handle growing query volumes and evolving relevance requirements across its platform.

search distributed-systems
3 min
LinkedIn

Scaling LLM-Based ranking systems with SGLang at LinkedIn

LinkedIn's LLM-based ranking systems faced latency and throughput challenges when serving personalized results at scale.

ml-systems distributed-systems
3 min
AWS

6,000 AWS accounts, three people, one platform: Lessons learned

Managing 6,000 AWS accounts for a multi-tenant serverless SaaS platform with only three people created massive operational challenges around automation, observability, and cost management at scale.

distributed-systems microservices
4 min
AWS

Architecting conversational observability for cloud applications

Diagnosing and resolving issues in complex Kubernetes clusters is slow and requires expert knowledge, leading to high Mean Time to Recovery (MTTR) and heavy reliance on specialized engineers for root cause analysis.

observability ml-systems
4 min
AWS

Digital Transformation at Santander: How Platform Engineering is Revolutionizing Cloud Infrastructure

Santander struggled to manage cloud infrastructure supporting billions of daily transactions across 200+ critical systems, facing complexity and scalability challenges in their banking operations.

distributed-systems microservices
5 min
AWS

How BASF’s Agriculture Solutions drives traceability and climate action by tokenizing cotton value chains using Amazon Managed Blockchain

Agricultural supply chains (cotton/food) lack end-to-end traceability, making it difficult to verify sustainability claims, track climate impact, and ensure circularity across complex multi-party value chains.

distributed-systems security
4 min
AWS

How Salesforce migrated from Cluster Autoscaler to Karpenter across their fleet of 1,000 EKS clusters

Salesforce's Cluster Autoscaler could not efficiently scale and manage node provisioning across their fleet of 1,000+ EKS clusters, likely suffering from slow scaling decisions, suboptimal bin-packing, and operational complexity at massive scale.

distributed-systems load-balancing
4 min
AWS

Secure Amazon Elastic VMware Service (Amazon EVS) with AWS Network Firewall

Securing Amazon Elastic VMware Service (EVS) environments requires centralized traffic inspection across multiple VPCs, on-premises data centers, and internet egress points, which is complex to architect and implement.

security distributed-systems
4 min
AWS

Sovereign failover – Design for digital sovereignty using the AWS European Sovereign Cloud

Organizations operating under European digital sovereignty requirements need resilient failover capabilities, but regulatory constraints on data residency and governance make cross-partition (sovereign-to-commercial cloud) failover architecturally complex.

distributed-systems security
4 min
AWS

The Hidden Price Tag: Uncovering Hidden Costs in Cloud Architectures with the AWS Well-Architected Framework

Organizations migrating to or operating in the cloud encounter hidden and unexpected costs due to suboptimal architectural decisions, resource misconfigurations, and lack of adherence to cloud best practices.

distributed-systems storage-systems
5 min
Airbnb

From Static Rate Limiting to Adaptive Traffic Management in Airbnb’s Key-Value Store

Airbnb's multi-tenant key-value store (Mussel) used static rate limiting that couldn't adapt to varying traffic patterns and spikes, risking degraded performance and reliability for all tenants during surges.

rate-limiting distributed-systems
5 min
Airbnb

My Journey to Airbnb — Anna Sulkina

This article is a personal profile of a Senior Director of Engineering at Airbnb rather than a technical post addressing a specific engineering challenge. It highlights her role overseeing Application & Cloud infrastructure but does not detail a specific system problem.

distributed-systems
5 min
Airbnb

Pay As a Local

Airbnb relied primarily on card payments across 220+ global markets, but many users preferred local payment methods, causing checkout friction, reduced accessibility, and lower adoption in key markets.

api-design microservices
5 min
Airbnb

Safeguarding Dynamic Configuration Changes at Scale

Dynamic configuration changes at scale can cause widespread outages if rolled out unsafely—a single bad config update can immediately affect all services and requests without the safety net of a gradual deployment process.

distributed-systems microservices
5 min
Cloudflare

A QUICker SASE client: re-building Proxy Mode

The Cloudflare One SASE client's Proxy Mode relied on user-space TCP stacks for tunneling traffic, introducing significant overhead that limited throughput and increased latency for end users.

distributed-systems api-design
4 min
Cloudflare

Complexity is a choice. SASE migrations shouldn’t take years.

Enterprise SASE (Secure Access Service Edge) migrations traditionally take 18+ months due to architectural complexity, requiring organizations to integrate networking and security across global infrastructure.

security distributed-systems
3 min
Cloudflare

Ending the "silent drop": how Dynamic Path MTU Discovery makes the Cloudflare One Client more resilient

Tunnel layering in Cloudflare's WARP/One client caused MTU mismatches, leading to silently dropped oversized packets that degraded connectivity and resilience.

distributed-systems real-time-systems
4 min
Cloudflare

How Automatic Return Routing solves IP overlap

Enterprises connecting multiple private networks via tunnels frequently encounter overlapping IP address ranges (e.g., multiple sites using 10.0.0.0/8), making traditional routing tables unable to determine which tunnel should receive return traffic.

distributed-systems security
4 min
Cloudflare

Inside Gen 13: how we built our most powerful server yet

Cloudflare's existing server fleet could not keep pace with rapidly growing global traffic demands, requiring a new generation of hardware with significantly higher compute and network throughput.

distributed-systems load-balancing
4 min
Cloudflare

Introducing Custom Regions for precision data control

Customers needed precise control over where their data is processed geographically to meet diverse compliance requirements (e.g., GDPR, data sovereignty laws), but existing pre-defined regional options were too coarse-grained to cover all regulatory and performance needs.

distributed-systems security
4 min
Cloudflare

Launching Cloudflare’s Gen 13 servers: trading cache for cores for 2x edge compute performance

Cloudflare needed to significantly increase edge compute throughput per server but faced a tradeoff where high-core-count CPUs came with smaller per-core L3 cache, risking latency penalties for cache-dependent workloads.

distributed-systems caching
4 min
Cloudflare

Powering the agents: Workers AI now runs large models, starting with Kimi K2.5

Running large AI models for agent workloads on edge infrastructure was cost-prohibitive and required significant inference stack optimization to serve models like Kimi K2.5 efficiently at scale.

ml-systems distributed-systems
4 min
Meta

Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

Connecting thousands of GPUs across multiple data centers and regions for gigawatt-scale AI training clusters requires seamlessly bridging different network fabrics, which creates massive networking and interconnect challenges.

distributed-systems ml-systems
5 min
Meta

FFmpeg at Meta: Media Processing at Scale

Meta needed to handle massive-scale media processing (encoding, transcoding, filtering) across its family of apps, requiring efficient orchestration of complex audio/video pipelines using FFmpeg at an unprecedented scale.

storage-systems distributed-systems
5 min
Meta

Friend Bubbles: Enhancing Social Discovery on Facebook Reels

Facebook Reels needed a way to enhance social discovery by surfacing content that friends have interacted with, requiring real-time computation of relationship strength and ranking of friend-engaged content at massive scale.

ml-systems real-time-systems
5 min
Meta

Investing in Infrastructure: Meta’s Renewed Commitment to jemalloc

Meta's large-scale infrastructure relies on jemalloc for memory allocation, but the codebase had accumulated maintenance burden and needed modernization to keep pace with evolving hardware and workload demands.

storage-systems distributed-systems
5 min
Meta

RCCLX: Innovating GPU Communications on AMD Platforms

GPU-to-GPU communication performance on AMD platforms was insufficient for Meta's evolving AI model training workloads, and the standard RCCL library didn't meet the performance and flexibility requirements of their internal workloads.

distributed-systems ml-systems
5 min
Netflix

Automating RDS Postgres to Aurora Postgres Migration

Netflix's relational database ecosystem lacked standardization, with databases spread across RDS Postgres and other technologies, leading to inconsistent functionality, suboptimal performance, and higher total cost of ownership.

databases distributed-systems
5 min
Netflix

How Temporal Powers Reliable Cloud Operations at Netflix

Netflix needed reliable orchestration for business-critical cloud operations across teams like Open Connect CDN and Live reliability, but faced operational challenges as Temporal adoption grew since 2021.

distributed-systems microservices
5 min
Netflix

Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Netflix needed to spin up hundreds of containers in seconds to serve streaming traffic, but after modernizing their container runtime, they hit an unexpected performance bottleneck rooted in CPU architecture that impaired container scaling efficiency.

distributed-systems real-time-systems
5 min
Netflix

Netflix Live Origin

Netflix needed a custom origin server to bridge its cloud-based live streaming pipelines with its CDN (Open Connect), handling the unique challenges of live content delivery such as low-latency requirements, reliability, and the real-time nature of live streams compared to on-demand content.

real-time-systems distributed-systems
5 min
Netflix

Scaling Global Storytelling: Modernizing Localization Analytics at Netflix

Netflix's localization analytics infrastructure (tracking dubbing, subtitling, and translation across hundreds of languages and regions) could not keep pace with the rapidly growing scale of global content, making it difficult to derive timely insights for content localization decisions.

databases distributed-systems
5 min
Netflix

Scaling LLM Post-Training at Netflix

Generic pre-trained LLMs lack the domain-specific alignment needed for Netflix's production use cases in recommendation, personalization, and search, and the post-training pipeline to fine-tune them doesn't scale efficiently across multiple domain constraints and reliability requirements.

ml-systems distributed-systems
5 min