AI agents struggle to iterate rapidly on system design and codebases due to architectural patterns that limit their ability to understand, modify, and validate applications effectively.
Generali Malaysia needed to optimize Kubernetes operations on AWS while reducing operational overhead, managing costs, and improving security posture.
Building forecasting models that remain accurate during sudden market shocks like a global pandemic, where historical data no longer predicts future outcomes.
Cloudflare's Atlantis instance took 30 minutes to restart due to a Kubernetes volume permission bottleneck.
How to automatically convert TypeScript workflow code into visual step diagrams so users can understand and interact with their workflows in the dashboard.
Cloudflare's existing server fleet could not keep pace with rapidly growing global traffic demands, requiring a new generation of hardware with significantly higher compute and network throughput.
Cloudflare needed to significantly increase edge compute throughput per server but faced a tradeoff where high-core-count CPUs came with smaller per-core L3 cache, risking latency penalties for cache-dependent workloads.
How to safely execute untrusted AI-generated code with minimal latency and resource overhead.
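One common baseline for this problem is process-level isolation with hard resource caps. The sketch below is POSIX-only, with illustrative limits, and shows the general idea rather than any particular vendor's design; production sandboxes typically add namespace, seccomp, or microVM isolation (e.g. gVisor, Firecracker, V8 isolates).

```python
import resource
import subprocess
import sys

def run_untrusted(code: str, timeout_s: float = 5.0) -> str:
    """Run untrusted Python in a child process with CPU and memory caps."""

    def limit_resources():
        # Hard caps applied in the child before exec (POSIX only):
        resource.setrlimit(resource.RLIMIT_CPU, (2, 2))                   # CPU seconds
        resource.setrlimit(resource.RLIMIT_AS, (512 * 1024 * 1024,) * 2)  # 512 MiB

    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env/site
        capture_output=True,
        text=True,
        timeout=timeout_s,                   # wall-clock backstop
        preexec_fn=limit_resources,
    )
    return proc.stdout

print(run_untrusted("print(2 + 2)").strip())  # → 4
```

The trade-off the blurb hints at: stronger isolation (separate VMs) costs startup latency and memory, while lighter mechanisms like the one above trade safety margin for speed.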
Monorepo growth was causing increased build times, slower dependency resolution, and reduced developer velocity as the codebase expanded.
Responding to operational events in Amazon EKS clusters is often manual, slow, and requires deep expertise, making it difficult to handle incidents at scale across complex Kubernetes environments.
Airbnb's reliance on multiple third-party observability vendors resulted in inconsistent data, fragmented developer experiences, and limitations in cost-effectiveness and reliability at their scale.
Customers needed precise control over where their data is processed geographically to meet diverse compliance requirements (e.g., GDPR, data sovereignty laws), but existing pre-defined regional options were too coarse-grained to cover all regulatory and performance needs.
Running large AI models for agent workloads on edge infrastructure was cost-prohibitive and required significant inference stack optimization to serve models like Kimi K2.5 efficiently at scale.
Italy's 'Piracy Shield' system forces Internet infrastructure providers like Cloudflare to block content at the network level without proper oversight or due process, leading to disproportionate overblocking of legitimate content.
Manual prompt engineering for Dropbox Dash's relevance judge was unreliable, hard to measure, and costly—making it difficult to systematically improve task performance in production.
Facebook Reels needed a way to enhance social discovery by surfacing content that friends have interacted with, requiring real-time computation of relationship strength and ranking of friend-engaged content at massive scale.
Meta's ads ranking ML experimentation lifecycle required extensive manual intervention from engineers for hypothesis generation, training job launches, failure debugging, and result iteration, slowing down the pace of ranking model innovation.
Airbnb users in the early trip planning stage often lack a clear travel destination, making it difficult to provide relevant recommendations and convert exploratory browsing into bookings.
Organizations struggle to discover and secure AI-powered applications across their infrastructure, especially shadow AI deployments that teams spin up without central oversight, creating security blind spots.
Standard defensive security tools miss logic flaws and vulnerabilities in APIs because they lack understanding of stateful API interactions and business logic flows.
Traditional bot-blocking approaches are insufficient for preventing account abuse (e.g., credential stuffing, fake account creation) because sophisticated attacks increasingly involve human-like behavior or actual humans, bypassing conventional bot detection.
Security teams were overwhelmed by the volume of raw security data across Cloudflare's platform, making it difficult to prioritize and act on vulnerabilities and threats efficiently.
Enterprise SASE (Secure Access Service Edge) migrations traditionally take 18+ months due to architectural complexity, requiring organizations to integrate networking and security across global infrastructure.
Cloudflare's open-source Pingora proxy had request smuggling vulnerabilities when deployed as an ingress proxy, allowing attackers to exploit HTTP parsing discrepancies to bypass security controls and route malicious requests.
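Request smuggling of this class typically stems from ambiguous body framing. A minimal illustrative check (not Pingora's actual fix) flags the classic CL.TE/TE.CL conflict that RFC 9112 tells conservative proxies to reject outright:

```python
def smuggling_risk(headers: dict[str, str]) -> bool:
    """Flag the framing ambiguity behind CL.TE / TE.CL request smuggling.

    Per RFC 9112 §6.3, a message carrying both Transfer-Encoding and
    Content-Length is ambiguous; a conservative proxy rejects it rather
    than letting front-end and back-end parsers disagree on body length.
    """
    names = {name.strip().lower() for name in headers}
    return "transfer-encoding" in names and "content-length" in names

print(smuggling_risk({"Content-Length": "13", "Transfer-Encoding": "chunked"}))  # → True
print(smuggling_risk({"Content-Length": "13"}))                                  # → False
```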
Organizations struggle to migrate from legacy network security architectures to modern SASE (Secure Access Service Edge) solutions, facing risks from accumulated technical debt and complex dependencies in their existing infrastructure.
Security teams lacked a unified view across multiple Cloudflare datasets, making it difficult to identify and investigate multi-vector attacks that span different attack surfaces and log sources.
AI agents hitting Cloudflare error pages received heavyweight HTML responses that consumed excessive tokens and required brittle parsing, making automated error handling inefficient and costly.
Organizations struggle with Internet-facing blind spots in their attack surface, lacking continuous visibility into security gaps and risk exposures across their external-facing assets.
Messenger needed to protect user privacy when clicking links in chats while still detecting and warning users about malicious URLs, creating a tension between link safety scanning and end-to-end privacy.
Updating security-related APIs across millions of lines of code and thousands of engineers is extremely difficult at scale, especially when a single class of mobile vulnerability can be replicated across hundreds of locations in an Android codebase.
Organizations migrating to or operating in the cloud encounter hidden and unexpected costs due to suboptimal architectural decisions, resource misconfigurations, and lack of adherence to cloud best practices.
Airbnb's Observability as Code alert development process had excessively long development cycles (weeks) due to cumbersome code review workflows, slowing down engineers' ability to create and iterate on alerts at scale across thousands of services.
The Cloudflare One SASE client's Proxy Mode relied on user-space TCP stacks for tunneling traffic, introducing significant overhead that limited throughput and increased latency for end users.
Traditional WAFs force a trade-off between logging (risking missed attacks) and blocking (risking false positives), requiring extensive manual tuning to balance security coverage with availability.
Tunnel layering in Cloudflare's WARP/One client caused MTU mismatches, leading to silently dropped oversized packets that degraded connectivity and resilience.
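The effect of tunnel layering on MTU is simple arithmetic: each encapsulation layer eats header bytes from the usable payload. The overhead figures below are illustrative, not the WARP client's actual numbers:

```python
# Illustrative encapsulation overheads (bytes); real values depend on the
# tunnel protocol -- e.g. WireGuard costs 60 bytes over IPv4, 80 over IPv6.
PHYSICAL_MTU = 1500

def effective_mtu(outer_mtu: int, per_layer_overhead: list[int]) -> int:
    """Usable MTU after stacking tunnel layers: each layer eats header bytes."""
    for overhead in per_layer_overhead:
        outer_mtu -= overhead
    return outer_mtu

print(effective_mtu(PHYSICAL_MTU, [60]))      # → 1440: one tunnel
print(effective_mtu(PHYSICAL_MTU, [60, 60]))  # → 1380: tunnel inside a tunnel
```

A sender that still assumes a 1500-byte path MTU emits packets too large for the inner layer; without MTU clamping or "packet too big" signaling, those packets are dropped silently, which is the failure mode described above.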
Organizations face fragmented data security across endpoints, network traffic, cloud applications, and AI prompts, making it difficult to enforce consistent data loss prevention (DLP) policies as data flows through diverse channels including RDP sessions and AI copilots.
Enterprises connecting multiple private networks via tunnels frequently encounter overlapping IP address ranges (e.g., multiple sites using 10.0.0.0/8), making traditional routing tables unable to determine which tunnel should receive return traffic.
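The routing ambiguity can be made concrete in a few lines with the stdlib `ipaddress` module (the tunnel names are illustrative):

```python
import ipaddress

# Two customer sites that both carved their addressing out of 10.0.0.0/8:
routes = [
    (ipaddress.ip_network("10.1.0.0/16"), "tunnel-site-a"),
    (ipaddress.ip_network("10.1.0.0/16"), "tunnel-site-b"),
]

def matching_tunnels(dest: str) -> list[str]:
    """Every tunnel whose prefix contains the destination address."""
    ip = ipaddress.ip_address(dest)
    return [tunnel for network, tunnel in routes if ip in network]

print(matching_tunnels("10.1.2.3"))  # → ['tunnel-site-a', 'tunnel-site-b']
```

A destination-keyed longest-prefix-match table returns both entries, so return traffic cannot be attributed to one tunnel without extra context such as per-tunnel virtual network IDs or NAT.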
Meta needed to handle massive-scale media processing (encoding, transcoding, filtering) across its family of apps, requiring efficient orchestration of complex audio/video pipelines using FFmpeg at an unprecedented scale.
Meta's large-scale infrastructure relies on jemalloc for memory allocation, but the codebase had accumulated maintenance burden and needed modernization to keep pace with evolving hardware and workload demands.
Netflix's Ranker service had a video serendipity scoring feature (computing how different a title is from a user's watch history) consuming ~7.5% of total CPU per node, creating a significant performance bottleneck at their enormous scale.
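A scoring function of this shape makes clear why it is CPU-hungry: it runs once per candidate title, per user, over the entire watch history. The formula below is a hypothetical stand-in for the idea, not Netflix's actual scorer:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def serendipity(title_vec: list[float], history_vecs: list[list[float]]) -> float:
    """Average embedding distance between a candidate title and the watch
    history. Note the cost: one distance per history item, per candidate."""
    return sum(cosine_distance(title_vec, h) for h in history_vecs) / len(history_vecs)

history = [[1.0, 0.0], [0.9, 0.1]]                 # embeddings of watched titles
print(round(serendipity([0.0, 1.0], history), 3))  # → 0.945: very unlike the history
```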
Netflix's localization analytics infrastructure (tracking dubbing, subtitling, and translation across hundreds of languages and regions) could not keep pace with the rapidly growing scale of global content, making it difficult to derive timely insights for content localization decisions.
Managing 6,000 AWS accounts for a multi-tenant serverless SaaS platform with only three people created massive operational challenges around automation, observability, and cost management at scale.
Santander struggled to manage cloud infrastructure supporting billions of daily transactions across 200+ critical systems, facing complexity and scalability challenges in their banking operations.
Airbnb needed to advance its AI, data science, and machine learning capabilities across multiple domains (NLP, optimization, measurement science) to improve its travel and living platform, requiring solutions to challenges in search ranking, recommendation, experimentation, and large-scale data processing.
Dash's search ranking models required large volumes of high-quality labeled relevance data to train effectively, but human labeling alone was too slow and expensive to scale to the needed coverage.
GPU-to-GPU communication performance on AMD platforms was insufficient for Meta's evolving AI model training workloads, and the standard RCCL library didn't meet the performance and flexibility requirements of their internal workloads.
Netflix needed scalable, deep machine-level understanding of every piece of content across an expanding catalog (including live events and podcasts) to power recommendations and discovery, but building separate models per content type and modality doesn't scale.
Netflix needed to spin up hundreds of containers in seconds to serve streaming traffic, but after modernizing their container runtime, they hit an unexpected performance bottleneck rooted in CPU architecture that impaired container scaling efficiency.
Dynamic configuration changes at scale can cause widespread outages if rolled out unsafely—a single bad config update can immediately affect all services and requests without the safety net of a gradual deployment process.
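A common safety net for this failure mode is deterministic, hash-based percentage gating, so a config change reaches 1% of hosts before 100%. This is a generic sketch of the pattern (the salt name is made up), not any specific vendor's rollout system:

```python
import hashlib

def in_rollout(key: str, percent: float, salt: str = "cfg-change-123") -> bool:
    """Deterministically bucket a host into [0, 1) and gate on a percentage.

    The salt ties bucketing to one particular config change, so different
    changes canary on different host subsets.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < percent / 100.0

hosts = [f"host-{i}" for i in range(1000)]
canary = [h for h in hosts if in_rollout(h, 1)]  # a small slice of the fleet first
print(len(canary))  # roughly 1% of hosts; ramp to 10%, 50%, 100% as metrics stay healthy
```

Because the bucket is fixed per host, every host admitted at 1% is still admitted at 10%, giving a monotonic ramp instead of an instantaneous fleet-wide flip.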
This article is a personal profile of a Senior Director of Engineering at Airbnb rather than a technical post addressing a specific engineering challenge. It highlights her role overseeing Application & Cloud infrastructure but does not detail a specific system problem.
Running AI inference for products like Dropbox Dash at scale is expensive and resource-intensive, requiring efficient use of compute and memory to make the product accessible to a broad user base.
Engineering organizations face open questions about how to effectively integrate AI coding tools (like Claude Code and Cursor) into developer workflows and where these tools can have the most measurable impact on productivity.
Connecting thousands of GPUs across multiple data centers and regions for gigawatt-scale AI training clusters requires seamlessly bridging different network fabrics, which creates massive networking and interconnect challenges.
Agentic (AI-driven) software development produces and ships code so fast that traditional testing frameworks cannot keep pace, leaving bugs uncaught as they land in rapidly evolving codebases.
Netflix's relational database ecosystem lacked standardization, with databases spread across RDS Postgres and other technologies, leading to inconsistent functionality, suboptimal performance, and higher total cost of ownership.
Generic pre-trained LLMs lack the domain-specific alignment needed for Netflix's production use cases in recommendation, personalization, and search, and the post-training pipeline to fine-tune them doesn't scale efficiently across multiple domain constraints and reliability requirements.
Convera needed to implement fine-grained authorization for their API platform, where coarse-grained access controls were insufficient to manage complex permission requirements across API resources and actions.
The Amazon Key Suite had a tightly coupled monolithic architecture that struggled with reliability and scalability when processing millions of events under millisecond-latency requirements across multiple service integrations.
Artera needed to develop and scale an AI-powered prostate cancer diagnostic test, requiring significant compute resources for model training/inference and a reliable pipeline to deliver timely, personalized treatment recommendations.
Organizations operating under European digital sovereignty requirements need resilient failover capabilities, but regulatory constraints on data residency and governance make cross-partition (sovereign-to-commercial cloud) failover architecturally complex.
Airbnb needed to build robust data science and economic modeling capabilities to understand and optimize their two-sided marketplace dynamics for policy and business decisions.
Enterprise search and AI assistant products like Dropbox Dash need to connect disparate data sources and optimize AI-driven retrieval, but naively querying across siloed data with LLMs leads to poor relevance and brittle prompt engineering.
Netflix's Graph Search platform for federated enterprise data required users to write structured queries, limiting accessibility and ease of use despite the system being scalable and configurable.
Salesforce's Cluster Autoscaler could not efficiently scale and manage node provisioning across their fleet of 1,000+ EKS clusters, likely suffering from slow scaling decisions, suboptimal bin-packing, and operational complexity at massive scale.
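The bin-packing aspect of node provisioning can be illustrated with the classic first-fit-decreasing heuristic (CPU-only here; real autoscalers also weigh memory, pod affinity, and zone spread, so treat this strictly as a sketch):

```python
def first_fit_decreasing(pod_cpus: list[float], node_cpu: float) -> list[list[float]]:
    """Pack pod CPU requests onto the fewest nodes via first-fit decreasing,
    a classic approximation algorithm for bin packing."""
    nodes: list[list[float]] = []
    for cpu in sorted(pod_cpus, reverse=True):  # place the largest pods first
        for node in nodes:
            if sum(node) + cpu <= node_cpu:
                node.append(cpu)
                break
        else:
            nodes.append([cpu])  # no existing node fits: provision a new one
    return nodes

# Six pods onto 4-vCPU nodes: FFD finds the optimal 3-node packing here.
print(len(first_fit_decreasing([2, 2, 3, 3, 1, 1], node_cpu=4)))  # → 3
```

Poor packing at fleet scale translates directly into idle capacity and cost, which is why autoscaler placement quality matters as much as scaling speed.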
Airbnb relied primarily on card payments across 220+ global markets, but many users preferred local payment methods, causing checkout friction, reduced accessibility, and lower adoption in key markets.
Dropbox Dash needs to rank and retrieve relevant context across a user's work in real time, requiring low-latency access to precomputed and real-time features for AI-driven search and recommendation models.
Netflix needed reliable orchestration for business-critical cloud operations across teams like Open Connect CDN and Live reliability, but faced mounting operational challenges as Temporal adoption grew following its introduction in 2021.
Netflix needed a custom origin server to bridge its cloud-based live streaming pipelines with its CDN (Open Connect), handling the unique challenges of live content delivery such as low-latency requirements, reliability, and the real-time nature of live streams compared to on-demand content.
Diagnosing and resolving issues in complex Kubernetes clusters is slow and requires expert knowledge, leading to high Mean Time to Recovery (MTTR) and heavy reliance on specialized engineers for root cause analysis.
Agricultural supply chains (cotton/food) lack end-to-end traceability, making it difficult to verify sustainability claims, track climate impact, and ensure circularity across complex multi-party value chains.
The article addresses the challenge of diverse representation and perspectives in cloud architecture roles, exploring how lack of varied viewpoints can limit innovation in technical solution design.
Delivering high-quality streaming video across diverse devices and varying network conditions requires efficient video encoding; legacy codecs like H.264 and VP9 were limiting compression efficiency, consuming more bandwidth for equivalent visual quality.
Securing Amazon Elastic VMware Service (EVS) environments requires centralized traffic inspection across multiple VPCs, on-premises data centers, and internet egress points, which is complex to architect and implement.
This article is not a technical engineering blog post — it covers Dropbox's 2025 summer intern program highlights, focusing on professional growth, innovation culture, and community building rather than addressing a specific engineering challenge.
Organizations building generative AI workloads on AWS lacked comprehensive architectural guidance covering responsible AI, data architecture, and emerging patterns like agentic workflows, leading to poorly architected AI systems.
Organizations building ML workloads on AWS lacked up-to-date architectural guidance that incorporates the latest services, capabilities, and best practices, leading to sub-optimal ML system designs across reliability, performance, cost, and operational dimensions.
Organizations deploying AI/ML workloads on AWS lacked comprehensive architectural guidance for building responsible, well-architected machine learning and generative AI systems at scale.
Standard message queues process messages in FIFO order, lacking the ability to prioritize urgent messages over lower-priority ones, which can cause critical tasks to wait behind less important work during high load.
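The gap is easy to see next to a priority-aware queue. The sketch below pairs a heap with a monotonic counter so messages stay FIFO within a priority level (heaps alone are not stable); the message names are illustrative:

```python
import heapq
import itertools

class PriorityQueue:
    """Heap-backed queue: urgent messages first, FIFO within a priority level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserving arrival order

    def put(self, message, priority: int = 10):  # lower number = more urgent
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def get(self):
        return heapq.heappop(self._heap)[2]

q = PriorityQueue()
q.put("resize-thumbnail")
q.put("rebuild-index")
q.put("send-2fa-code", priority=0)  # jumps ahead of earlier bulk work
print(q.get())  # → send-2fa-code
```

In a strict FIFO queue, the urgent message above would wait behind both bulk jobs enqueued before it, which is exactly the head-of-line problem under high load.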
Enterprises adopting Amazon Bedrock need centralized governance over AI model access, including authorization controls, usage quotas, and auditing, but lack a standardized gateway pattern to enforce these policies at scale.
Dropbox Dash's AI agent struggled with effectiveness when naively providing all available context to the model, leading to degraded performance as irrelevant information diluted the signal needed for accurate, agentic AI responses.
Organizations struggle to design well-architected cloud systems that balance cost optimization, security, reliability, and performance efficiency across increasingly complex AWS environments including AI-powered workloads.
Producing valid and realistic mock data for GraphQL testing and prototyping is tedious to write and maintain; existing approaches like random value generation and field-level stubbing lack domain context, resulting in unconvincing and brittle test data that doesn't scale across a large schema.
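The contrast between random field-level stubbing and domain-aware generation can be sketched in a few lines. The generator table and field-name heuristics below are hypothetical; LLM-backed mocking takes the same idea further by reading context from the schema itself:

```python
import random

random.seed(7)  # deterministic output for the example

# Naive field-level stubbing: every scalar of a given type gets one placeholder.
def naive_mock(field: str, gql_type: str):
    return {"String": "lorem", "Int": 0, "Boolean": False}.get(gql_type)

# Slightly domain-aware stubbing, keyed on field-name hints:
GENERATORS = {
    "email": lambda: f"user{random.randint(1, 999)}@example.com",
    "price": lambda: round(random.uniform(1.0, 500.0), 2),
    "name": lambda: random.choice(["Ada", "Grace", "Edsger"]),
}

def aware_mock(field: str, gql_type: str):
    for hint, generator in GENERATORS.items():
        if hint in field.lower():
            return generator()
    return naive_mock(field, gql_type)

print(naive_mock("userEmail", "String"))  # → lorem (unconvincing for any field)
print(aware_mock("userEmail", "String"))  # e.g. user123@example.com
```

Hand-maintaining such hint tables is exactly what doesn't scale across a large schema, which motivates generating domain-plausible values automatically.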
BASF Digital Farming needed a scalable way to catalog, discover, and serve large volumes of spatiotemporal geospatial data (satellite imagery, crop data) for their xarvio crop optimization platform, and their existing infrastructure struggled with the scale and query patterns of this data.
Large machine learning models require significant memory and compute resources, making deployment and inference expensive and slow, especially in resource-constrained environments.
Dropbox Dash needed deeper understanding of multimodal content (photos and videos) across user files, but processing diverse media types at Dropbox's scale posed efficiency and architectural challenges.
Airbnb's multi-tenant key-value store (Mussel) used static rate limiting that couldn't adapt to varying traffic patterns and spikes, risking degraded performance and reliability for all tenants during surges.
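The limitation of a static limit shows up in a plain token bucket: the refill rate is fixed up front, so a burst beyond capacity is rejected even when the cluster has headroom. An adaptive scheme would adjust the rate from observed load; this sketch (illustrative numbers, not Mussel's implementation) shows only the static baseline:

```python
import time

class TokenBucket:
    """Plain token bucket with a fixed refill rate (a static rate limit)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens per second, fixed up front
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# A sudden spike of 50 requests against a bucket sized for steady traffic:
bucket = TokenBucket(rate=1, capacity=10)
allowed = sum(bucket.allow() for _ in range(50))
print(allowed)  # → 10: only the burst capacity is admitted during the spike
```

An adaptive limiter would raise `rate` while the store has spare capacity and tighten it as shared resources saturate, protecting all tenants instead of uniformly throttling them.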