Archives — Distributed Readings

AWS ↗

Cyber resilience on AWS: A reference approach for recovery from ransomware and destructive events

How to design systems that can recover from ransomware and destructive cyberattacks when backups, credentials, and infrastructure components have been compromised.

security storage-systems

4 min

Cloudflare ↗

Announcing Claude Compliance API support with Cloudflare CASB

Security teams needed visibility into, and compliance monitoring of, Claude Enterprise API usage across their organization without leaving their existing security infrastructure.

security api-design

3 min

Cloudflare ↗

Project Glasswing: what Mythos showed us

Determining whether security-focused LLMs can effectively identify vulnerabilities in live production infrastructure code at scale.

security ml-systems

4 min

Dropbox ↗

Introducing Nova, our internal platform for coding agents

Enabling engineers to run multiple concurrent coding sessions and integrating AI agents into automated internal workflows at scale.

microservices api-design

3 min

Google ↗

Agents CLI in Agent Platform: create to production in one CLI

Developers face high context overhead and token waste when scaffolding AI agents locally and struggle to bridge the gap between development environments and production-grade deployment on Google Cloud.

api-design microservices

5 min

Google ↗

Announcing Genkit Middleware: Intercept, extend, and harden your agentic apps

Developers need a way to reliably control, monitor, and extend AI model generation calls in production agentic applications without modifying core business logic.

api-design ml-systems

5 min

Google ↗

Production-Ready AI Agents: 5 Lessons from Refactoring a Monolith

Converting a brittle, monolithic sales research AI prototype into a production-ready agent that eliminates silent failures, fragile parsing, and missing observability.

microservices observability

5 min

Google Cloud ↗

Five must-have guides to move agents into production with Gemini Enterprise Agent Platform

Deploying and managing AI agents at scale in production requires infrastructure for state management, security governance, and complex workflow orchestration that goes beyond demo implementations.

distributed-systems security

5 min

Google Cloud ↗

From keynote to the terminal: Join our Next ’26 developer livestreams

Google Cloud needed to bridge the gap between high-level keynote announcements and practical implementation details that developers could immediately apply.

general observability

5 min

Google Cloud ↗

How BASF manages thousands of supply chain decisions with AlphaEvolve’s agentic algorithms

BASF needed to manage and optimize thousands of interdependent supply chain decisions across 180 global production sites where weather and regulatory changes can cause cascading disruptions in a two-year production pipeline.

distributed-systems ml-systems

5 min

Google Cloud ↗

Introducing Gemini Enterprise Agent Platform, powering the next wave of agents

Building safe, reliable, and autonomous agents that can act independently across multiple enterprise systems while maintaining security, governance, and reliability guardrails.

ml-systems security

5 min

Google Cloud ↗

Next '26 Hands-On: 10 Codelabs to Build Featured Tech

How to help developers transition from understanding AI concepts to building and maintaining production agentic systems in cloud environments.

observability microservices

5 min

Google Cloud ↗

Next ’26: Redefining security for the AI era with Google Cloud and Wiz

Organizations need to secure their AI systems and infrastructure against emerging AI-era threats while maintaining the ability to leverage AI's potential at scale.

security distributed-systems

5 min

Google Cloud ↗

Shipping features to production just got easier with new feature flags in AppLifecycle Manager

Development teams struggle to safely deploy code to production while managing the risk of releasing features to all users simultaneously, especially as AI accelerates code generation faster than safe deployment practices can keep up.

devops observability

5 min

Google Cloud ↗

What’s new with the Cross-Cloud Network at Next ’26

Enabling seamless connectivity, governance, and security across multi-agent AI systems and core applications distributed worldwide.

distributed-systems microservices

5 min

Spotify ↗

Better Experiments with LLM Evals — A funnel, not a fork

Efficiently evaluating and validating LLM-generated outputs at scale during experimentation without manual review bottlenecks.

ml-systems observability

4 min

AWS ↗

Streaming CloudWatch metrics to VPC-based OpenTelemetry collectors using Lambda

Streaming CloudWatch metrics to internal VPC-based OpenTelemetry collectors without exposing them to the internet.

observability serverless

4 min

Airbnb ↗

Viaduct 1.0 and the future of Airbnb’s data mesh

Airbnb needed to transition Viaduct from an internal-only data mesh tool to a production-ready, community-driven platform with a stable public API.

api-design distributed-systems

5 min

Cloudflare ↗

Browser Run: now running on Cloudflare Containers, it’s faster and more scalable

Browser Run, Cloudflare's browser automation service, needed higher usage limits, better performance, and improved reliability, along with faster development velocity.

distributed-systems load-balancing

3 min

Cloudflare ↗

Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

A partitioning change to a petabyte-scale ClickHouse cluster caused billing pipeline jobs to stall without obvious error signals in standard metrics.

databases observability

4 min

Meta ↗

Migrating Data Ingestion Systems at Meta Scale

Meta needed to migrate their legacy data ingestion system to a new architecture while maintaining reliability and consistency for real-time social graph snapshots at massive scale.

distributed-systems storage-systems

5 min

Airbnb ↗

Monitoring reliably at scale

Designing monitoring and observability systems that remain functional and reliable even when the core infrastructure they monitor is failing or degraded.

observability distributed-systems

5 min

Cloudflare ↗

How Cloudflare responded to the “Copy Fail” Linux vulnerability

Rapidly detecting, investigating, and mitigating a critical Linux kernel privilege escalation vulnerability across a global edge computing fleet without impacting customers.

security distributed-systems

4 min

Cloudflare ↗

When DNSSEC goes wrong: how we responded to the .de TLD outage

When DENIC published invalid DNSSEC signatures for the .de TLD, DNS resolvers like 1.1.1.1 faced a critical decision: reject all .de domain queries due to signature validation failures or serve potentially stale cached responses to maintain availability.

caching distributed-systems

4 min

Netflix ↗

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

Netflix needed to manage the lifecycle of machine learning models across multiple domains and teams at scale, moving beyond their original single-domain personalization focus.

ml-systems microservices

5 min

Netflix ↗

Evaluating Netflix Show Synopses with LLM-as-a-Judge

Netflix needed to automatically evaluate the quality and relevance of show synopses at scale to improve member discovery and engagement.

ml-systems api-design

5 min

Netflix ↗

Scaling Camera File Processing at Netflix

Netflix needed to build a scalable, flexible media file processing pipeline that could handle diverse camera formats, workflows, and production requirements while maintaining quick turnaround times for global content production.

microservices distributed-systems

5 min

Netflix ↗

Smarter Live Streaming at Scale: Rolling Out VBR for All Netflix Live Events

Netflix needed to optimize bandwidth utilization and video quality for live streaming events at global scale by moving from constant bitrate to variable bitrate encoding.

real-time-systems distributed-systems

5 min

Netflix ↗

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale

Netflix needed to build reliable operations infrastructure to support live streaming at massive scale, going from one show per month to nine shows per day with tens of millions of concurrent viewers.

microservices observability

5 min

Spotify ↗

Background Coding Agents: Supercharging Downstream Consumer Dataset Migrations (Honk, Part 4)

Spotify needed to migrate thousands of downstream datasets when source datasets changed structure, without manually updating each consumer application.

data-pipelines microservices

4 min

Stripe ↗

10 things we learned building for the first generation of agentic commerce

Building reliable payment and commerce systems that can handle autonomous AI agents as buyers, which introduce new failure modes and consistency requirements not present in traditional e-commerce.

api-design distributed-systems

4 min

Stripe ↗

Analyzing first-party fraud trends: Account, free trial, and refund abuse

Detecting and preventing first-party fraud at scale across a payment network where legitimate users abuse policies through multiple accounts, free trial cycling, and refund exploitation.

ml-systems security

4 min

Stripe ↗

How agents, digital wallets, and trust are rewriting checkout

Understanding and optimizing the checkout conversion funnel across diverse ecommerce businesses to identify what drives successful transactions in modern online payment flows.

api-design real-time-systems

4 min

Stripe ↗

Testing the impact of Adaptive Pricing across 1.5M subscription checkout sessions

How to automatically localize subscription pricing across 150+ countries while measuring the business impact of dynamic pricing on conversion and lifetime value.

api-design observability

4 min

Airbnb ↗

Skipper: Building Airbnb’s embedded workflow engine

How to build a durable workflow execution engine that can recover from failures mid-process without losing state or duplicating work.

distributed-systems databases

5 min

Cloudflare ↗

Code Orange: Fail Small is complete. The result is a stronger Cloudflare network

Cloudflare needed to make their global edge infrastructure more resilient to configuration changes and prevent widespread outages caused by unsafe deployments.

distributed-systems observability

4 min

Cloudflare ↗

Shutdowns, power outages, and conflict: a review of Q1 2026 Internet disruptions

How to measure, analyze, and publicly report on Internet disruptions caused by geopolitical events, infrastructure attacks, and power outages in real time across global networks.

observability distributed-systems

4 min

AWS ↗

Real-time analytics: Oldcastle integrates Infor with Amazon Aurora and Amazon Quick Sight

Oldcastle needed to overcome the limitations of traditional ERP reporting to enable real-time analytics and dashboards for their Infor ERP system.

databases real-time-systems

5 min

Airbnb ↗

Building a fault-tolerant metrics storage system at Airbnb

Building a metrics storage system capable of ingesting 50 million samples per second while reliably storing 2.5 petabytes of time series data.

observability storage-systems

5 min

Cloudflare ↗

Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen

Rust panics in Cloudflare Workers were fatal and poisoned the entire worker instance, making applications unreliable when unhandled errors occurred.

security observability

4 min

Cloudflare ↗

Orchestrating AI Code Review at scale

Cloudflare needed to scale code review processes across their engineering organization while maintaining code quality and security standards without overwhelming human reviewers.

ml-systems api-design

3 min

Cloudflare ↗

The AI engineering stack we built internally — on the platform we ship

Cloudflare needed to build an internal AI engineering stack that could handle massive scale (20 million requests, 241 billion tokens) while dogfooding their own platform products.

api-design ml-systems

4 min

Cloudflare ↗

Agents Week: network performance update

Cloudflare needed to improve request handling performance across its global network to maintain competitive advantage over other CDNs.

distributed-systems load-balancing

4 min

Cloudflare ↗

Browser Run: give your agents a browser

AI agents needed a way to interact with browsers at scale while maintaining visibility and control over automated actions, requiring higher concurrency and real-time debugging capabilities.

real-time-systems ml-systems

3 min

Cloudflare ↗

Building the foundation for running extra-large language models

How to efficiently run inference for extra-large language models on edge infrastructure while maintaining low latency and high throughput across distributed Cloudflare servers.

ml-systems distributed-systems

4 min

Cloudflare ↗

Introducing Agent Lee - a new interface to the Cloudflare stack

Users had to manually navigate multiple tabs and interfaces within the Cloudflare dashboard to troubleshoot issues and manage their infrastructure, creating friction in the workflow.

api-design security

4 min

Cloudflare ↗

Introducing the Agent Readiness score. Is your site agent-ready?

Website owners needed a way to measure and understand how well their sites support AI agents and web crawlers for indexing and integration.

api-design observability

4 min

Meta ↗

Capacity Efficiency at Meta: How Unified AI Agents Optimize Performance at Hyperscale

Meta needed to automatically identify and remediate performance inefficiencies across their massive infrastructure to reduce power consumption and free up engineering capacity.

observability distributed-systems

5 min

Airbnb ↗

Building a high-volume metrics pipeline with OpenTelemetry and vmagent

Migrating a large-scale metrics pipeline from StatsD to OpenTelemetry while handling production traffic volumes without losing data or blocking dependent systems.

observability distributed-systems

5 min

Meta ↗

How Meta Used AI to Map Tribal Knowledge in Large-Scale Data Pipelines

AI coding assistants were ineffective at making useful edits in large-scale data pipelines because they lacked sufficient understanding of complex, multi-repository codebases spanning multiple languages and thousands of files.

distributed-systems ml-systems

5 min

Meta ↗

Trust But Canary: Configuration Safety at Scale

Safely deploying configuration changes at scale while minimizing the risk of widespread failures caused by faulty configurations.

observability distributed-systems

5 min

AWS ↗

Automate safety monitoring with computer vision and generative AI

Detecting safety hazards in real time across hundreds of distributed operational sites using video feeds while maintaining low latency and managing the computational complexity of processing multiple camera streams.

real-time-systems distributed-systems

5 min

AWS ↗

How Generali Malaysia optimizes operations with Amazon EKS

Generali Malaysia needed to optimize Kubernetes operations on AWS while reducing operational overhead, managing costs, and improving security posture.

distributed-systems security

4 min

Airbnb ↗

What COVID did to our forecasting models (and what we built to handle the next shock)

Building forecasting models that remain accurate during sudden market shocks like a global pandemic, where historical data no longer predicts future outcomes.

ml-systems observability

5 min

Cloudflare ↗

A one-line Kubernetes fix that saved 600 hours a year

Cloudflare's Atlantis instance took 30 minutes to restart due to a Kubernetes volume permission bottleneck.

observability storage-systems

4 min

Cloudflare ↗

Cloudflare Client-Side Security: smarter detection, now open to everyone

Detecting sophisticated client-side security threats like zero-day exploits in real time across millions of requests while minimizing false positives.

security ml-systems

4 min

Cloudflare ↗

Our ongoing commitment to privacy for the 1.1.1.1 public DNS resolver

How to design a public DNS resolver that prioritizes user privacy while maintaining performance and trustworthiness at scale.

security distributed-systems

4 min

Dropbox ↗

Improving storage efficiency in Magic Pocket, our immutable blob store

Dropbox needed to improve storage efficiency and resilience in Magic Pocket, their immutable blob store, when handling variable and changing workloads.

storage-systems observability

3 min

Dropbox ↗

Reducing our monorepo size to improve developer velocity

Monorepo growth was causing increased build times, slower dependency resolution, and reduced developer velocity as the codebase expanded.

general observability

3 min

Meta ↗

KernelEvolve: How Meta’s Ranking Engineer Agent Optimizes AI Infrastructure

Meta needed to automatically optimize low-level infrastructure and kernel-level parameters for AI ranking models to improve performance without manual tuning.

ml-systems distributed-systems

5 min

Meta ↗

Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads

Meta needed to scale their ads ranking models to LLM-scale complexity and size while maintaining inference latency requirements for real-time ad serving.

ml-systems real-time-systems

5 min

LinkedIn ↗

Introducing Northguard and Xinfra: scalable log storage at LinkedIn

LinkedIn's logging infrastructure couldn't scale cost-effectively to handle the massive volume of operational logs across thousands of services.

observability storage-systems

3 min

AWS ↗

6,000 AWS accounts, three people, one platform: Lessons learned

Managing 6,000 AWS accounts for a multi-tenant serverless SaaS platform with only three people created massive operational challenges around automation, observability, and cost management at scale.

distributed-systems microservices

4 min

AWS ↗

AI-powered event response for Amazon EKS

Responding to operational events in Amazon EKS clusters is often manual, slow, and requires deep expertise, making it difficult to handle incidents at scale across complex Kubernetes environments.

observability ml-systems

3 min

AWS ↗

Architecting conversational observability for cloud applications

Diagnosing and resolving issues in complex Kubernetes clusters is slow and requires expert knowledge, leading to high Mean Time to Recovery (MTTR) and heavy reliance on specialized engineers for root cause analysis.

observability ml-systems

4 min

Airbnb ↗

From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership

Airbnb's reliance on multiple third-party observability vendors resulted in inconsistent data, fragmented developer experiences, and limitations in cost-effectiveness and reliability at their scale.

observability microservices

5 min

Airbnb ↗

It Wasn’t a Culture Problem: Upleveling Alert Development at Airbnb

Airbnb's Observability as Code alert development process had excessively long development cycles (weeks) due to cumbersome code review workflows, slowing down engineers' ability to create and iterate on alerts at scale across thousands of services.

observability microservices

5 min

Cloudflare ↗

Building a security overview dashboard for actionable insights

Security teams were overwhelmed by the volume of raw security data across Cloudflare's platform, making it difficult to prioritize and act on vulnerabilities and threats efficiently.

security observability

3 min

Cloudflare ↗

Investigating multi-vector attacks in Log Explorer

Security teams lacked a unified view across multiple Cloudflare datasets, making it difficult to identify and investigate multi-vector attacks that span different attack surfaces and log sources.

observability security

3 min

Meta ↗

The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It

Agentic (AI-driven) software development produces and ships code so fast that traditional testing frameworks cannot keep pace, leaving bugs uncaught as they land in rapidly evolving codebases.

ml-systems observability

5 min