Archives — Distributed Readings

Dropbox ↗

Introducing Nova, our internal platform for coding agents

Enabling engineers to run multiple concurrent coding sessions and integrating AI agents into automated internal workflows at scale.

microservices api-design

3 min

Google ↗

Agents CLI in Agent Platform: create to production in one CLI

Developers face high context overhead and token waste when scaffolding AI agents locally and struggle to bridge the gap between development environments and production-grade deployment on Google Cloud.

api-design microservices

5 min

Google ↗

An important update: Transitioning Gemini CLI to Antigravity CLI

Google needed to unify fragmented AI terminal tooling by consolidating the community-focused Gemini CLI into a more scalable, agent-first platform capable of handling complex multi-agent workflows.

api-design microservices

5 min

Google ↗

Empowering Service Providers and Hardware Partners with Gemini for Home

How can Google enable third-party service providers and hardware manufacturers to build intelligent smart home experiences without requiring deep AI/ML expertise or significant R&D investment?

api-design ml-systems

5 min

Google ↗

Production-Ready AI Agents: 5 Lessons from Refactoring a Monolith

Converting a brittle, monolithic sales research AI prototype into a production-ready agent that eliminates silent failures, fragile parsing, and lacks observability.

microservices observability

5 min

Google Cloud ↗

Create Expert Content: Deploying a Multi-Agent System with Terraform and Cloud Run

Automating the transformation of raw community signals into reliable technical guidance at scale using multiple specialized agents.

microservices api-design

5 min

Google Cloud ↗

Next '26 Hands-On: 10 Codelabs to Build Featured Tech

How to help developers transition from understanding AI concepts to building and maintaining production agentic systems in cloud environments.

observability microservices

5 min

Google Cloud ↗

What Google I/O '26 means for developing agents on Google Cloud

Developers needed a unified, secure way to build AI agents locally and deploy them to Google Cloud with standardized protocols and tooling.

api-design microservices

5 min

Google Cloud ↗

What’s new with the Cross-Cloud Network at Next ‘26

Enabling seamless connectivity, governance, and security across multi-agent AI systems and core applications distributed globally at planet scale.

distributed-systems microservices

5 min

AWS ↗

Building hybrid multi-tenant architecture for stateful services on AWS

Building a multi-tenant architecture that isolates tenants without requiring separate AWS accounts while maintaining stateful service deployments.

load-balancing distributed-systems

5 min

Netflix ↗

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

Netflix needed to manage the lifecycle of machine learning models across multiple domains and teams at scale, moving beyond their original single-domain personalization focus.

ml-systems microservices

5 min

Netflix ↗

Scaling ArchUnit with Nebula ArchRules

Netflix needed a way to enforce consistent architectural patterns and build standards across tens of thousands of Java repositories in their polyrepo strategy.

microservices general

5 min

Netflix ↗

Scaling Camera File Processing at Netflix

Netflix needed to build a scalable, flexible media file processing pipeline that could handle diverse camera formats, workflows, and production requirements while maintaining quick turnaround times for global content production.

microservices distributed-systems

5 min

Netflix ↗

State of Routing in Model Serving

Netflix needed to design a domain-independent traffic routing system for their ML model serving infrastructure that could handle personalized experiences at scale across multiple domains while maintaining high availability.

microservices load-balancing

5 min

Netflix ↗

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale

Netflix needed to build reliable operations infrastructure to support live streaming at massive scale, going from one show per month to nine shows per day with tens of millions of concurrent viewers.

microservices observability

5 min

Spotify ↗

Background Coding Agents: Supercharging Downstream Consumer Dataset Migrations (Honk, Part 4)

Spotify needed to migrate thousands of downstream datasets when source datasets changed structure, without manually updating each consumer application.

data-pipelines microservices

4 min

Spotify ↗

Building a Natural Language Interface to the Spotify Ads API with Claude Code Plugins

Making the Spotify Ads API accessible to non-technical users and reducing friction in ad campaign management by enabling natural language interaction instead of requiring direct API integration.

api-design microservices

4 min

Spotify ↗

Our Multi-Agent Architecture for Smarter Advertising

Spotify needed to optimize ad targeting and delivery at scale by coordinating multiple specialized systems to make smarter advertising decisions rather than relying on monolithic ad selection logic.

microservices ml-systems

4 min

Stripe ↗

Insights from Shoptalk 2026: How agents are changing retail

How to integrate AI agents into ecommerce platforms to enable seamless product discovery and checkout across embedded and third-party surfaces.

api-design real-time-systems

4 min

AWS ↗

Deloitte optimizes EKS environment provisioning and achieves 89% faster testing environments using Amazon EKS and vCluster

Deloitte needed to significantly reduce the time required to provision and spin up testing environments for their Kubernetes workloads.

distributed-systems microservices

3 min

Cloudflare ↗

Agents can now create Cloudflare accounts, buy domains, and deploy

How to enable autonomous agents to programmatically create Cloudflare accounts, purchase domains, and deploy infrastructure without manual dashboard interaction or credential handling.

api-design security

4 min

Cloudflare ↗

Introducing Dynamic Workflows: durable execution that follows the tenant

Enable multi-tenant platforms to execute millions of unique, durable workflows without incurring significant idle infrastructure costs.

distributed-systems microservices

4 min

AWS ↗

Modernizing KYC with AWS serverless solutions and agentic AI for financial services

Traditional rule-based KYC (Know Your Customer) systems lack the autonomous decision-making capability and real-time validation speed needed for modern financial services compliance operations.

serverless real-time-systems

5 min

AWS ↗

PACIFIC enables multi-tenant, sovereign product carbon footprint exchange on the Catena-X data space using AWS

Enable multiple independent organizations to securely exchange Product Carbon Footprint (PCF) data within a shared data space while maintaining data sovereignty and tenant isolation.

microservices security

4 min

Cloudflare ↗

Cloudflare Email Service: now in public beta. Ready for your agents

Enabling AI agents to send, receive, and process email natively as a multi-channel communication medium without requiring developers to build custom email infrastructure.

api-design microservices

4 min

Cloudflare ↗

Cloudflare’s AI Platform: an inference layer designed for agents

Developers needed a unified way to access multiple AI model providers without managing separate integrations and API contracts for each one.

api-design microservices

4 min

AWS ↗

Build a multi-tenant configuration system with tagged storage patterns

Building a scalable multi-tenant configuration service that maintains strict tenant isolation while supporting real-time updates without cache staleness or downtime.

caching storage-systems

5 min

Cloudflare ↗

How we built Organizations to help enterprises manage Cloudflare at scale

Cloudflare needed to enable enterprise customers to manage multiple accounts and resources under a unified organizational structure with centralized authorization and access control.

api-design security

4 min

AWS ↗

Architecting for agentic AI development on AWS

AI agents struggle to iterate rapidly on system design and codebases due to architectural patterns that limit their ability to understand, modify, and validate applications effectively.

microservices serverless

5 min

Cloudflare ↗

Introducing EmDash — the spiritual successor to WordPress that solves plugin security

WordPress plugins pose significant security risks because they run with unrestricted access to the entire system, requiring a safer plugin architecture that isolates untrusted code.

security microservices

4 min

LinkedIn ↗

Securing every Kubernetes workload at scale

Securing thousands of Kubernetes workloads across a large-scale infrastructure requires automated and consistent security policies.

security microservices

3 min

LinkedIn ↗

The LinkedIn Generative AI Application Tech Stack: Personaliza...

Building personalized generative AI features at LinkedIn's scale required a robust and reliable application infrastructure that could serve millions of users.

ml-systems microservices

3 min

AWS ↗

6,000 AWS accounts, three people, one platform: Lessons learned

Managing 6,000 AWS accounts for a multi-tenant serverless SaaS platform with only three people created massive operational challenges around automation, observability, and cost management at scale.

distributed-systems microservices

4 min

AWS ↗

AI-powered event response for Amazon EKS

Responding to operational events in Amazon EKS clusters is often manual, slow, and requires deep expertise, making it difficult to handle incidents at scale across complex Kubernetes environments.

observability ml-systems

3 min

AWS ↗

BASF Digital Farming builds a STAC-based solution on Amazon EKS

BASF Digital Farming needed a scalable way to catalog, discover, and serve large volumes of spatiotemporal geospatial data (satellite imagery, crop data) for their xarvio crop optimization platform, and their existing infrastructure struggled with the scale and query patterns of this data.

microservices storage-systems

4 min

AWS ↗

Build priority-based message processing with Amazon MQ and AWS App Runner

Standard message queues process messages in FIFO order, lacking the ability to prioritize urgent messages over lower-priority ones, which can cause critical tasks to wait behind less important work during high load.

messaging-queues real-time-systems

5 min

AWS ↗

Digital Transformation at Santander: How Platform Engineering is Revolutionizing Cloud Infrastructure

Santander struggled to manage cloud infrastructure supporting billions of daily transactions across 200+ critical systems, facing complexity and scalability challenges in their banking operations.

distributed-systems microservices

5 min

AWS ↗

How Salesforce migrated from Cluster Autoscaler to Karpenter across their fleet of 1,000 EKS clusters

Salesforce's Cluster Autoscaler could not efficiently scale and manage node provisioning across their fleet of 1,000+ EKS clusters, likely suffering from slow scaling decisions, suboptimal bin-packing, and operational complexity at massive scale.

distributed-systems load-balancing

4 min

AWS ↗

Know before you go – AWS re:Invent 2025 guide to Well-Architected and Cloud Optimization sessions

Organizations struggle to design well-architected cloud systems that balance cost optimization, security, reliability, and performance efficiency across increasingly complex AWS environments including AI-powered workloads.

security microservices

5 min

AWS ↗

Mastering millisecond latency and millions of events: The event-driven architecture behind the Amazon Key Suite

The Amazon Key Suite had a tightly coupled monolithic architecture that struggled with reliability and scalability when processing millions of events at millisecond latency requirements across multiple service integrations.

microservices messaging-queues

5 min

Airbnb ↗

From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership

Airbnb's reliance on multiple third-party observability vendors resulted in inconsistent data, fragmented developer experiences, and limitations in cost-effectiveness and reliability at their scale.

observability microservices

5 min

Airbnb ↗

It Wasn’t a Culture Problem: Upleveling Alert Development at Airbnb

Airbnb's Observability as Code alert development process had excessively long development cycles (weeks) due to cumbersome code review workflows, slowing down engineers' ability to create and iterate on alerts at scale across thousands of services.

observability microservices

5 min

Airbnb ↗

Pay As a Local

Airbnb relied primarily on card payments across 220+ global markets, but many users preferred local payment methods, causing checkout friction, reduced accessibility, and lower adoption in key markets.

api-design microservices

5 min

Airbnb ↗

Safeguarding Dynamic Configuration Changes at Scale

Dynamic configuration changes at scale can cause widespread outages if rolled out unsafely—a single bad config update can immediately affect all services and requests without the safety net of a gradual deployment process.

distributed-systems microservices

5 min

Cloudflare ↗

From legacy architecture to Cloudflare One

Organizations struggle to migrate from legacy network security architectures to modern SASE (Secure Access Service Edge) solutions, facing risks from accumulated technical debt and complex dependencies in their existing infrastructure.

security microservices

3 min

Cloudflare ↗

Powering the agents: Workers AI now runs large models, starting with Kimi K2.5

Running large AI models for agent workloads on edge infrastructure was cost-prohibitive and required significant inference stack optimization to serve models like Kimi K2.5 efficiently at scale.

ml-systems distributed-systems

4 min

Dropbox ↗

Building the future: highlights from Dropbox’s 2025 summer intern class

This article is not a technical engineering blog post — it covers Dropbox's 2025 summer intern program highlights, focusing on professional growth, innovation culture, and community building rather than addressing a specific engineering challenge.

microservices

3 min

Dropbox ↗

Insights from our executive roundtable on AI and engineering productivity

Engineering organizations face open questions about how to effectively integrate AI coding tools (like Claude Code and Cursor) into developer workflows and where these tools can have the most measurable impact on productivity.

ml-systems microservices

4 min

Meta ↗

Ranking Engineer Agent (REA): The Autonomous AI Agent Accelerating Meta’s Ads Ranking Innovation

Meta's ads ranking ML experimentation lifecycle required extensive manual intervention from engineers for hypothesis generation, training job launches, failure debugging, and result iteration, slowing down the pace of ranking model innovation.

ml-systems microservices

5 min

Netflix ↗

How Temporal Powers Reliable Cloud Operations at Netflix

Netflix needed reliable orchestration for business-critical cloud operations across teams like Open Connect CDN and Live reliability, but faced operational challenges as Temporal adoption grew since 2021.

distributed-systems microservices

5 min

Netflix ↗

MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix

Netflix needed scalable, deep machine-level understanding of every piece of content across an expanding catalog (including live events and podcasts) to power recommendations and discovery, but building separate models per content type and modality doesn't scale.

ml-systems microservices

5 min

Netflix ↗

Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Netflix needed to spin up hundreds of containers in seconds to serve streaming traffic, but after modernizing their container runtime, they hit an unexpected performance bottleneck rooted in CPU architecture that impaired container scaling efficiency.

distributed-systems real-time-systems

5 min

Netflix ↗

Optimizing Recommendation Systems with JDK’s Vector API

Netflix's Ranker service had a video serendipity scoring feature (computing how different a title is from a user's watch history) consuming ~7.5% of total CPU per node, creating a significant performance bottleneck at their enormous scale.

ml-systems real-time-systems

5 min