AWS

Architecting for agentic AI development on AWS

AI agents struggle to iterate rapidly on system design and codebases due to architectural patterns that limit their ability to understand, modify, and validate applications effectively.

microservices serverless
5 min
AWS

AI-powered event response for Amazon EKS

Responding to operational events in Amazon EKS clusters is often manual and slow and requires deep expertise, making it difficult to handle incidents at scale across complex Kubernetes environments.

observability ml-systems
3 min
Airbnb

From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership

Airbnb's reliance on multiple third-party observability vendors resulted in inconsistent data, fragmented developer experiences, and limitations in cost-effectiveness and reliability at their scale.

observability microservices
5 min
Cloudflare

Powering the agents: Workers AI now runs large models, starting with Kimi K2.5

Running large AI models for agent workloads on edge infrastructure was cost-prohibitive and required significant inference stack optimization to serve models like Kimi K2.5 efficiently at scale.

ml-systems distributed-systems
4 min
Meta

Ranking Engineer Agent (REA): The Autonomous AI Agent Accelerating Meta’s Ads Ranking Innovation

Meta's ads ranking ML experimentation lifecycle required extensive manual intervention from engineers for hypothesis generation, training job launches, failure debugging, and result iteration, slowing down the pace of ranking model innovation.

ml-systems microservices
5 min
Cloudflare

From legacy architecture to Cloudflare One

Organizations struggle to migrate from legacy network security architectures to modern SASE (Secure Access Service Edge) solutions, facing risks from accumulated technical debt and complex dependencies in their existing infrastructure.

security microservices
3 min
Airbnb

It Wasn’t a Culture Problem: Upleveling Alert Development at Airbnb

Airbnb's Observability as Code alert development process had development cycles that stretched to weeks because of cumbersome code review workflows, slowing engineers' ability to create and iterate on alerts at scale across thousands of services.

observability microservices
5 min
Netflix

Optimizing Recommendation Systems with JDK’s Vector API

Netflix's Ranker service had a video serendipity scoring feature (computing how different a title is from a user's watch history) consuming ~7.5% of total CPU per node, creating a significant performance bottleneck at their enormous scale.

ml-systems real-time-systems
5 min
AWS

6,000 AWS accounts, three people, one platform: Lessons learned

Managing 6,000 AWS accounts for a multi-tenant serverless SaaS platform with only three people created massive operational challenges around automation, observability, and cost management at scale.

distributed-systems microservices
4 min
AWS

Digital Transformation at Santander: How Platform Engineering is Revolutionizing Cloud Infrastructure

Santander struggled to manage cloud infrastructure supporting billions of daily transactions across 200+ critical systems, facing complexity and scalability challenges in their banking operations.

distributed-systems microservices
5 min
Netflix

MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix

Netflix needed scalable, deep machine-level understanding of every piece of content across an expanding catalog (including live events and podcasts) to power recommendations and discovery, but building separate models per content type and modality did not scale.

ml-systems microservices
5 min
Netflix

Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Netflix needed to spin up hundreds of containers in seconds to serve streaming traffic, but after modernizing their container runtime, they hit an unexpected performance bottleneck rooted in CPU architecture that impaired container scaling efficiency.

distributed-systems real-time-systems
5 min
Airbnb

Safeguarding Dynamic Configuration Changes at Scale

Dynamic configuration changes at scale can cause widespread outages if rolled out unsafely: a single bad config update can immediately affect all services and requests, without the safety net of a gradual deployment process.

distributed-systems microservices
5 min
Dropbox

Insights from our executive roundtable on AI and engineering productivity

Engineering organizations face open questions about how to effectively integrate AI coding tools (like Claude Code and Cursor) into developer workflows and where these tools can have the most measurable impact on productivity.

ml-systems microservices
4 min
AWS

Mastering millisecond latency and millions of events: The event-driven architecture behind the Amazon Key Suite

The Amazon Key Suite had a tightly coupled monolithic architecture that struggled with reliability and scalability when processing millions of events under millisecond latency requirements across multiple service integrations.

microservices messaging-queues
5 min
AWS

How Salesforce migrated from Cluster Autoscaler to Karpenter across their fleet of 1,000 EKS clusters

At Salesforce, Cluster Autoscaler could not efficiently manage node provisioning across a fleet of 1,000+ EKS clusters, likely because of slow scaling decisions, suboptimal bin-packing, and operational complexity at that scale.

distributed-systems load-balancing
4 min
Airbnb

Pay As a Local

Airbnb relied primarily on card payments across 220+ global markets, but many users preferred local payment methods, causing checkout friction, reduced accessibility, and lower adoption in key markets.

api-design microservices
5 min
Netflix

How Temporal Powers Reliable Cloud Operations at Netflix

Netflix needed reliable orchestration for business-critical cloud operations across teams like Open Connect CDN and Live reliability, but ran into operational challenges as its Temporal adoption grew from 2021 onward.

distributed-systems microservices
5 min
Dropbox

Building the future: highlights from Dropbox’s 2025 summer intern class

This post is not a technical engineering article: it covers highlights of Dropbox's 2025 summer intern program, focusing on professional growth, innovation culture, and community building rather than a specific engineering challenge.

microservices
3 min
AWS

Build priority-based message processing with Amazon MQ and AWS App Runner

Standard message queues process messages in FIFO order, lacking the ability to prioritize urgent messages over lower-priority ones, which can cause critical tasks to wait behind less important work during high load.

messaging-queues real-time-systems
5 min
AWS

Know before you go – AWS re:Invent 2025 guide to Well-Architected and Cloud Optimization sessions

Organizations struggle to design well-architected cloud systems that balance cost optimization, security, reliability, and performance efficiency across increasingly complex AWS environments including AI-powered workloads.

security microservices
5 min
AWS

BASF Digital Farming builds a STAC-based solution on Amazon EKS

BASF Digital Farming needed a scalable way to catalog, discover, and serve large volumes of spatiotemporal geospatial data (satellite imagery, crop data) for their xarvio crop optimization platform, and their existing infrastructure struggled with the scale and query patterns of this data.

microservices storage-systems
4 min