Build a multi-tenant configuration system with tagged storage patterns
Building a scalable multi-tenant configuration service that maintains strict tenant isolation while supporting real-time updates without cache staleness or downtime.
Unlock efficient model deployment: Simplified Inference Operator setup on Amazon SageMaker HyperPod
Simplifying the deployment and scheduling of machine learning inference workloads across multiple instances and instance types on Amazon SageMaker HyperPod.
Building a high-volume metrics pipeline with OpenTelemetry and vmagent
Migrating a large-scale metrics pipeline from StatsD to OpenTelemetry while handling production traffic volumes without losing data or blocking dependent systems.
500 Tbps of capacity: 16 years of scaling our global network
How to scale a global content delivery and DDoS mitigation network to handle massive throughput (500 Tbps) while maintaining capacity to protect against record-breaking attacks.
Cloudflare targets 2029 for full post-quantum security
Cloudflare needed to prepare its global infrastructure and services for the threat of quantum computing attacks on current cryptographic standards before 2029.
Welcome to Agents Week
How to enable AI agents to operate effectively at the edge of the internet with the security, performance, and reliability characteristics of Cloudflare's existing infrastructure.
Escaping the Fork: How Meta Modernized WebRTC Across 50+ Use Cases
Meta needed to modernize WebRTC across 50+ use cases while maintaining synchronization with upstream open-source development, avoiding the drift that typically occurs when large projects fork internally.
How Meta Used AI to Map Tribal Knowledge in Large-Scale Data Pipelines
AI coding assistants were ineffective at making useful edits in large-scale data pipelines because they lacked sufficient understanding of complex, multi-repository codebases spanning multiple languages and thousands of files.
Trust But Canary: Configuration Safety at Scale
Safely deploying configuration changes at scale while minimizing the risk of widespread failures caused by faulty configurations.
Automate safety monitoring with computer vision and generative AI
Detecting safety hazards in real time across hundreds of distributed operational sites using video feeds while maintaining low latency and managing the computational complexity of processing multiple camera streams.
How Aigen transformed agricultural robotics for sustainable farming with Amazon SageMaker AI
Aigen needed to scale machine learning pipelines across hundreds of distributed, solar-powered edge robots while managing data labeling and model training challenges in agricultural robotics.
Streamlining access to powerful disaster recovery capabilities of AWS
Organizations need a streamlined way to protect and recover entire AWS workloads across multiple layers (data, compute, infrastructure, networking, and configuration) in the event of a disaster.
Introducing EmDash — the spiritual successor to WordPress that solves plugin security
WordPress plugins pose significant security risks because they run with unrestricted access to the entire system, so a safer plugin architecture is needed that isolates untrusted code.
Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers
Magic Transit customers needed the ability to define and enforce custom DDoS mitigation logic for proprietary and non-standard UDP protocols without being limited to Cloudflare's pre-built detection rules.
Our ongoing commitment to privacy for the 1.1.1.1 public DNS resolver
How to design a public DNS resolver that prioritizes user privacy while maintaining performance and trustworthiness at scale.
Why we're rethinking cache for the AI era
CDN cache systems were designed for human traffic patterns but struggle with the distinct access patterns of AI bot traffic, which now accounts for over 10 billion requests per week and threatens cache efficiency.
Improving storage efficiency in Magic Pocket, our immutable blob store
Dropbox needed to improve storage efficiency and resilience in Magic Pocket, their immutable blob store, when handling variable and changing workloads.
AI Helping Build Better AI: How Agents Accelerate Model Experi...
Training and evaluating AI models is resource-intensive, requiring significant human effort to generate quality training data and assess model outputs.
Reimagining LinkedIn’s search tech stack
LinkedIn's legacy search infrastructure couldn't scale to handle growing query volumes and evolving relevance requirements across its platform.
Scaling LLM-Based ranking systems with SGLang at LinkedIn
LinkedIn's LLM-based ranking systems faced latency and throughput challenges when serving personalized results at scale.
KernelEvolve: How Meta’s Ranking Engineer Agent Optimizes AI Infrastructure
Meta needed to automatically optimize low-level infrastructure and kernel-level parameters for AI ranking models to improve performance without manual tuning.