Build a multi-tenant configuration system with tagged storage patterns
Building a scalable multi-tenant configuration service that maintains strict tenant isolation while supporting real-time updates without cache staleness or downtime.
Unlock efficient model deployment: Simplified Inference Operator setup on Amazon SageMaker HyperPod
Simplifying the deployment and scheduling of machine learning inference workloads across multiple instances and instance types on Amazon SageMaker HyperPod.
Building a high-volume metrics pipeline with OpenTelemetry and vmagent
Migrating a large-scale metrics pipeline from StatsD to OpenTelemetry while handling production traffic volumes without losing data or blocking dependent systems.
500 Tbps of capacity: 16 years of scaling our global network
How to scale a global content delivery and DDoS mitigation network to handle massive throughput (500 Tbps) while maintaining capacity to protect against record-breaking attacks.
Cloudflare targets 2029 for full post-quantum security
Cloudflare needed to prepare its global infrastructure and services for the threat of quantum computing attacks on current cryptographic standards before 2029.
From bytecode to bytes: automated magic packet generation
Cloudflare needed to automatically generate malware trigger packets for BPF bytecode analysis, which previously required hours of manual work.
How we built Organizations to help enterprises manage Cloudflare at scale
Cloudflare needed to enable enterprise customers to manage multiple accounts and resources under a unified organizational structure with centralized authorization and access control.
Welcome to Agents Week
How to enable AI agents to operate effectively at the edge of the internet with the security, performance, and reliability characteristics of Cloudflare's existing infrastructure.
Escaping the Fork: How Meta Modernized WebRTC Across 50+ Use Cases
Meta needed to modernize WebRTC across 50+ use cases while maintaining synchronization with upstream open-source development, avoiding the drift that typically occurs when large projects fork internally.
How Meta Used AI to Map Tribal Knowledge in Large-Scale Data Pipelines
AI coding assistants were ineffective at making useful edits in large-scale data pipelines because they lacked sufficient understanding of complex, multi-repository codebases spanning multiple languages and thousands of files.
Trust But Canary: Configuration Safety at Scale
Safely deploying configuration changes at scale while minimizing the risk of widespread failures caused by faulty configurations.
Automate safety monitoring with computer vision and generative AI
Detecting safety hazards in real time across hundreds of distributed operational sites using video feeds while maintaining low latency and managing the computational complexity of processing multiple camera streams.
How Aigen transformed agricultural robotics for sustainable farming with Amazon SageMaker AI
Aigen needed to scale machine learning pipelines across hundreds of distributed, solar-powered edge robots while managing data labeling and model training challenges in agricultural robotics.
Streamlining access to powerful disaster recovery capabilities of AWS
Organizations need a streamlined way to protect and recover entire AWS workloads across multiple layers (data, compute, infrastructure, networking, and configuration) in the event of a disaster.
My Journey to Airbnb — Jonathan Woodard
This article does not describe a specific engineering problem or technical solution.
Cloudflare Client-Side Security: smarter detection, now open to everyone
Detecting sophisticated client-side security threats like zero-day exploits in real time across millions of requests while minimizing false positives.
Introducing EmDash — the spiritual successor to WordPress that solves plugin security
WordPress plugins pose significant security risks because they run with unrestricted access to the entire system, requiring a safer plugin architecture that isolates untrusted code.
Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers
Magic Transit customers needed the ability to define and enforce custom DDoS mitigation logic for proprietary and non-standard UDP protocols without being limited to Cloudflare's pre-built detection rules.
Our ongoing commitment to privacy for the 1.1.1.1 public DNS resolver
How to design a public DNS resolver that prioritizes user privacy while maintaining performance and trustworthiness at scale.
Why we're rethinking cache for the AI era
CDN cache systems were designed for human traffic patterns but struggle with the distinct access patterns of AI bot traffic, which now represents over 10 billion requests per week and threatens cache efficiency.
Improving storage efficiency in Magic Pocket, our immutable blob store
Dropbox needed to improve storage efficiency and resilience in Magic Pocket, its immutable blob store, under variable and changing workloads.
AI Helping Build Better AI: How Agents Accelerate Model Experi...
Training and evaluating AI models is resource-intensive, requiring significant human effort to generate quality training data and assess model outputs.
Announcing Our LinkedIn-Cornell 2024 Grant Recipients
Advancing AI research requires collaboration between industry and academia, but funding and partnership models need structured programs.
Career stories: Influencing engineering growth at LinkedIn
Growing engineering teams at scale requires clear career frameworks and mentorship to help engineers develop technical leadership skills.
Career stories: The math-music connection in data science
Data science teams need diverse skill sets that blend mathematical rigor with creative problem-solving to build effective ML systems.
Driving data enhancement & recruitment success with LinkedIn’s unified integrations
LinkedIn's recruitment platform needed richer data signals to improve candidate matching and recruiter success rates.
Engineering the next generation of LinkedIn’s Feed
LinkedIn's Feed needed to evolve to handle increasing content diversity, real-time ranking signals, and personalization at massive scale.
Introducing Northguard and Xinfra: scalable log storage at LinkedIn
LinkedIn's logging infrastructure couldn't scale cost-effectively to handle the massive volume of operational logs across thousands of services.
Optimizing LinkedIn Sales Navigator’s search pipeline with Spark
LinkedIn Sales Navigator's search pipeline had latency issues as query complexity and data volume grew.
Reimagining LinkedIn’s search tech stack
LinkedIn's legacy search infrastructure couldn't scale to handle growing query volumes and evolving relevance requirements across its platform.
Scaling LLM-Based ranking systems with SGLang at LinkedIn
LinkedIn's LLM-based ranking systems faced latency and throughput challenges when serving personalized results at scale.
Securing every Kubernetes workload at scale
Securing thousands of Kubernetes workloads across a large-scale infrastructure requires automated and consistent security policies.
The LinkedIn Generative AI Application Tech Stack: Personaliza...
Building personalized generative AI features at LinkedIn's scale required a robust and reliable application infrastructure that could serve millions of users.
AI for American-Produced Cement and Concrete
Designing high-quality, sustainable concrete mixes produced in the United States while optimizing for performance characteristics.
KernelEvolve: How Meta’s Ranking Engineer Agent Optimizes AI Infrastructure
Meta needed to automatically optimize low-level infrastructure and kernel-level parameters for AI ranking models to improve performance without manual tuning.
Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads
Meta needed to scale its ads ranking models to LLM-scale complexity and size while meeting the inference latency requirements of real-time ad serving.