AWS

Cyber resilience on AWS: A reference approach for recovery from ransomware and destructive events

How to design systems that can recover from ransomware and destructive cyberattacks when backups, credentials, and infrastructure components have been compromised.

security storage-systems
4 min
AWS

How ALS GeoAnalytics LITHOLENS™ revolutionizes core logging through machine learning with Amazon EKS

ALS GeoAnalytics needed to scale machine learning model training and inference for core logging analysis while managing computational costs effectively.

distributed-systems ml-systems
3 min
Airbnb

Scaling Airbnb’s identity graph with a unified knowledge graph infrastructure

Airbnb needed to scale their identity graph infrastructure to efficiently resolve user identities and understand relationships between entities across their platform.

databases distributed-systems
5 min
Google

A Smarter Google AI Edge Gallery: MCP integration, notifications, and session continuity

Enable on-device AI models to coordinate complex tasks across external data sources while maintaining persistent user context and proactive engagement without relying solely on cloud connectivity.

api-design ml-systems
5 min
Google

Speeding Up AI: Bringing Google Colossus to PyTorch via GCSFS and Rapid Bucket

AI training pipelines were bottlenecked by slow data I/O when accessing training datasets stored in Google Cloud, limiting throughput and increasing total training time.

storage-systems ml-systems
5 min
Cloudflare

Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

A partitioning change to a petabyte-scale ClickHouse cluster caused billing pipeline jobs to stall without obvious error signals in standard metrics.

databases observability
4 min
Meta

Labyrinth 1.1: Making End-to-End Encrypted Backups Even More Reliable

Ensuring end-to-end encrypted messages and conversation history survive device loss, device switches, and extended offline periods without compromising encryption guarantees.

storage-systems security
5 min
Meta

Migrating Data Ingestion Systems at Meta Scale

Meta needed to migrate their legacy data ingestion system to a new architecture while maintaining reliability and consistency for real-time social graph snapshots at massive scale.

distributed-systems storage-systems
5 min
Netflix

Powering Multimodal Intelligence for Video Search

Netflix needed to efficiently extract and surface key moments from hundreds or thousands of hours of raw video footage so that editorial teams could accelerate the creative content production process.

ml-systems search
5 min
Meta

How Meta Is Strengthening End-to-End Encrypted Backups

How to enable end-to-end encrypted backups for messaging applications while ensuring recovery codes remain inaccessible to Meta, cloud providers, and other third parties.

security storage-systems
5 min
AWS

Real-time analytics: Oldcastle integrates Infor with Amazon Aurora and Amazon Quick Sight

Oldcastle needed to overcome the limitations of traditional ERP reporting to enable real-time analytics and dashboards for their Infor ERP system.

databases real-time-systems
5 min
Airbnb

Building a fault-tolerant metrics storage system at Airbnb

Building a metrics storage system capable of ingesting 50 million samples per second while reliably storing 2.5 petabytes of time series data.

observability storage-systems
5 min
Cloudflare

Agents that remember: introducing Agent Memory

AI agents lack persistent memory mechanisms to retain context, learn from interactions, and improve decision-making over time.

storage-systems ml-systems
3 min
Cloudflare

Artifacts: versioned storage that speaks Git

Providing agents, developers, and automations with scalable, Git-compatible versioned storage that can handle tens of millions of repositories without forcing them to manage infrastructure.

storage-systems api-design
4 min
Cloudflare

Unweight: how we compressed an LLM 22% without sacrificing quality

GPU memory bandwidth constraints were limiting LLM inference efficiency across Cloudflare's distributed edge network, requiring optimization to deliver faster and cheaper inference.

ml-systems distributed-systems
4 min
AWS

Build a multi-tenant configuration system with tagged storage patterns

Building a scalable multi-tenant configuration service that maintains strict tenant isolation while supporting real-time updates without cache staleness or downtime.

caching storage-systems
5 min
AWS

Streamlining access to powerful disaster recovery capabilities of AWS

Organizations need a streamlined way to protect and recover entire AWS workloads across multiple layers (data, compute, infrastructure, networking, and configuration) in the event of a disaster.

storage-systems security
5 min
Cloudflare

A one-line Kubernetes fix that saved 600 hours a year

Cloudflare's Atlantis instance took 30 minutes to restart due to a Kubernetes volume permission bottleneck.

observability storage-systems
4 min
Dropbox

Improving storage efficiency in Magic Pocket, our immutable blob store

Dropbox needed to improve storage efficiency and resilience in Magic Pocket, their immutable blob store, when handling variable and changing workloads.

storage-systems observability
3 min
LinkedIn

Introducing Northguard and Xinfra: scalable log storage at LinkedIn

LinkedIn's logging infrastructure couldn't scale cost-effectively to handle the massive volume of operational logs across thousands of services.

observability storage-systems
3 min
AWS

BASF Digital Farming builds a STAC-based solution on Amazon EKS

BASF Digital Farming needed a scalable way to catalog, discover, and serve large volumes of spatiotemporal geospatial data (satellite imagery, crop data) for their xarvio crop optimization platform, and their existing infrastructure struggled with the scale and query patterns of this data.

microservices storage-systems
4 min
AWS

How Artera enhances prostate cancer diagnostics using AWS

Artera needed to develop and scale an AI-powered prostate cancer diagnostic test, requiring significant compute resources for model training/inference and a reliable pipeline to deliver timely, personalized treatment recommendations.

ml-systems storage-systems
4 min
AWS

The Hidden Price Tag: Uncovering Hidden Costs in Cloud Architectures with the AWS Well-Architected Framework

Organizations migrating to or operating in the cloud encounter hidden and unexpected costs due to suboptimal architectural decisions, resource misconfigurations, and lack of adherence to cloud best practices.

distributed-systems storage-systems
5 min
Dropbox

Half-Quadratic Quantization of large machine learning models

Large machine learning models require significant memory and compute resources, making deployment and inference expensive and slow, especially in resource-constrained environments.

ml-systems storage-systems
3 min
Dropbox

How low-bit inference enables efficient AI

Running AI inference for products like Dropbox Dash at scale is expensive and resource-intensive, requiring efficient use of compute and memory to make the product accessible to a broad user base.

ml-systems storage-systems
3 min
Meta

FFmpeg at Meta: Media Processing at Scale

Meta needed to handle massive-scale media processing (encoding, transcoding, filtering) across its family of apps, requiring efficient orchestration of complex audio/video pipelines using FFmpeg.

storage-systems distributed-systems
5 min
Meta

Investing in Infrastructure: Meta’s Renewed Commitment to jemalloc

Meta's large-scale infrastructure relies on jemalloc for memory allocation, but the codebase had accumulated maintenance burden and needed modernization to keep pace with evolving hardware and workload demands.

storage-systems distributed-systems
5 min
Netflix

AV1 — Now Powering 30% of Netflix Streaming

Delivering high-quality streaming video across diverse devices and varying network conditions requires efficient video encoding; legacy codecs like H.264 and VP9 were limiting compression efficiency, consuming more bandwidth for equivalent visual quality.

real-time-systems storage-systems
5 min
Netflix

Netflix Live Origin

Netflix needed a custom origin server to bridge its cloud-based live streaming pipelines with its CDN (Open Connect), handling the unique challenges of live content delivery such as low-latency requirements, reliability, and the real-time nature of live streams compared to on-demand content.

real-time-systems distributed-systems
5 min