Archives — Distributed Readings

AWS ↗

Cyber resilience on AWS: A reference approach for recovery from ransomware and destructive events

How to design systems that can recover from ransomware and destructive cyberattacks when backups, credentials, and infrastructure components have been compromised.

security storage-systems

4 min

AWS ↗

How ALS GeoAnalytics LITHOLENS ™ revolutionizes core logging through machine learning with Amazon EKS

ALS GeoAnalytics needed to scale machine learning model training and inference for core logging analysis while managing computational costs effectively.

distributed-systems ml-systems

3 min

Airbnb ↗

Scaling Airbnb’s identity graph with a unified knowledge graph infrastructure

Airbnb needed to scale their identity graph infrastructure to efficiently resolve user identities and understand relationships between entities across their platform.

databases distributed-systems

5 min

Google ↗

A Smarter Google AI Edge Gallery: MCP integration, notifications, and session continuity

Enable on-device AI models to coordinate complex tasks across external data sources while maintaining persistent user context and proactive engagement without relying solely on cloud connectivity.

api-design ml-systems

5 min

Google ↗

Speeding Up AI: Bringing Google Colossus to PyTorch via GCSFS and Rapid Bucket

AI training pipelines were bottlenecked by slow data I/O when accessing training datasets stored in Google Cloud, limiting throughput and increasing total training time.

storage-systems ml-systems

5 min

Cloudflare ↗

Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

A partitioning change to a petabyte-scale ClickHouse cluster caused billing pipeline jobs to stall without obvious error signals in standard metrics.

databases observability

4 min

Meta ↗

Labyrinth 1.1: Making End-to-End Encrypted Backups Even More Reliable

Ensuring end-to-end encrypted messages and conversation history survive device loss, device switches, and extended offline periods without compromising encryption guarantees.

storage-systems security

5 min

Meta ↗

Migrating Data Ingestion Systems at Meta Scale

Meta needed to migrate their legacy data ingestion system to a new architecture while maintaining reliability and consistency for real-time social graph snapshots at massive scale.

distributed-systems storage-systems

5 min

Netflix ↗

Powering Multimodal Intelligence for Video Search

Netflix needed to efficiently extract and surface key moments from hundreds or thousands of hours of raw video footage for editorial teams to accelerate the creative content production process.

ml-systems search

5 min

Meta ↗

How Meta Is Strengthening End-to-End Encrypted Backups

How to enable end-to-end encrypted backups for messaging applications while ensuring recovery codes remain inaccessible to Meta, cloud providers, and other third parties.

security storage-systems

5 min

AWS ↗

Real-time analytics: Oldcastle integrates Infor with Amazon Aurora and Amazon Quick Sight

Oldcastle needed to overcome the limitations of traditional ERP reporting to enable real-time analytics and dashboards for their Infor ERP system.

databases real-time-systems

5 min

Airbnb ↗

Building a fault-tolerant metrics storage system at Airbnb

Building a metrics storage system capable of ingesting 50 million samples per second while reliably storing 2.5 petabytes of time series data at scale.

observability storage-systems

5 min

Cloudflare ↗

Agents that remember: introducing Agent Memory

AI agents lack persistent memory mechanisms to retain context, learn from interactions, and improve decision-making over time.

storage-systems ml-systems

3 min

Cloudflare ↗

Artifacts: versioned storage that speaks Git

Providing agents, developers, and automations with scalable, Git-compatible versioned storage that can handle tens of millions of repositories without forcing them to manage infrastructure.

storage-systems api-design

4 min

Cloudflare ↗

Unweight: how we compressed an LLM 22% without sacrificing quality

GPU memory bandwidth constraints were limiting LLM inference efficiency across Cloudflare's distributed edge network, requiring optimization to deliver faster and cheaper inference.

ml-systems distributed-systems

4 min

AWS ↗

Build a multi-tenant configuration system with tagged storage patterns

Building a scalable multi-tenant configuration service that maintains strict tenant isolation while supporting real-time updates without cache staleness or downtime.

caching storage-systems

5 min

AWS ↗

Streamlining access to powerful disaster recovery capabilities of AWS

Organizations need a streamlined way to protect and recover entire AWS workloads across multiple layers (data, compute, infrastructure, networking, and configuration) in the event of a disaster.

storage-systems security

5 min

Cloudflare ↗

A one-line Kubernetes fix that saved 600 hours a year

Cloudflare's Atlantis instance took 30 minutes to restart due to a Kubernetes volume permission bottleneck.

observability storage-systems

4 min

Dropbox ↗

Improving storage efficiency in Magic Pocket, our immutable blob store

Dropbox needed to improve storage efficiency and resilience in Magic Pocket, their immutable blob store, when handling variable and changing workloads.

storage-systems observability

3 min

LinkedIn ↗

Introducing Northguard and Xinfra: scalable log storage at LinkedIn

LinkedIn's logging infrastructure couldn't scale cost-effectively to handle the massive volume of operational logs across thousands of services.

observability storage-systems

3 min

AWS ↗

BASF Digital Farming builds a STAC-based solution on Amazon EKS

BASF Digital Farming needed a scalable way to catalog, discover, and serve large volumes of spatiotemporal geospatial data (satellite imagery, crop data) for their xarvio crop optimization platform, and their existing infrastructure struggled with the scale and query patterns of this data.

microservices storage-systems

4 min

AWS ↗

How Artera enhances prostate cancer diagnostics using AWS

Artera needed to develop and scale an AI-powered prostate cancer diagnostic test, requiring significant compute resources for model training/inference and a reliable pipeline to deliver timely, personalized treatment recommendations.

ml-systems storage-systems

4 min

AWS ↗

The Hidden Price Tag: Uncovering Hidden Costs in Cloud Architectures with the AWS Well-Architected Framework

Organizations migrating to or operating in the cloud encounter hidden and unexpected costs due to suboptimal architectural decisions, resource misconfigurations, and lack of adherence to cloud best practices.

distributed-systems storage-systems

5 min

Dropbox ↗

Half-Quadratic Quantization of large machine learning models

Large machine learning models require significant memory and compute resources, making deployment and inference expensive and slow, especially in resource-constrained environments.

ml-systems storage-systems

3 min

Dropbox ↗

How low-bit inference enables efficient AI

Running AI inference for products like Dropbox Dash at scale is expensive and resource-intensive, requiring efficient use of compute and memory to make the product accessible to a broad user base.

ml-systems storage-systems

3 min

Meta ↗

FFmpeg at Meta: Media Processing at Scale

Meta needed to handle massive-scale media processing (encoding, transcoding, filtering) across its family of apps, requiring efficient orchestration of complex audio/video pipelines using FFmpeg at an unprecedented scale.

storage-systems distributed-systems

5 min

Meta ↗

Investing in Infrastructure: Meta’s Renewed Commitment to jemalloc

Meta's large-scale infrastructure relies on jemalloc for memory allocation, but the codebase had accumulated maintenance burden and needed modernization to keep pace with evolving hardware and workload demands.

storage-systems distributed-systems

5 min

Netflix ↗

AV1 — Now Powering 30% of Netflix Streaming

Delivering high-quality streaming video across diverse devices and varying network conditions requires efficient video encoding; legacy codecs like H.264 and VP9 were limiting compression efficiency, consuming more bandwidth for equivalent visual quality.

real-time-systems storage-systems

5 min

Netflix ↗

Netflix Live Origin

Netflix needed a custom origin server to bridge its cloud-based live streaming pipelines with its CDN (Open Connect), handling the unique challenges of live content delivery such as low-latency requirements, reliability, and the real-time nature of live streams compared to on-demand content.

real-time-systems distributed-systems

5 min