Browse past weeks of engineering reads.
How to design systems that can recover from ransomware and destructive cyberattacks when backups, credentials, and infrastructure components have been compromised.
ALS GeoAnalytics needed to scale machine learning model training and inference for core logging analysis while managing computational costs effectively.
Airbnb needed to scale their identity graph infrastructure to efficiently resolve user identities and understand relationships between entities across their platform.
Enable on-device AI models to coordinate complex tasks across external data sources while maintaining persistent user context and proactive engagement without relying solely on cloud connectivity.
AI training pipelines were bottlenecked by slow data I/O when accessing training datasets stored in Google Cloud, limiting throughput and increasing total training time.
A partitioning change to a petabyte-scale ClickHouse cluster caused billing pipeline jobs to stall without obvious error signals in standard metrics.
Ensuring end-to-end encrypted messages and conversation history survive device loss, device switches, and extended offline periods without compromising encryption guarantees.
Meta needed to migrate their legacy data ingestion system to a new architecture while maintaining reliability and consistency for real-time social graph snapshots at massive scale.
Netflix needed to efficiently extract and surface key moments from hundreds or thousands of hours of raw video footage for editorial teams to accelerate the creative content production process.
How to enable end-to-end encrypted backups for messaging applications while ensuring recovery codes remain inaccessible to Meta, cloud providers, and other third parties.
Oldcastle needed to overcome the limitations of traditional ERP reporting to enable real-time analytics and dashboards for their Infor ERP system.
Building a metrics storage system capable of ingesting 50 million samples per second while reliably storing 2.5 petabytes of time series data at scale.
AI agents lack persistent memory mechanisms to retain context, learn from interactions, and improve decision-making over time.
Providing agents, developers, and automations with scalable, Git-compatible versioned storage that can handle tens of millions of repositories without forcing them to manage infrastructure.
GPU memory bandwidth constraints were limiting LLM inference efficiency across Cloudflare's distributed edge network, requiring optimization to deliver faster and cheaper inference.
Building a scalable multi-tenant configuration service that maintains strict tenant isolation while supporting real-time updates without cache staleness or downtime.
Organizations need a streamlined way to protect and recover entire AWS workloads across multiple layers (data, compute, infrastructure, networking, and configuration) in the event of a disaster.
Cloudflare's Atlantis instance took 30 minutes to restart due to a Kubernetes volume permission bottleneck.
Dropbox needed to improve storage efficiency and resilience in Magic Pocket, their immutable blob store, when handling variable and changing workloads.
LinkedIn's logging infrastructure couldn't scale cost-effectively to handle the massive volume of operational logs across thousands of services.
BASF Digital Farming needed a scalable way to catalog, discover, and serve large volumes of spatiotemporal geospatial data (satellite imagery, crop data) for their xarvio crop optimization platform, and their existing infrastructure struggled with the scale and query patterns of this data.
Artera needed to develop and scale an AI-powered prostate cancer diagnostic test, requiring significant compute resources for model training/inference and a reliable pipeline to deliver timely, personalized treatment recommendations.
Organizations migrating to or operating in the cloud encounter hidden and unexpected costs due to suboptimal architectural decisions, resource misconfigurations, and lack of adherence to cloud best practices.
Large machine learning models require significant memory and compute resources, making deployment and inference expensive and slow, especially in resource-constrained environments.
Running AI inference for products like Dropbox Dash at scale is expensive and resource-intensive, requiring efficient use of compute and memory to make the product accessible to a broad user base.
Meta needed to handle massive-scale media processing (encoding, transcoding, filtering) across its family of apps, requiring efficient orchestration of complex audio/video pipelines using FFmpeg at an unprecedented scale.
Meta's large-scale infrastructure relies on jemalloc for memory allocation, but the codebase had accumulated maintenance burden and needed modernization to keep pace with evolving hardware and workload demands.
Delivering high-quality streaming video across diverse devices and varying network conditions requires efficient video encoding; legacy codecs like H.264 and VP9 were limiting compression efficiency, consuming more bandwidth for equivalent visual quality.
Netflix needed a custom origin server to bridge its cloud-based live streaming pipelines with its CDN (Open Connect), handling the unique challenges of live content delivery such as low-latency requirements, reliability, and the real-time nature of live streams compared to on-demand content.