Browse past weeks of engineering reads.
Cloudflare's Atlantis instance took 30 minutes to restart due to a Kubernetes volume permission bottleneck.
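A likely culprit in cases like this (an assumption here, not a detail stated above) is the recursive ownership change Kubernetes performs on every file in a persistent volume when a pod's `securityContext` sets `fsGroup` — slow on volumes with many small files. A minimal sketch of the common mitigation, assuming Kubernetes 1.23+ where `fsGroupChangePolicy` is stable:

```yaml
# Hypothetical pod spec fragment: skip the recursive chown/chmod
# when the volume root already has the expected group ownership.
apiVersion: v1
kind: Pod
metadata:
  name: atlantis          # placeholder name for illustration
spec:
  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"   # default is "Always"
  containers:
    - name: app
      image: example/app:latest             # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc
```

With `OnRootMismatch`, the kubelet checks only the volume root's permissions and skips the full recursive walk when they already match, which can turn a minutes-long startup into seconds.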
Organizations migrating to or operating in the cloud incur hidden, unexpected costs from suboptimal architectural decisions, misconfigured resources, and departures from cloud best practices.
Meta needed to handle massive-scale media processing (encoding, transcoding, filtering) across its family of apps, requiring efficient orchestration of complex audio/video pipelines using FFmpeg at an unprecedented scale.
Meta's large-scale infrastructure relies on jemalloc for memory allocation, but the codebase had accumulated a maintenance burden and needed modernization to keep pace with evolving hardware and workload demands.
Running AI inference for products like Dropbox Dash at scale is expensive and resource-intensive, requiring efficient use of compute and memory to make the product accessible to a broad user base.
Artera needed to develop and scale an AI-powered prostate cancer diagnostic test, requiring significant compute resources for model training/inference and a reliable pipeline to deliver timely, personalized treatment recommendations.
Netflix needed a custom origin server to bridge its cloud-based live streaming pipelines with its CDN (Open Connect), handling challenges unique to live content delivery — low-latency requirements, reliability, and the real-time nature of live streams — that on-demand content does not pose.
Delivering high-quality streaming video across diverse devices and varying network conditions requires efficient video encoding; legacy codecs like H.264 and VP9 were limiting compression efficiency, consuming more bandwidth for equivalent visual quality.
BASF Digital Farming needed a scalable way to catalog, discover, and serve large volumes of spatiotemporal geospatial data (satellite imagery, crop data) for their xarvio crop optimization platform, and their existing infrastructure struggled with the scale and query patterns of this data.
Large machine learning models require significant memory and compute resources, making deployment and inference expensive and slow, especially in resource-constrained environments.
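As a generic illustration of why model compression helps with the cost described above (not necessarily the article's technique), here is a minimal sketch of symmetric int8 post-training quantization using NumPy; the names `quantize_int8` and `dequantize` are made up for this example:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric post-training quantization of float32 weights to int8."""
    scale = float(np.abs(w).max()) / 127.0   # map largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32; rounding error is
# bounded by half the quantization step.
err = float(np.abs(dequantize(q, scale) - w).max())
```

Storing weights as int8 cuts memory (and memory bandwidth during inference) by 4x versus float32, at the cost of a small, bounded approximation error per weight.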