Browse past weeks of engineering reads.
Airbnb needed to scale their identity graph infrastructure to efficiently resolve user identities and understand relationships between entities across their platform.
Building production-grade AI agents that can maintain context and state across long-running enterprise workflows spanning days or weeks without losing information during idle periods or server restarts.
A partitioning change to a petabyte-scale ClickHouse cluster caused billing pipeline jobs to stall without obvious error signals in standard metrics.
Building a social discovery system that efficiently surfaces Reels watched and reacted to by friends while scaling to billions of users.
Query performance degradation at massive scale (10+ trillion rows, 15M events/second) where repeated identical queries were consuming excessive resources and impacting latency.
How to build a durable workflow execution engine that can recover from failures mid-process without losing state or duplicating work.
Oldcastle needed to overcome the limitations of traditional ERP reporting to enable real-time analytics and dashboards for their Infor ERP system.
AI agents lack persistent memory mechanisms to retain context, learn from interactions, and improve decision-making over time.
Enabling serverless applications to connect to managed relational databases without managing infrastructure or dealing with connection pooling complexities.
LinkedIn's recruitment platform needed richer data signals to improve candidate matching and recruiter success rates.
Airbnb needed to advance its AI, data science, and machine learning capabilities across multiple domains (NLP, optimization, measurement science) to improve its travel and living platform, requiring solutions to challenges in search ranking, recommendation, experimentation, and large-scale data processing.
Airbnb's multi-tenant key-value store (Mussel) used static rate limiting that couldn't adapt to varying traffic patterns and spikes, risking degraded performance and reliability for all tenants during surges.
Netflix's relational database ecosystem lacked standardization, with databases spread across RDS Postgres and other technologies, leading to inconsistent functionality, suboptimal performance, and higher total cost of ownership.
Netflix's localization analytics infrastructure (tracking dubbing, subtitling, and translation across hundreds of languages and regions) could not keep pace with the rapidly growing scale of global content, making it difficult to derive timely insights for content localization decisions.
Netflix's Graph Search platform for federated enterprise data required users to write structured queries, limiting accessibility and ease of use despite the system being scalable and configurable.