Browse past weeks of engineering reads.
Designing monitoring and observability systems that remain functional and reliable even when the core infrastructure they monitor is failing or degraded.
Netflix needed to build reliable operations infrastructure to support live streaming at massive scale, going from one show per month to nine shows per day with tens of millions of concurrent viewers.
Cloudflare needed to make their global edge infrastructure more resilient to configuration changes and prevent widespread outages caused by unsafe deployments.
Dynamic configuration changes at scale can cause widespread outages if rolled out unsafely—a single bad config update can immediately affect all services and requests without the safety net of a gradual deployment process.