Browse past weeks of engineering reads.
Security teams needed visibility and compliance monitoring of Claude Enterprise API usage across their organization without leaving their existing security infrastructure.
Determining whether security-focused LLMs can effectively identify vulnerabilities in live production infrastructure code at scale.
Browser Run needed higher usage limits, better performance, and improved reliability while increasing development velocity for their browser automation service.
A partitioning change to a petabyte-scale ClickHouse cluster caused billing pipeline jobs to stall without obvious error signals in standard metrics.
Rapidly detect, investigate, and mitigate a critical Linux kernel privilege escalation vulnerability across a global edge computing fleet without impacting customers.
When DENIC published invalid DNSSEC signatures for the .de TLD, DNS resolvers like 1.1.1.1 faced a critical decision: reject all .de domain queries due to signature validation failures or serve potentially stale cached responses to maintain availability.
Cloudflare needed to make their global edge infrastructure more resilient to configuration changes and prevent widespread outages caused by unsafe deployments.
How to measure, analyze, and publicly report on Internet disruptions caused by geopolitical events, infrastructure attacks, and power outages in real-time across global networks.
Rust panics in Cloudflare Workers were fatal and poisoned the entire worker instance, making applications unreliable when unhandled errors occurred.
Cloudflare needed to scale code review processes across their engineering organization while maintaining code quality and security standards without overwhelming human reviewers.
Cloudflare needed to build an internal AI engineering stack that could handle massive scale (20 million requests, 241 billion tokens) while dogfooding their own platform products.
Cloudflare needed to improve request handling performance across its global network to maintain competitive advantage over other CDNs.
AI agents needed a way to interact with browsers at scale while maintaining visibility and control over automated actions, requiring higher concurrency and real-time debugging capabilities.
How to efficiently run inference for extra-large language models on edge infrastructure while maintaining low latency and high throughput across distributed Cloudflare servers.
Users had to manually navigate multiple tabs and interfaces within the Cloudflare dashboard to troubleshoot issues and manage their infrastructure, creating friction in the workflow.
Website owners needed a way to measure and understand how well their sites support AI agents and web crawlers for indexing and integration.
Cloudflare's Atlantis instance took 30 minutes to restart due to a Kubernetes volume permission bottleneck.
Detecting sophisticated client-side security threats like zero-day exploits while minimizing false positives in real-time across millions of requests.
How to design a public DNS resolver that prioritizes user privacy while maintaining performance and trustworthiness at scale.
Security teams were overwhelmed by the volume of raw security data across Cloudflare's platform, making it difficult to prioritize and act on vulnerabilities and threats efficiently.
Security teams lacked a unified view across multiple Cloudflare datasets, making it difficult to identify and investigate multi-vector attacks that span different attack surfaces and log sources.