Browse past weeks of engineering reads.
Enabling developers to deploy and scale autonomous agent workflows globally while maintaining security isolation and control over access to private backend systems.
Browser Run needed higher usage limits, better performance, and improved reliability while increasing development velocity for their browser automation service.
A partitioning change to a petabyte-scale ClickHouse cluster caused billing pipeline jobs to stall without obvious error signals in standard metrics.
Rapidly detect, investigate, and mitigate a critical Linux kernel privilege escalation vulnerability across a global edge computing fleet without impacting customers.
When DENIC published invalid DNSSEC signatures for the .de TLD, DNS resolvers like 1.1.1.1 faced a critical decision: reject all .de domain queries due to signature validation failures or serve potentially stale cached responses to maintain availability.
Cloudflare needed to make their global edge infrastructure more resilient to configuration changes and prevent widespread outages caused by unsafe deployments.
Enable multi-tenant platforms to execute millions of unique, durable workflows without incurring significant idle infrastructure costs.
Protecting IPsec communications from future quantum computing threats while maintaining current interoperability with existing infrastructure.
How to measure, analyze, and publicly report on Internet disruptions caused by geopolitical events, infrastructure attacks, and power outages in real time across global networks.
How to enable developers to build and deploy AI agents at scale across a distributed edge computing network while maintaining security and providing necessary infrastructure tools.
Traditional bot detection mechanisms are becoming ineffective as AI assistants and privacy proxies blur the distinction between legitimate users and automated abuse.
Cloudflare needed to improve request handling performance across its global network to maintain its competitive advantage over other CDNs.
Providing agents, developers, and automations with scalable, Git-compatible versioned storage that can handle tens of millions of repositories without forcing them to manage infrastructure.
How to efficiently run inference for extra-large language models on edge infrastructure while maintaining low latency and high throughput across distributed Cloudflare servers.
Enabling AI agents to send, receive, and process email natively as a multi-channel communication medium without requiring developers to build custom email infrastructure.
Third-party feature flag services introduce unacceptable latency for applications requiring sub-millisecond flag evaluation at global scale.
Building a scalable platform for deploying AI agents at the edge that can think, act, and persist state across distributed Cloudflare infrastructure.
Cloudflare Workflows needed to support higher concurrency and creation rate limits to enable durable background agents at scale.
GPU memory bandwidth constraints were limiting LLM inference efficiency across Cloudflare's distributed edge network, requiring optimization to deliver faster and cheaper inference.
How to scale a global content delivery and DDoS mitigation network to handle massive throughput (500 Tbps) while maintaining capacity to protect against record-breaking attacks.
Cloudflare needed to prepare its global infrastructure and services for the threat of quantum computing attacks on current cryptographic standards before 2029.
How to enable AI agents to operate effectively at the edge of the internet with the security, performance, and reliability characteristics of Cloudflare's existing infrastructure.
WordPress plugins pose significant security risks because they run with unrestricted access to the entire system, which calls for a safer plugin architecture that isolates untrusted code.
Magic Transit customers needed the ability to define and enforce custom DDoS mitigation logic for proprietary and non-standard UDP protocols without being limited to Cloudflare's pre-built detection rules.
How to design a public DNS resolver that prioritizes user privacy while maintaining performance and trustworthiness at scale.
CDN cache systems were designed for human traffic patterns but struggle with the distinct access patterns of AI bot traffic, which now represents over 10 billion requests per week and threatens cache efficiency.
The Cloudflare One SASE client's Proxy Mode relied on user-space TCP stacks for tunneling traffic, introducing significant overhead that limited throughput and increased latency for end users.
Enterprise SASE (Secure Access Service Edge) migrations traditionally take 18+ months due to architectural complexity, requiring organizations to integrate networking and security across global infrastructure.
Tunnel layering in Cloudflare's WARP/One client caused MTU mismatches, leading to silently dropped oversized packets that degraded connectivity and resilience.
Enterprises connecting multiple private networks via tunnels frequently encounter overlapping IP address ranges (e.g., multiple sites using 10.0.0.0/8), leaving traditional routing tables unable to determine which tunnel should receive return traffic.
Cloudflare's existing server fleet could not keep pace with rapidly growing global traffic demands, requiring a new generation of hardware with significantly higher compute and network throughput.
Customers needed precise control over where their data is processed geographically to meet diverse compliance requirements (e.g., GDPR, data sovereignty laws), but existing pre-defined regional options were too coarse-grained to cover all regulatory and performance needs.
Cloudflare needed to significantly increase edge compute throughput per server but faced a tradeoff where high-core-count CPUs came with smaller per-core L3 cache, risking latency penalties for cache-dependent workloads.
Running large AI models for agent workloads on edge infrastructure was cost-prohibitive and required significant inference stack optimization to serve models like Kimi K2.5 efficiently at scale.