Browse past weeks of engineering reads.
How to design systems that can recover from ransomware and destructive cyberattacks when backups, credentials, and infrastructure components have been compromised.
Security teams needed visibility and compliance monitoring of Claude Enterprise API usage across their organization without leaving their existing security infrastructure.
Enabling developers to deploy and scale autonomous agent workflows globally while maintaining security isolation and control over access to private backend systems.
Determining whether security-focused LLMs can effectively identify vulnerabilities in live production infrastructure code at scale.
Organizations need to securely build, deploy, and govern autonomous AI agents at enterprise scale as the industry transitions from experimental LLMs to production agentic AI systems.
Deploying and managing AI agents at scale in production requires infrastructure for state management, security governance, and complex workflow orchestration that goes beyond demo implementations.
Building safe, reliable, and autonomous agents that can act independently across multiple enterprise systems while maintaining security, governance, and reliability guardrails.
Organizations need to secure their AI systems and infrastructure against emerging AI-era threats while maintaining the ability to leverage AI's potential at scale.
Developers using Google's AI APIs (Gemini and Google APIs) are exposing their API keys to unauthorized access, leading to account compromise, token theft, and service abuse.
Developers needed a unified, secure way to build AI agents locally and deploy them to Google Cloud with standardized protocols and tooling.
Enabling seamless connectivity, governance, and security across multi-agent AI systems and core applications distributed globally at planet scale.
Building a multi-tenant architecture that isolates tenants without requiring separate AWS accounts while maintaining stateful service deployments.
Organizations must determine whether to operate under a single AWS organization or split into multiple organizations based on their operational, security, and scaling requirements.
Streaming CloudWatch metrics to internal VPC-based OpenTelemetry collectors without exposing them to the internet.
CUBIC congestion control algorithm's congestion window was becoming pinned at minimum values in QUIC, causing severe performance degradation due to incorrect idle period detection.
Ensuring end-to-end encrypted messages and conversation history survive device loss, device switches, and extended offline periods without compromising encryption guarantees.
Rapidly detect, investigate, and mitigate a critical Linux kernel privilege escalation vulnerability across a global edge computing fleet without impacting customers.
When DENIC published invalid DNSSEC signatures for the .de TLD, DNS resolvers like 1.1.1.1 faced a critical decision: reject all .de domain queries due to signature validation failures or serve potentially stale cached responses to maintain availability.
Detecting and preventing first-party fraud at scale across a payment network where legitimate users abuse policies through multiple accounts, free trial cycling, and refund exploitation.
Enable autonomous agents to programmatically access payment instruments and execute transactions without requiring human intervention or direct card/account access.
Detecting and preventing fraudulent behavior in free trial signups, such as repeated trial abuse and missed cancellations, at scale with high accuracy.
Understanding and optimizing the checkout conversion funnel across diverse ecommerce businesses to identify what drives successful transactions in modern online payment flows.
Detecting and preventing sophisticated fraud attacks while minimizing friction for legitimate users in payment systems.
How to enable autonomous agents to programmatically create Cloudflare accounts, purchase domains, and deploy infrastructure without manual dashboard interaction or credential handling.
Cloudflare needed to make their global edge infrastructure more resilient to configuration changes and prevent widespread outages caused by unsafe deployments.
Protecting IPsec communications from future quantum computing threats while maintaining current interoperability with existing infrastructure.
How to measure, analyze, and publicly report on Internet disruptions caused by geopolitical events, infrastructure attacks, and power outages in real-time across global networks.
How to enable end-to-end encrypted backups for messaging applications while ensuring recovery codes remain inaccessible to Meta, cloud providers, and other third parties.
Traditional rule-based KYC (Know Your Customer) systems lack the autonomous decision-making capability and real-time validation speed needed for modern financial services compliance operations.
Enable multiple independent organizations to securely exchange Product Carbon Footprint (PCF) data within a shared data space while maintaining data sovereignty and tenant isolation.
How to enable developers to build and deploy AI agents at scale across a distributed edge computing network while maintaining security and providing necessary infrastructure tools.
Rust panics in Cloudflare Workers were fatal and poisoned the entire worker instance, making applications unreliable when unhandled errors occurred.
Traditional bot detection mechanisms are becoming ineffective as AI assistants and privacy proxies blur the distinction between legitimate users and automated abuse.
Cloudflare needed to scale code review processes across their engineering organization while maintaining code quality and security standards without overwhelming human reviewers.
Cloudflare needed to build an internal AI engineering stack that could handle massive scale (20 million requests, 241 billion tokens) while dogfooding their own platform products.
How can Airbnb enable social features and community connections while maintaining strict user privacy and giving users control over their personal data sharing?
Cloudflare needed to improve request handling performance across its global network to maintain competitive advantage over other CDNs.
Users had to manually navigate multiple tabs and interfaces within the Cloudflare dashboard to troubleshoot issues and manage their infrastructure, creating friction in the workflow.
Website owners needed a way to measure and understand how well their sites support AI agents and web crawlers for indexing and integration.
AI crawlers were ingesting deprecated and non-canonical content despite soft directives like robots.txt, requiring a way to enforce canonical versions without modifying origin infrastructure.
Developers lack effective mechanisms to prevent unauthorized access when API credentials are accidentally exposed or compromised.
Meta needed to migrate its infrastructure and systems to post-quantum cryptography standards before quantum computers could break existing encryption schemes.
How to scale a global content delivery and DDoS mitigation network to handle massive throughput (500 Tbps) while maintaining capacity to protect against record-breaking attacks.
Cloudflare needed to prepare its global infrastructure and services for the threat of quantum computing attacks on current cryptographic standards before 2029.
Cloudflare needed to automatically generate malware trigger packets for BPF bytecode analysis, which previously required hours of manual work.
Cloudflare needed to enable enterprise customers to manage multiple accounts and resources under a unified organizational structure with centralized authorization and access control.
How to enable AI agents to operate effectively at the edge of the internet with the security, performance, and reliability characteristics of Cloudflare's existing infrastructure.
Generali Malaysia needed to optimize Kubernetes operations on AWS while reducing operational overhead, managing costs, and improving security posture.
Organizations need a streamlined way to protect and recover entire AWS workloads across multiple layers (data, compute, infrastructure, networking, and configuration) in the event of a disaster.
This article does not describe a specific engineering problem or technical solution.
Detecting sophisticated client-side security threats like zero-day exploits while minimizing false positives in real-time across millions of requests.
WordPress plugins pose significant security risks because they run with unrestricted access to the entire system, requiring a safer plugin architecture that isolates untrusted code.
Magic Transit customers needed the ability to define and enforce custom DDoS mitigation logic for proprietary and non-standard UDP protocols without being limited to Cloudflare's pre-built detection rules.
How to design a public DNS resolver that prioritizes user privacy while maintaining performance and trustworthiness at scale.
How to safely execute untrusted AI-generated code with minimal latency and resource overhead.
Securing thousands of Kubernetes workloads across a large-scale infrastructure requires automated and consistent security policies.
Agricultural supply chains (cotton/food) lack end-to-end traceability, making it difficult to verify sustainability claims, track climate impact, and ensure circularity across complex multi-party value chains.
Convera needed to implement fine-grained authorization for their API platform, where coarse-grained access controls were insufficient to manage complex permission requirements across API resources and actions.
Organizations struggle to design well-architected cloud systems that balance cost optimization, security, reliability, and performance efficiency across increasingly complex AWS environments including AI-powered workloads.
Securing Amazon Elastic VMware Service (EVS) environments requires centralized traffic inspection across multiple VPCs, on-premises data centers, and internet egress points, which is complex to architect and implement.
Organizations operating under European digital sovereignty requirements need resilient failover capabilities, but regulatory constraints on data residency and governance make cross-partition (sovereign-to-commercial cloud) failover architecturally complex.
Organizations struggle to discover and secure AI-powered applications across their infrastructure, especially shadow AI deployments that teams spin up without central oversight, creating security blind spots.
Standard defensive security tools miss logic flaws and vulnerabilities in APIs because they lack understanding of stateful API interactions and business logic flows.
Traditional WAFs force a trade-off between logging (risking missed attacks) and blocking (risking false positives), requiring extensive manual tuning to balance security coverage with availability.
Traditional bot-blocking approaches are insufficient for preventing account abuse (e.g., credential stuffing, fake account creation) because sophisticated attacks increasingly involve human-like behavior or actual humans, bypassing conventional bot detection.
Security teams were overwhelmed by the volume of raw security data across Cloudflare's platform, making it difficult to prioritize and act on vulnerabilities and threats efficiently.
Enterprise SASE (Secure Access Service Edge) migrations traditionally take 18+ months due to architectural complexity, requiring organizations to integrate networking and security across global infrastructure.
Cloudflare's open-source Pingora proxy had request smuggling vulnerabilities when deployed as an ingress proxy, allowing attackers to exploit HTTP parsing discrepancies to bypass security controls and route malicious requests.
Organizations struggle to migrate from legacy network security architectures to modern SASE (Secure Access Service Edge) solutions, facing risks from accumulated technical debt and complex dependencies in their existing infrastructure.
Organizations face fragmented data security across endpoints, network traffic, cloud applications, and AI prompts, making it difficult to enforce consistent data loss prevention (DLP) policies as data flows through diverse channels including RDP sessions and AI copilots.
Enterprises connecting multiple private networks via tunnels frequently encounter overlapping IP address ranges (e.g., multiple sites using 10.0.0.0/8), making traditional routing tables unable to determine which tunnel should receive return traffic.
Customers needed precise control over where their data is processed geographically to meet diverse compliance requirements (e.g., GDPR, data sovereignty laws), but existing pre-defined regional options were too coarse-grained to cover all regulatory and performance needs.
Security teams lacked a unified view across multiple Cloudflare datasets, making it difficult to identify and investigate multi-vector attacks that span different attack surfaces and log sources.
Italy's 'Piracy Shield' system forces Internet infrastructure providers like Cloudflare to block content at the network level without proper oversight or due process, leading to disproportionate overblocking of legitimate content.
Organizations struggle with Internet-facing blind spots in their attack surface, lacking continuous visibility into security gaps and risk exposures across their external-facing assets.
Messenger needed to protect user privacy when clicking links in chats while still detecting and warning users about malicious URLs, creating a tension between link safety scanning and end-to-end privacy.
Updating security-related APIs across millions of lines of code and thousands of engineers is extremely difficult at scale, especially when a single class of mobile vulnerability can be replicated across hundreds of locations in an Android codebase.