AWS

How Generali Malaysia optimizes operations with Amazon EKS

Generali Malaysia needed to optimize Kubernetes operations on AWS while reducing operational overhead, managing costs, and improving security posture.

distributed-systems security
4 min
Airbnb

What COVID did to our forecasting models (and what we built to handle the next shock)

Airbnb needed forecasting models that remain accurate during sudden market shocks like a global pandemic, when historical data no longer predicts future outcomes.

ml-systems observability
5 min
Cloudflare

A one-line Kubernetes fix that saved 600 hours a year

Cloudflare's Atlantis instance took 30 minutes to restart due to a Kubernetes volume permission bottleneck.

observability storage-systems
4 min
Dropbox

Reducing our monorepo size to improve developer velocity

Monorepo growth was causing increased build times, slower dependency resolution, and reduced developer velocity as the codebase expanded.

general observability
3 min
AWS

AI-powered event response for Amazon EKS

Responding to operational events in Amazon EKS clusters is often manual and slow and requires deep expertise, making it difficult to handle incidents at scale across complex Kubernetes environments.

observability ml-systems
3 min
Airbnb

From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership

Airbnb's reliance on multiple third-party observability vendors resulted in inconsistent data, fragmented developer experiences, and limitations in cost-effectiveness and reliability at their scale.

observability microservices
5 min
Cloudflare

Building a security overview dashboard for actionable insights

Security teams were overwhelmed by the volume of raw security data across Cloudflare's platform, making it difficult to prioritize and act on vulnerabilities and threats efficiently.

security observability
3 min
Cloudflare

Investigating multi-vector attacks in Log Explorer

Security teams lacked a unified view across multiple Cloudflare datasets, making it difficult to identify and investigate multi-vector attacks that span different attack surfaces and log sources.

observability security
3 min
Airbnb

It Wasn’t a Culture Problem: Upleveling Alert Development at Airbnb

Airbnb's Observability as Code alert development process had development cycles that stretched to weeks because of cumbersome code review workflows, slowing engineers' ability to create and iterate on alerts at scale across thousands of services.

observability microservices
5 min
AWS

6,000 AWS accounts, three people, one platform: Lessons learned

Managing 6,000 AWS accounts for a multi-tenant serverless SaaS platform with only three people created massive operational challenges around automation, observability, and cost management at scale.

distributed-systems microservices
4 min
Meta

The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It

Agentic (AI-driven) software development produces and ships code so fast that traditional testing frameworks cannot keep pace, leaving bugs uncaught as they land in rapidly evolving codebases.

ml-systems observability
5 min
AWS

Architecting conversational observability for cloud applications

Diagnosing and resolving issues in complex Kubernetes clusters is slow and requires expert knowledge, leading to high Mean Time to Recovery (MTTR) and heavy reliance on specialized engineers for root cause analysis.

observability ml-systems
4 min