Kubernetes | awbuana

The DNS Mystery: Five Factors and a Ruby Gem

“Connection reset by peer.” If you’ve run applications in Kubernetes long enough, you’ve probably seen this error. Usually it’s a service that went away, a network hiccup, something transient. You retry, it works, you move on. But what if it keeps happening? What if it only happens with DNS lookups, and only sometimes, and only in production? That’s where I found myself a few weeks ago.

The Event

Our Ruby application started throwing intermittent Errno::ECONNRESET errors. Not on HTTP requests to external APIs, but on DNS lookups. The stack trace pointed to getaddrinfo, the standard libc function for resolving hostnames. ...

CPU Limits vs Memory Limits: When 'Survival' Means Different Things

In my previous post, I said: “CPU limits are about performance. Memory limits are about survival.” I stand by that statement, but I oversimplified what “survival” actually means depending on what kind of application you’re running. Let me break this down.

Stateless Applications: The “Easy” Case

When I wrote about memory limits being “about survival,” I was thinking primarily of stateless services. You know, the typical microservices: APIs, web servers, queue consumers. ...

kubernetes devops architecture stateful-apps reliability

The Resource Request You Think Is Saving Money Is Actually Breaking Your App

I thought I was being clever. When we migrated our services to Google Kubernetes Engine with the optimized autoscaling profile enabled, I looked at our resource specs and saw an opportunity. Our pods were requesting 100m CPU but had limits set to 1000m. Ten times the headroom! Surely we could tighten that up and save some money. So I did what seemed logical: I kept the limits high (just in case of traffic spikes) but dropped the requests even lower. 50m here, 25m there. The cluster was happy. Our costs went down. I patted myself on the back for being such a savvy engineer. ...

kubernetes devops performance gke cost-optimization

Why Your Pod Died (OOMKilled): The Difference Between CPU and Memory Limits

I used to think CPU and memory were the same. Not literally, of course. I knew one was for processing and one was for… well, memory. But when it came to Kubernetes resource limits, I treated them identically. Set a request, set a limit, let the scheduler do its thing. If the app needs more, it uses more, right? Wrong. Very, very wrong. And I learned this lesson at 2 AM on a Tuesday, when our primary API service went from “healthy” to CrashLoopBackOff in about 30 seconds. No warning. No graceful degradation. Just… dead. Then alive. Then dead again. ...

kubernetes devops memory performance troubleshooting

The DNS Query That Wouldn't Stop: Debugging GKE's Hidden Ndots Problem

“Why is our DNS resolution so slow?” I remember staring at that Slack message, coffee going cold, wondering if I’d missed something obvious. We’d been running on Google Kubernetes Engine for months without issues. Then suddenly, DNS lookups were timing out. Services couldn’t reach each other. External APIs were failing.

The Discovery

A teammate noticed intermittent 5xx errors from one of our microservices. “Network issues,” they said. “Probably transient.” I wish it had been transient. ...

kubernetes gke dns infrastructure debugging