The DNS Mystery: Five Factors and a Ruby Gem

“Connection reset by peer.” If you’ve run applications in Kubernetes long enough, you’ve probably seen this error. Usually it’s a service that went away, a network hiccup, something transient. You retry, it works, you move on. But what if it keeps happening? What if it only happens with DNS lookups, and only sometimes, and only in production? That’s where I found myself a few weeks ago.

The Event

Our Ruby application started throwing intermittent Errno::ECONNRESET errors. Not on HTTP requests to external APIs, but on DNS lookups. The stack trace pointed to getaddrinfo, the standard libc function for resolving hostnames. ...

March 21, 2026 · awbuana

The DNS Query That Wouldn't Stop: Debugging GKE's Hidden Ndots Problem

“Why is our DNS resolution so slow?” I remember staring at that Slack message, coffee going cold, wondering if I’d missed something obvious. We’d been running on Google Kubernetes Engine for months without issues. Then suddenly, DNS lookups were timing out. Services couldn’t reach each other. External APIs were failing.

The Discovery

A teammate noticed intermittent 5xx errors from one of our microservices. “Network issues,” they said. “Probably transient.” I wish it had been transient. ...

March 19, 2025 · awbuana