DeployU

Container DNS resolution is failing intermittently. Troubleshoot and fix it.


The Scenario

Your microservices are experiencing intermittent failures:

$ docker logs api-service
Error: getaddrinfo EAI_AGAIN user-service
Error: getaddrinfo ENOTFOUND payment-service
Connection to redis failed: EHOSTUNREACH

# Sometimes it works, sometimes it fails
$ docker exec api-service ping user-service
PING user-service (172.18.0.3): 56 data bytes
64 bytes from 172.18.0.3: seq=0 ttl=64 time=0.123 ms

# A minute later...
$ docker exec api-service ping user-service
ping: bad address 'user-service'

The failures are random - sometimes DNS works, sometimes it doesn’t. This is causing cascading failures across your services.

The Challenge

Diagnose the root cause of intermittent DNS resolution failures and implement a robust fix. Explain how Docker DNS works and common failure modes.

Wrong Approach

A junior engineer might restart all containers hoping it fixes itself, add retry logic everywhere to mask the issue, switch to using IP addresses instead of hostnames, or increase connection timeouts. These fail because restarts provide temporary relief at best, excessive retries add latency and complexity, IPs change on container restart, and timeouts don't fix DNS issues.

Right Approach

A senior engineer investigates the DNS infrastructure: check Docker's embedded DNS server, examine /etc/resolv.conf in containers, verify network configuration, check for DNS cache issues, and look for resource exhaustion. Common causes include ndots configuration, DNS server overload, or network driver issues. The fix addresses the root cause rather than working around it.

Understanding Docker DNS

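Container DNS Resolution Flow (user-defined networks):
1. App requests "user-service"
2. Query goes to Docker's embedded DNS (127.0.0.11)
3. Docker checks container names, network aliases, and service names on the container's networks
4. If not found, the query is forwarded to the upstream resolvers (the host's, or those set with --dns)
5. Container-name answers are returned with a TTL of 600s; any caching happens in the client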
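
The lookup order above can be modeled in a few lines of JavaScript. This is a simplified sketch of the decision order only, not Docker's implementation; resolveLikeDocker, the records map, and the fakeUpstream callback are hypothetical names used for illustration, and the addresses are example values.

```javascript
// Simplified model of the lookup order on a user-defined network.
// networkRecords: Map of container name or alias -> IP (hypothetical data).
// upstream: function standing in for the forwarded query.
function resolveLikeDocker(name, networkRecords, upstream) {
  // Steps 2-3: names known on the container's networks are answered locally.
  if (networkRecords.has(name)) {
    return { address: networkRecords.get(name), source: 'embedded-dns' };
  }
  // Step 4: everything else is forwarded to the upstream resolver.
  const address = upstream(name);
  if (address === undefined) {
    // Mirrors the ENOTFOUND seen in the logs when nobody knows the name.
    const err = new Error(`getaddrinfo ENOTFOUND ${name}`);
    err.code = 'ENOTFOUND';
    throw err;
  }
  return { address, source: 'upstream' };
}

// Example data: one container on the network, one external name.
const records = new Map([['user-service', '172.18.0.3']]);
const fakeUpstream = (name) => (name === 'example.com' ? '203.0.113.10' : undefined);

console.log(resolveLikeDocker('user-service', records, fakeUpstream));
console.log(resolveLikeDocker('example.com', records, fakeUpstream));
```

In this model, a service that is not attached to the same network has no entry in the map, so its name falls through to the upstream resolver and fails there.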

Step 1: Diagnose DNS Configuration

# Check container's DNS configuration
docker exec api-service cat /etc/resolv.conf

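# Expected output on a user-defined network:
nameserver 127.0.0.11  # Docker's embedded DNS
options ndots:0

# If you see external DNS servers instead, the container is probably on the
# default bridge network, which has no embedded DNS and cannot resolve
# container names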
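
The same check can be scripted. The sketch below parses resolv.conf text and flags the two conditions discussed in this walkthrough; parseResolvConf and findWarnings are hypothetical helper names, and the warning rules are assumptions chosen for this scenario rather than a general-purpose linter.

```javascript
// Parse resolv.conf text into nameservers, search domains, and options.
function parseResolvConf(text) {
  const result = { nameservers: [], search: [], options: {} };
  for (const rawLine of text.split('\n')) {
    // Drop comments (# or ;) and surrounding whitespace.
    const line = rawLine.replace(/[#;].*$/, '').trim();
    if (!line) continue;
    const [keyword, ...values] = line.split(/\s+/);
    if (keyword === 'nameserver') {
      result.nameservers.push(values[0]);
    } else if (keyword === 'search') {
      result.search = values;
    } else if (keyword === 'options') {
      for (const opt of values) {
        const [key, value] = opt.split(':');
        result.options[key] = value === undefined ? true : Number(value);
      }
    }
  }
  return result;
}

// Flag settings that commonly explain failed or slow lookups in containers.
function findWarnings(conf) {
  const warnings = [];
  if (!conf.nameservers.includes('127.0.0.11')) {
    warnings.push('embedded DNS (127.0.0.11) is not configured');
  }
  // The resolv.conf default for ndots is 1 when the option is absent.
  if ((conf.options.ndots ?? 1) > 1) {
    warnings.push(`ndots is ${conf.options.ndots}`);
  }
  return warnings;
}

const sample = "nameserver 127.0.0.11  # Docker's embedded DNS\noptions ndots:0\n";
const conf = parseResolvConf(sample);
console.log(conf, findWarnings(conf));
```

Inside a container, the input could come from fs.readFileSync('/etc/resolv.conf', 'utf8') instead of the sample string.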

Step 2: Test DNS Resolution

# Install dig/nslookup if not available (Alpine-based images; Debian/Ubuntu images use apt-get install dnsutils)
docker exec api-service apk add --no-cache bind-tools

# Test internal service resolution
docker exec api-service nslookup user-service
docker exec api-service dig user-service

# Test external resolution
docker exec api-service nslookup google.com

# Check DNS response time (run through the container's shell, since time is usually a shell keyword)
docker exec api-service sh -c 'time nslookup user-service'
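
A single nslookup can easily land in a good moment and miss an intermittent failure. The sketch below repeats the lookup and summarizes failures and latency; it uses Node's dns.promises.lookup, which goes through getaddrinfo like the failing calls in the logs. sampleLookups and the sample.js file name are hypothetical, and the lookup function is injectable so the sampler can be tried without a network.

```javascript
const dns = require('dns');

// Run `count` lookups in sequence and summarize failures and latency.
async function sampleLookups(hostname, count, lookupFn = dns.promises.lookup) {
  const stats = { ok: 0, failed: 0, maxMs: 0, errors: {} };
  for (let i = 0; i < count; i++) {
    const start = process.hrtime.bigint();
    try {
      await lookupFn(hostname);
      stats.ok++;
    } catch (err) {
      stats.failed++;
      // Count failures per error code (EAI_AGAIN, ENOTFOUND, ...).
      const code = err.code || 'UNKNOWN';
      stats.errors[code] = (stats.errors[code] || 0) + 1;
    }
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    stats.maxMs = Math.max(stats.maxMs, elapsedMs);
  }
  return stats;
}

// Usage: node sample.js user-service 50
const [targetHost, attempts = '20'] = process.argv.slice(2);
if (targetHost) {
  sampleLookups(targetHost, Number(attempts)).then((stats) => console.log(stats));
}
```

A non-zero failed count with mostly EAI_AGAIN points at a resolver that is timing out, while ENOTFOUND on every attempt points at network membership.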

Step 3: Common Issues and Fixes

Issue 1: ndots Causing Slow Resolution

# Check ndots setting
docker exec api-service cat /etc/resolv.conf
# options ndots:5  <- This is problematic!

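With ndots:5, any name with fewer than five dots is tried against each search domain in resolv.conf before it is tried as-is. With a Kubernetes-style search list (used here as an example), a query for “user-service” tries:

  1. user-service.default.svc.cluster.local (NXDOMAIN or timeout)
  2. user-service.svc.cluster.local (NXDOMAIN or timeout)
  3. user-service.cluster.local (NXDOMAIN or timeout)
  4. user-service.localdomain (NXDOMAIN or timeout)
  5. user-service (finally works!)

Each failed candidate costs at least one round trip to the DNS server.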
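
The ordering rule can be written down directly. This is a minimal sketch of how a glibc-style stub resolver orders its candidate names; other resolvers (musl, for example) differ in details, and candidateNames and the search list are illustrative assumptions.

```javascript
// Order in which a glibc-style stub resolver tries names.
// Assumption: names ending in "." are treated as fully qualified.
function candidateNames(name, ndots, searchDomains) {
  if (name.endsWith('.')) return [name];
  const dots = (name.match(/\./g) || []).length;
  const expanded = searchDomains.map((domain) => `${name}.${domain}`);
  // Enough dots: try the name as-is first, then fall back to the search list.
  // Too few dots: walk the search list first and try the bare name last.
  return dots >= ndots ? [name, ...expanded] : [...expanded, name];
}

const search = ['default.svc.cluster.local', 'svc.cluster.local', 'cluster.local'];
console.log(candidateNames('user-service', 5, search));
console.log(candidateNames('user-service', 0, search));
console.log(candidateNames('api.example.com', 1, search));
```

Note that a single-label name such as user-service has zero dots, so it only moves to the front of the list with ndots:0; ndots:1 helps names that contain at least one dot.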

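Fix: lower ndots so that names are tried as-is sooner. ndots:1 tries any name that contains a dot as-is first; Docker's default on user-defined networks, ndots:0, does the same for single-label names such as user-service:

# docker-compose.yml
services:
  api:
    dns_opt:
      - ndots:1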

Issue 2: Docker DNS Server Overwhelmed

# Check Docker daemon logs
sudo journalctl -u docker.service | grep -i dns

# Watch network connect/disconnect events (streams until interrupted)
docker events --filter 'type=network'

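Fix: Docker's embedded DNS has no tunable cache. To take load off the upstream resolvers, add a local caching forwarder and point the service at it by IP address (the dns option does not accept service names):

services:
  dnsmasq:
    image: andyshinn/dnsmasq
    cap_add:
      - NET_ADMIN
    command: --cache-size=10000 --log-facility=-
    networks:
      backend:
        ipv4_address: 172.28.0.53  # example static address

  api:
    networks:
      - backend
    dns:
      - 172.28.0.53  # must be an IP address; used as the upstream for external names
networks:
  backend:
    ipam:
      config:
        - subnet: 172.28.0.0/16  # example subnet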

Issue 3: Network Driver Issues

# Check network details
docker network inspect app-network

# Compare drivers, subnets, and MTU options across networks
docker network ls
docker network inspect bridge

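Fix: Recreate the network with explicit configuration. An MTU of 1450 suits hosts whose underlying network (overlay, VPN, some cloud networks) leaves less than the default 1500 bytes available:

# Remove and recreate (stop or disconnect attached containers first)
docker network rm app-network
docker network create \
  --driver bridge \
  --opt com.docker.network.driver.mtu=1450 \
  app-network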

Step 4: Implement Application-Level Resilience

// DNS-aware retry logic
const dns = require('dns');
const { promisify } = require('util');
const resolve4 = promisify(dns.resolve4);

async function resolveWithRetry(hostname, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      const addresses = await resolve4(hostname);
      return addresses[0];
    } catch (err) {
      if (i === maxRetries - 1) throw err;
      // Exponential backoff
      await new Promise(r => setTimeout(r, 100 * Math.pow(2, i)));
    }
  }
}

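// Connection pool that picks up DNS changes
const mysql = require('mysql2');

// mysql and mysql2 have no DNS-refresh option. The hostname is resolved
// each time the pool opens a new connection, so recycling idle connections
// keeps the pool from holding on to a stale address.
const pool = mysql.createPool({
  host: 'database-service',
  connectionLimit: 10,
  maxIdle: 10,          // Idle connections kept in the pool (mysql2 3.x)
  idleTimeout: 30000,   // Close idle connections after 30 seconds
  connectTimeout: 5000  // Fail fast when the address is unreachable
});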
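
Retrying every failure hides permanent problems such as a service on the wrong network. A possible refinement, sketched below, retries only error codes that usually indicate a temporary resolver problem. The code set is an assumption and a starting point, not an exhaustive list; isTransientDnsError and retryTransient are hypothetical helper names.

```javascript
// Error codes that usually indicate a temporary resolver problem.
// Assumption: ENOTFOUND is treated as permanent and is not retried.
const TRANSIENT_DNS_CODES = new Set(['EAI_AGAIN', 'ETIMEOUT', 'ESERVFAIL', 'ECONNREFUSED']);

function isTransientDnsError(err) {
  return Boolean(err) && TRANSIENT_DNS_CODES.has(err.code);
}

// Run `operation` up to maxRetries times, backing off exponentially,
// but rethrow immediately on errors that a retry cannot fix.
async function retryTransient(operation, maxRetries = 3, baseDelayMs = 100) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (!isTransientDnsError(err) || attempt >= maxRetries - 1) throw err;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```

Usage could look like retryTransient(() => require('dns').promises.resolve4('user-service')), which keeps the backoff behaviour while letting ENOTFOUND surface on the first attempt.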

Step 5: Docker Compose Best Practices

# Compose Specification format (Docker Compose v2); the top-level version key is obsolete

services:
  api:
    image: api:latest
    networks:
      - backend
    dns_opt:
      - ndots:1
      - timeout:2
      - attempts:3
    depends_on:
      user-service:
        condition: service_healthy

  user-service:
    image: user-service:latest
    networks:
      - backend
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3

networks:
  backend:
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: 1450
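
To see why the timeout and attempts options matter, here is a rough upper bound on how long one failing lookup can block. This is a back-of-the-envelope sketch under stated assumptions (a single nameserver, every query timing out, a glibc-style resolver waiting the full timeout per attempt); real resolvers may stop walking the search list after the first timeout, so actual waits are often shorter. worstCaseLookupSeconds is a hypothetical helper.

```javascript
// Rough upper bound, in seconds, on how long one failing lookup can block.
// candidates = number of names tried (search-list entries plus the bare name).
function worstCaseLookupSeconds({ timeout, attempts, candidates }) {
  return timeout * attempts * candidates;
}

// resolv.conf defaults (timeout:5, attempts:2) with a four-entry search list
// walked before the bare name: five candidate names in total.
console.log(worstCaseLookupSeconds({ timeout: 5, attempts: 2, candidates: 5 })); // 50

// The Compose settings above (timeout:2, attempts:3) with a single candidate.
console.log(worstCaseLookupSeconds({ timeout: 2, attempts: 3, candidates: 1 })); // 6
```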

DNS Debugging Commands

# Check which DNS server the container uses
docker exec api-service cat /etc/resolv.conf

# Test resolution
docker exec api-service nslookup user-service

# Check network connectivity
docker exec api-service ping -c 3 user-service

# Trace DNS queries leaving the container (requires tcpdump in the image)
docker exec api-service tcpdump -n -i eth0 port 53

# Check container's network
docker inspect api-service --format='{{json .NetworkSettings.Networks}}'

Common DNS Issues

Symptom                 Cause                         Fix
EAI_AGAIN               DNS server timeout            Increase timeout, add retries
ENOTFOUND               Service not on same network   Check network membership
Slow resolution         High ndots value              Set ndots:1
Intermittent failures   DNS cache issues              Restart Docker daemon
External DNS fails      Network isolation             Check NAT and firewall

Practice Question

Why might a container fail to resolve 'user-service' hostname even though both containers are running?