Kubernetes at Scale: Hard Lessons from Running 10,000+ Pods

February 22, 2025

Resource Requests and Limits

Resource requests are the foundation of Kubernetes scheduling. The scheduler uses requests, not actual utilization, to make placement decisions. Inaccurate requests cascade into scheduling failures, bin-packing inefficiency, and node-level resource contention.

Scheduling Decision Flow
========================

Pod enters scheduling queue
         |
         v
+------------------------------+
|  Filtering Phase             |
|  "Can this node run the pod?"|
|                              |
|  Node allocatable:  3.7 CPU  |
|  Already reserved:  2.5 CPU  |
|  Remaining:         1.2 CPU  |
|                              |
|  Pod requests 2.0 CPU?       |
|  1.2 < 2.0 --> REJECT        |
|                              |
|  Pod requests 0.5 CPU?       |
|  1.2 >= 0.5 --> ACCEPT       |
+------------------------------+
         |
         v
+------------------------------+
|  Scoring Phase               |
|  "Which node is best?"       |
|                              |
|  LeastRequested score        |
|  BalancedAllocation score    |
|  TopologySpread score        |
|  NodeAffinity score          |
|                              |
|  Weighted sum --> winner     |
+------------------------------+
         |
         v
   Pod bound to node

A basic resource specification:

resources:
  requests:
    cpu: 500m      # Scheduler reserves this amount on the node
    memory: 256Mi  # Scheduler reserves this amount on the node
  limits:
    cpu: 1000m     # CFS bandwidth control enforced here
    memory: 512Mi  # OOM killer enforced here

CFS Throttling and CPU Limits

The Linux Completely Fair Scheduler (CFS) bandwidth controller operates on a configurable period, defaulting to 100ms. When a container has a CPU limit of 1000m, it receives a quota of 100ms of CPU time per 100ms period. If the container exhausts this quota in a burst (garbage collection, request batching, JIT compilation), the kernel throttles all threads in that cgroup for the remainder of the period. This occurs regardless of available CPU capacity on the node.

CFS Throttling Mechanism (100ms default period)
================================================

CPU limit: 1000m = 100ms quota per 100ms period

Period 1             Period 2             Period 3
|---- 100ms ------| |---- 100ms ------| |---- 100ms ------|

[################]   [#############...]   [###############.]
 100ms consumed       80ms consumed        95ms consumed
 THROTTLED            OK, 20ms unused      OK, 5ms unused
      |
      +-- Quota exhausted before the period ended (parallel threads
          burn quota faster than wall-clock time). All threads frozen
          until the next period boundary. Node may have 6 idle cores.
          Does not matter. Quota is per-cgroup, not per-node.

Impact on request latency:

  Request arrives at t=92ms into period
  Remaining quota: 8ms
  Request needs: 25ms of CPU
  Result:
    t=92ms:  Start processing (8ms quota remains)
    t=100ms: Quota exhausted, thread frozen
    t=100ms: New period begins, 100ms quota refreshed
    t=117ms: Processing completes (8ms + 17ms = 25ms CPU)
    Total wall time: 25ms elapsed
    Without throttling: 25ms elapsed
    Visible penalty: 0ms (got lucky, small request)

  A worse case: GC pause starts at t=5ms, burns 95ms of quota
    t=100ms: Quota exhausted, thread frozen
    t=100ms: New period, another 40ms needed
    t=140ms: GC completes
    Application thread resumes, processes request
    User-visible latency: extra 40ms+ added to any
    request that overlapped with the GC pause

When CFS throttling kicks in on a latency-sensitive pod, standard monitoring typically shows low average CPU utilization. The throttling is invisible unless specifically measured.

# Direct cgroup inspection (cgroup v2)
kubectl exec -it <pod> -- cat /sys/fs/cgroup/cpu.stat
# Fields of interest:
#   nr_throttled   - number of times the cgroup was throttled
#   throttled_usec - total time spent throttled in microseconds
 
# cgroup v1 (older kernels)
kubectl exec -it <pod> -- cat /sys/fs/cgroup/cpu/cpu.stat
# Fields: nr_throttled, throttled_time
 
# Prometheus query for throttle ratio
# Values above 0.05 (5%) indicate a problem
rate(container_cpu_cfs_throttled_periods_total{container="app"}[5m])
  /
rate(container_cpu_cfs_periods_total{container="app"}[5m])
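
Where the Prometheus Operator is in use, the same ratio can be wired into an alert. A minimal sketch, assuming the operator's PrometheusRule CRD; the threshold and names are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cfs-throttling
  namespace: monitoring
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: HighCPUThrottleRatio
          expr: |
            rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
              /
            rate(container_cpu_cfs_periods_total{container!=""}[5m]) > 0.05
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} throttled in >5% of CFS periods"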

Resource Strategy Comparison

Strategy                   CPU Requests  CPU Limits  Memory Limits  QoS Class   Use Case
Guaranteed                 1000m         1000m       512Mi (= req)  Guaranteed  Latency-critical, databases
Burstable, no CPU limit    500m          (none)      512Mi          Burstable   API servers, web services
Burstable, with CPU limit  500m          1000m       512Mi          Burstable   Background workers, batch
BestEffort                 (none)        (none)      (none)         BestEffort  Development only, never production

For latency-sensitive workloads, omitting CPU limits prevents CFS throttling while still providing scheduling guarantees through requests:

# Recommended for latency-sensitive services
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    # CPU limit intentionally omitted to prevent CFS throttling
    memory: 512Mi  # Memory limits should always be set

Memory limits must always be set. Exceeding a memory limit triggers the OOM killer, which sends SIGKILL (not SIGTERM). No graceful shutdown occurs. Set memory limits to 1.3x-2x the observed steady-state working set.
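
A LimitRange can enforce this rule at the namespace level by applying a default memory limit to any container that omits one. A minimal sketch; the values are illustrative, and only memory is defaulted because a defaulted CPU limit would reintroduce throttling:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-limits
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi   # applied when requests.memory is omitted
      default:
        memory: 512Mi   # applied when limits.memory is omitted
      max:
        memory: 4Gi     # containers requesting more than this are rejected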

Determining Correct Request Values

Use Vertical Pod Autoscaler in recommendation-only mode or query Prometheus directly:

# CPU request target: P95 usage over 7 days
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{
    container="app", namespace="production"
  }[5m])[7d:]
)

# Memory request target: P99 usage over 7 days
# Higher percentile because memory overshoot triggers OOM kill
quantile_over_time(0.99,
  container_memory_working_set_bytes{
    container="app", namespace="production"
  }[7d:]
)

VPA in auto mode should not be used for production workloads. It restarts pods to apply new resource values, which can cause connection resets during in-flight requests.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Off"  # Recommendation only, no automatic resizing

Horizontal Pod Autoscaler

Scaling Latency

The time between a traffic increase and new capacity becoming available is typically 60-120 seconds. This delay is composed of multiple sequential stages:

HPA Scaling Timeline
=====================

t=0s    Traffic spike arrives
        Existing pods absorb load, CPU rises
        |
t=15s   cAdvisor scrapes container metrics (default 15s interval)
        |
t=30s   metrics-server aggregates data (15s window)
        |
t=45s   HPA controller evaluates metrics (default 15s sync period)
        HPA calculates desired replica count
        HPA issues scale request to Deployment
        |
t=47s   Scheduler assigns new pods to nodes
        |
t=47s   Kubelet begins image pull
 to     Image size matters:
t=60s     50MB Alpine-based: ~3s pull
          200MB JDK-based:   ~8s pull
          2GB full OS + deps: 30s+ pull
        |
t=60s   Container runtime starts container
        Application initialization begins
          Go binary:    ~1s
          Node.js:      ~3s
          JVM (Spring): 15-45s
        |
t=65s   Readiness probe begins checking
 to     initialDelaySeconds + periodSeconds * successThreshold
t=80s   |
        v
t=80s   Pod passes readiness, added to Endpoints object
        kube-proxy/IPVS rules updated on all nodes
        |
t=85s   New pod begins receiving traffic
        =====================================
        85 seconds minimum from spike to relief

HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 10
  maxReplicas: 100
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up without delay
      policies:
        - type: Percent
          value: 100                    # Allow doubling capacity per step
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300   # 5 minute cooldown before scale-down
      policies:
        - type: Percent
          value: 10                     # Remove at most 10% per period
          periodSeconds: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60        # Scale at 60%, not 80%

Key parameters:

  • minReplicas should be set to handle baseline traffic plus 50% headroom. These pods absorb the initial spike during the 60-120 second autoscaling delay.
  • Scale-up should be fast (no stabilization window). Scale-down should be slow (5+ minute stabilization) to prevent flapping.
  • Target utilization at 60% provides headroom. At an 80% target, a 30% traffic increase pushes pods to 104% of their requested CPU before the HPA reacts.

Cascading Failure Under Autoscaling

When pods become overloaded, readiness probes can fail, causing Kubernetes to remove them from service endpoints. This increases load on remaining pods, triggering further readiness failures.

Cascading Failure Progression
==============================

State 0: Steady state
  10 pods, 100 req/s each, 50% CPU
  Total capacity: 1000 req/s

State 1: Traffic spike (t=0s)
  1500 req/s total
  Each pod: 150 req/s, ~75% CPU
  HPA triggered, scaling in progress

State 2: First failure (t=30s)
  Pod-3 fails readiness probe (overloaded)
  Removed from endpoints
  9 pods handling 1500 req/s
  Each pod: 167 req/s, ~83% CPU

State 3: Cascade begins (t=45s)
  Pod-7 fails readiness, removed
  8 pods, 188 req/s each, ~94% CPU
  Pod-1 OOMKilled (memory spike under load)
  7 pods, 214 req/s each, over capacity

State 4: Collapse (t=60s)
  3 more pods fail readiness or OOM
  4 pods remain, completely saturated
  Effectively zero successful responses

State 5: Recovery attempt (t=90s)
  HPA scaled to 15 replicas
  New pods starting, but surviving pods are too
  saturated to serve traffic, including health checks
  Recovery takes 3-5x longer than the spike duration

Mitigations:

# Tolerant readiness probes prevent cascade removal
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 6     # Must fail for 60s before removal
  successThreshold: 1
 
# Pre-scale for known traffic events using CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prescale-peak-hours
spec:
  schedule: "30 8 * * 1-5"  # 8:30 AM weekdays, 30 min before peak
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: prescaler
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - patch
                - hpa/api-server
                - --type=merge
                - -p
                - '{"spec":{"minReplicas":25}}'
          restartPolicy: OnFailure
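
The patch call only succeeds if the Job's pod runs under a ServiceAccount allowed to modify the HPA. A minimal sketch of the required RBAC (names are illustrative; the matching serviceAccountName must also be added to the Job template above):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prescaler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prescaler
rules:
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prescaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prescaler
subjects:
  - kind: ServiceAccount
    name: prescaler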

Application-level circuit breakers should return 503 immediately under overload rather than queuing requests that will timeout. Fast failure preserves capacity for requests that can succeed.

DNS Resolution

The ndots Problem

The default ndots value in Kubernetes is 5. Any hostname with fewer than 5 dots is treated as a relative name, and the resolver appends each search domain before trying the absolute name.

DNS Resolution with ndots:5 (default)
=======================================

Application resolves: api.stripe.com  (2 dots, fewer than 5)

Query sequence:
  1. api.stripe.com.default.svc.cluster.local    A + AAAA  --> NXDOMAIN
  2. api.stripe.com.svc.cluster.local             A + AAAA  --> NXDOMAIN
  3. api.stripe.com.cluster.local                  A + AAAA  --> NXDOMAIN
  4. api.stripe.com.us-west-2.compute.internal     A + AAAA  --> NXDOMAIN
  5. api.stripe.com.                               A + AAAA  --> RESOLVED

Total DNS packets: 10 (5 names x 2 record types)
Wasted queries: 8 out of 10

At scale:
  10,000 pods x 50 external calls/sec x 10 packets/call
  = 5,000,000 DNS packets/sec hitting CoreDNS

DNS Resolution with ndots:2 (recommended)
==========================================

Application resolves: api.stripe.com  (2 dots, equal to ndots)

Query sequence:
  1. api.stripe.com.                               A + AAAA  --> RESOLVED

Total DNS packets: 2
Wasted queries: 0

Internal names still work:
  my-service.my-namespace  (1 dot, fewer than 2)
  Search path appended: my-service.my-namespace.svc.cluster.local --> RESOLVED

Configuration:

# Pod-level DNS configuration
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: single-request-reopen

The single-request-reopen option addresses a conntrack race condition. When the resolver sends A and AAAA queries simultaneously on the same UDP socket, both outgoing packets may receive the same conntrack entry source port. The kernel drops the second reply because it appears to be a duplicate. The resolver then waits for a 5-second timeout before retrying. single-request-reopen forces the resolver to use a new socket for the second query, avoiding the collision.

Conntrack Race Condition
=========================

Without single-request-reopen:

  App socket (port 12345)
    |
    +-- Send A query     --> conntrack entry: src=12345 dst=53
    +-- Send AAAA query  --> conntrack: SAME src=12345 dst=53
    |
    Reply to A:    conntrack matches, delivered to app
    Reply to AAAA: conntrack says "already seen reply", DROPS packet
    |
    App waits 5 seconds for AAAA timeout
    Retries on new socket
    Total resolution time: ~5000ms instead of ~1ms

With single-request-reopen:

  App socket 1 (port 12345)
    +-- Send A query     --> conntrack entry: src=12345 dst=53
  App socket 2 (port 12346)
    +-- Send AAAA query  --> conntrack entry: src=12346 dst=53
    |
    Both replies delivered correctly
    Total resolution time: ~1ms

CoreDNS at Scale

CoreDNS Scaling Architecture
==============================

Tier 1: Per-node cache (NodeLocal DNSCache DaemonSet)
+----------+   +----------+   +----------+   +----------+
| Node 1   |   | Node 2   |   | Node 3   |   | Node N   |
|          |   |          |   |          |   |          |
| Pod Pod  |   | Pod Pod  |   | Pod Pod  |   | Pod Pod  |
|  |   |   |   |  |   |   |   |  |   |   |   |  |   |   |
|  v   v   |   |  v   v   |   |  v   v   |   |  v   v   |
| NodeLocal|   | NodeLocal|   | NodeLocal|   | NodeLocal|
| DNS Cache|   | DNS Cache|   | DNS Cache|   | DNS Cache|
+----+-----+   +----+-----+   +----+-----+   +----+-----+
     |              |              |              |
     +--------------+--------------+--------------+
                         |
                         v
Tier 2: Cluster CoreDNS (scaled replicas)
     +--------------------------------------+
     | CoreDNS (6-10 replicas)              |
     | Plugins: autopath, cache, forward    |
     +--------------------------------------+
                         |
                         v
Tier 3: Upstream DNS (VPC resolver, cloud DNS)

NodeLocal DNSCache runs as a DaemonSet, providing a per-node caching resolver. Cache hits never leave the node, eliminating cross-node network latency and reducing CoreDNS query volume by approximately 80%.

The autopath CoreDNS plugin optimizes search domain resolution by returning the correct answer on the first query rather than requiring the client to iterate through all search domains.
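
A sketch of the relevant Corefile stanzas, assuming the stock CoreDNS ConfigMap layout; note that autopath requires the kubernetes plugin to run with pods verified, which makes CoreDNS watch all pods and increases its memory footprint:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            # pod verification is required for autopath to work
            pods verified
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        # answer search-path expansions server-side in a single query
        autopath @kubernetes
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }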

CoreDNS replica count should scale with cluster size. A rough guideline: 1 CoreDNS pod per 500 application pods, with a minimum of 3 replicas for redundancy.
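
One way to automate that guideline is the cluster-proportional-autoscaler, which resizes a target Deployment from node and core counts using parameters it reads from a ConfigMap. A rough sketch, assuming the autoscaler is deployed pointing at this ConfigMap and at the coredns Deployment; parameters are illustrative, and replicas become the larger of cores/coresPerReplica and nodes/nodesPerReplica, rounded up:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-autoscaler   # name must match the autoscaler's --configmap flag
  namespace: kube-system
data:
  linear: |
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 3,
      "preventSinglePointFailure": true
    }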

etcd Performance

etcd backs all Kubernetes state. Every API server operation translates to a read or write against etcd's Raft-replicated key-value store.

etcd Write Path
================

kubectl apply -f deployment.yaml
         |
         v
+------------------+
|  API Server      |
|  1. Authenticate |
|  2. Authorize    |
|  3. Validate     |
|  4. Admit        |
+--------+---------+
         |
         v  Write request
+------------------------------------------+
|             etcd Cluster                 |
|                                          |
|  +--------+    +--------+    +--------+  |
|  | Leader |--->|Follower|--->|Follower|  |
|  |        |<---|   1    |    |   2    |  |
|  +--------+    +--------+    +--------+  |
|                                          |
|  1. Leader receives write                |
|  2. Leader appends to local WAL          |
|  3. Leader replicates to followers       |
|  4. Majority (2/3) confirm               |
|  5. Leader commits, responds to client   |
|                                          |
|  Write latency = WAL fsync + network RTT |
|  Target: < 10ms for p99                  |
+------------------------------------------+

Performance-Critical Factors

Disk latency. etcd calls fdatasync on every write to the Write-Ahead Log. This is required by the Raft consensus protocol for durability guarantees. Shared or network-attached storage introduces latency that directly impacts every Kubernetes API operation. Dedicated NVMe SSDs are required for production etcd nodes. Target: p99 WAL fsync latency below 10ms.

# Check etcd disk fsync latency via Prometheus
# (available on self-managed clusters)
histogram_quantile(0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)
# Values above 0.01 (10ms) indicate disk performance problems
 
# On managed Kubernetes, monitor API server latency as a proxy
histogram_quantile(0.99,
  rate(apiserver_request_duration_seconds_bucket{verb="POST"}[5m])
)
# Sustained values above 0.5s suggest etcd pressure

Object count and watch connections. Each controller, kubelet, and operator maintains watch connections to the API server, which correspond to watches on etcd. A 500-node cluster with standard controllers and operators can sustain 50,000+ active watches. Each watch consumes memory on both the API server and etcd.

# Count objects by resource type, sorted by count
kubectl api-resources --verbs=list -o name | while read r; do
  count=$(kubectl get "$r" -A --no-headers 2>/dev/null | wc -l)
  [ "$count" -gt 0 ] && echo "$count $r"
done | sort -rn | head -20

Object size. etcd has a default maximum request size of 1.5MB. ConfigMaps and Secrets approaching this limit cause write failures with etcdserver: request is too large. Large data should be stored externally (S3, a database) with references in Kubernetes objects. Secrets are base64-encoded, which inflates size by approximately 33%.

Compaction and defragmentation. etcd maintains a revision history for all keys to support watch functionality. Without periodic compaction, the database grows continuously. On self-managed clusters, configure automatic compaction:

# etcd startup flags for compaction
--auto-compaction-retention=1h
--auto-compaction-mode=periodic

# After compaction, defragment to reclaim disk space
etcdctl defrag --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379

On managed Kubernetes (EKS, GKE, AKS), compaction is handled by the provider.

Networking

CNI Plugin Selection

The CNI plugin determines how pod networking is implemented. At scale, the choice has significant performance implications.

iptables-based Service Routing (kube-proxy default)
=====================================================

Service with 3 endpoints:

KUBE-SERVICES chain:
  Rule 1: -d 10.96.0.1/32 -p tcp --dport 443 --> KUBE-SVC-xxx
  Rule 2: -d 10.96.0.10/32 -p tcp --dport 53  --> KUBE-SVC-yyy
  ...
  Rule N: one rule per Service ClusterIP

KUBE-SVC-xxx chain:
  Rule 1: -m statistic --probability 0.333 --> KUBE-SEP-aaa
  Rule 2: -m statistic --probability 0.500 --> KUBE-SEP-bbb
  Rule 3:                                  --> KUBE-SEP-ccc

KUBE-SEP-aaa: DNAT to 10.244.1.5:8080
KUBE-SEP-bbb: DNAT to 10.244.2.8:8080
KUBE-SEP-ccc: DNAT to 10.244.3.2:8080

Rule count growth:
  Services    iptables Rules    iptables-save Time    Rule Update Time
  50          ~500              < 1s                  < 1s
  500         ~5,000            ~2s                   ~2s
  2,000       ~20,000           ~5s                   ~4s
  5,000       ~50,000           ~8s                   ~5s
  10,000      ~100,000          ~15s                  ~10s

During rule updates, packet processing stalls.

CNI Comparison

Feature                              Calico (iptables)  Calico (eBPF)      Cilium (eBPF)      AWS VPC CNI
Service routing                      iptables           eBPF               eBPF               iptables/eBPF
Network Policy                       Yes                Yes                Yes (extended)     Via Calico addon
Encryption                           WireGuard          WireGuard          WireGuard/IPsec    None (VPC level)
Observability                        Basic              Flow logs          Hubble (L7)        VPC Flow Logs
Max pods/node                        Overlay dependent  Overlay dependent  Overlay dependent  ENI limited
Scale tested                         5,000+ nodes       5,000+ nodes       5,000+ nodes       750 nodes (EKS)
Service routing latency (relative)   Baseline           40% lower          40% lower          Varies

eBPF-based dataplanes (Cilium, Calico eBPF) attach programs directly to network interfaces, bypassing iptables entirely. Service routing operations become O(1) hash lookups instead of O(n) chain traversals. This eliminates the scaling problems with iptables at high service counts.
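
As an illustration, switching Service routing to eBPF with Cilium is largely a Helm values change. A rough sketch, assuming a recent Cilium chart; flag names vary between chart versions, and the API server endpoint shown is a placeholder:

# values.yaml for the Cilium Helm chart
kubeProxyReplacement: true          # eBPF service routing, kube-proxy removed
k8sServiceHost: kube-api.internal   # placeholder: direct API server endpoint,
k8sServicePort: 6443                # needed because kube-proxy no longer exists
bpf:
  masquerade: true                  # eBPF-based masquerading instead of iptables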

Conntrack Table Exhaustion

The conntrack table tracks active network connections for NAT and stateful firewalling. Each entry consumes approximately 288 bytes. The default maximum is typically 131,072 entries (nf_conntrack_max).

Conntrack Table Pressure
==========================

Each active connection = 1 conntrack entry
Each DNAT (Service routing) = 1 conntrack entry
Each DNS query (UDP) = 1 conntrack entry (with timeout)

10,000 pods, each maintaining:
  5 persistent connections to databases
  10 connections to other services
  2 DNS queries/sec (conntrack timeout: 30s = 60 entries)

Per pod: ~75 conntrack entries
Total: 750,000 entries

Default nf_conntrack_max: 131,072
Result: conntrack table full, new connections DROPPED silently

Symptoms:
  - Intermittent connection timeouts
  - "nf_conntrack: table full, dropping packet" in dmesg
  - No errors visible at application layer

# Check conntrack utilization on a node
kubectl debug node/<node-name> -it --image=busybox -- \
  cat /proc/sys/net/netfilter/nf_conntrack_count
kubectl debug node/<node-name> -it --image=busybox -- \
  cat /proc/sys/net/netfilter/nf_conntrack_max
 
# Prometheus alert for conntrack pressure
# Alert when conntrack usage exceeds 80% of maximum
node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8

Increase nf_conntrack_max via sysctl tuning on node pools, or migrate to an eBPF-based CNI that does not rely on conntrack for service routing.
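
nf_conntrack_max is typically tuned at the node level rather than per pod; one common pattern is a privileged DaemonSet that applies the sysctl at node startup. A minimal sketch; the value is illustrative and should be sized to node memory, since each entry costs roughly 300 bytes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: conntrack-tuning
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: conntrack-tuning
  template:
    metadata:
      labels:
        app: conntrack-tuning
    spec:
      hostNetwork: true            # apply the sysctl in the host network namespace
      initContainers:
        - name: sysctl
          image: busybox:1.36
          securityContext:
            privileged: true
          command: ["sh", "-c", "sysctl -w net.netfilter.nf_conntrack_max=1048576"]
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # keeps the pod running after tuning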

Network Policies

By default, all pod-to-pod communication is allowed. A default-deny policy should be applied to every production namespace, with explicit allow rules for required communication paths.

# Default deny all ingress and egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
 
---
# Explicit allow: api-server ingress from ingress controller,
# egress to database and DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - port: 8080
          protocol: TCP
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - port: 5432
          protocol: TCP
    - to:    # DNS must be explicitly allowed under default-deny
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP

DNS egress must be explicitly allowed. Without it, all name resolution fails and every outbound connection attempt hangs until timeout.

Service Mesh Overhead

Sidecar-based service meshes (Istio, Linkerd) add a proxy container to every pod. The resource overhead scales linearly with pod count.

Service Mesh Resource Overhead at Scale
=========================================

Per-pod sidecar cost (Envoy/Istio):
  Memory: 100-200 MB
  CPU:    50-100m
  Latency: 1-5ms per hop (both directions)

At 10,000 pods:
  Memory: 10,000 x 150MB = 1.5 TB dedicated to proxies
  CPU:    10,000 x 75m   = 750 cores dedicated to proxies
  Cost:   Approximately $15,000-25,000/month in compute

Alternative: Cilium network policies + application-level retries
  Memory overhead: 0 per pod (DaemonSet-based)
  Latency overhead: < 0.1ms (kernel-level processing)
  mTLS: Available via Cilium mutual authentication or WireGuard

Provides approximately 80% of service mesh functionality
at approximately 10% of the resource cost.

A service mesh is justified when L7 traffic management (header-based routing, traffic mirroring, fault injection) is a hard requirement. For mTLS and network policy alone, CNI-level solutions are more efficient.

Pod Disruption Budgets and Topology Spread

Pod Disruption Budgets

PDBs control how many pods of a workload can be simultaneously unavailable during voluntary disruptions (node drains, cluster upgrades, autoscaler scale-down).

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server

With maxUnavailable: 1, kubectl drain will evict at most one matching pod at a time and block until a replacement is running before evicting the next.

PDBs do not protect against involuntary disruptions (node hardware failure, kernel panic, OOM kill). Topology spread constraints address this gap.

Topology Spread Constraints

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: api-server
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: api-server

Effect of Topology Spread on Failure Domains
===============================================

Without constraints:            With constraints:

Node-1 (zone-a)                 Node-1 (zone-a)
  +- api-pod-1                    +- api-pod-1
  +- api-pod-2
  +- api-pod-3                  Node-2 (zone-a)
                                  +- api-pod-2
Node-2 (zone-a)
  +- (other pods)               Node-3 (zone-b)
                                  +- api-pod-3
Node-3 (zone-b)
  +- (other pods)               Node-4 (zone-b)
                                  +- api-pod-4

Node-1 failure:                 Node-1 failure:
  3/3 api pods lost               1/4 api pods lost
  100% outage                     75% capacity maintained

Zone-a failure:                 Zone-a failure:
  3/3 api pods lost               2/4 api pods lost
  100% outage                     50% capacity maintained

The DoNotSchedule policy for hostname spread is strict: the scheduler will leave a pod unschedulable rather than violate the constraint. The ScheduleAnyway policy for zone spread is best-effort: the scheduler prefers balanced placement but will place pods in an imbalanced configuration if no better option exists.

Graceful Shutdown

When Kubernetes terminates a pod, two concurrent processes begin: SIGTERM delivery to the container and removal from Service endpoints. These processes are not synchronized.

Pod Termination Sequence
=========================

API server sets pod status to Terminating
         |
         +----> Kubelet receives update
         |        |
         |        +-> Execute preStop hook (if defined)
         |        +-> Send SIGTERM to PID 1
         |        |
         |        +-> Start terminationGracePeriodSeconds countdown
         |
         +----> Endpoints controller removes pod from Endpoints
                  |
                  +-> kube-proxy updates iptables/IPVS on each node
                  |   (propagation delay: 1-10 seconds across cluster)
                  |
                  +-> Ingress controllers remove from upstream list
                      (propagation delay: 1-15 seconds depending on
                       sync interval and controller implementation)

Timeline with preStop sleep:

t=0s    Pod marked Terminating
        preStop hook executes: sleep 5
        Endpoints controller begins removal

t=0-5s  Endpoint removal propagates across cluster
        Pod still running, still accepting connections
        In-flight requests continue processing

t=5s    preStop hook completes
        SIGTERM delivered to application
        Application begins draining connections
        (stop accepting new connections, finish existing ones)

t=5-25s Application drains in-flight requests

t=25s   Application exits cleanly

t=30s   terminationGracePeriodSeconds expires (default)
        If process still running: SIGKILL sent

Required configuration for zero-downtime termination:

spec:
  terminationGracePeriodSeconds: 60  # Must exceed preStop + drain time
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]

Application-side SIGTERM handling (Go example):

quit := make(chan os.Signal, 1)
signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
 
<-quit
log.Println("SIGTERM received, starting drain")
 
// Stop accepting new connections
server.SetKeepAlivesEnabled(false)
 
// Allow in-flight requests to complete
ctx, cancel := context.WithTimeout(context.Background(), 50*time.Second)
defer cancel()
 
if err := server.Shutdown(ctx); err != nil {
    log.Printf("Forced shutdown: %v", err)
}
log.Println("Drain complete, exiting")

The terminationGracePeriodSeconds value must be greater than preStop sleep + maximum expected drain time. If the application has not exited when the grace period expires, the container receives SIGKILL.

Rolling Updates

The default rolling update strategy (maxSurge: 25%, maxUnavailable: 25%) removes up to 25% of existing pods before replacements are confirmed healthy. For a 100-pod deployment, 25 pods can be simultaneously unavailable.

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0    # Never reduce capacity during rollout
  minReadySeconds: 30      # Pod must be Ready for 30s before considered Available

Rolling Update with maxUnavailable: 0
=======================================

Initial: 10 pods v1, all Ready

Step 1: Create 1 new pod (v2)
  v1: [R] [R] [R] [R] [R] [R] [R] [R] [R] [R]
  v2: [Starting...]
  Serving capacity: 10 pods (100%)

Step 2: v2 pod passes readiness, waits minReadySeconds
  v1: [R] [R] [R] [R] [R] [R] [R] [R] [R] [R]
  v2: [R] (waiting 30s)
  Serving capacity: 11 pods (110%)

Step 3: v2 pod available (30s elapsed), terminate 1 v1 pod
  v1: [R] [R] [R] [R] [R] [R] [R] [R] [R] [Terminating]
  v2: [R]
  Serving capacity: 10 pods (100%)

Step 4: Repeat until all pods are v2
  Capacity never drops below 100%

Compare with maxUnavailable: 25%:
  Step 1: Immediately terminate 2-3 v1 pods, start v2 pods
  Capacity drops to 70-80% during transition
  If v2 has a startup bug: 25% capacity lost to broken pods

minReadySeconds provides a safety window. Pods that start successfully but crash after 10-15 seconds (due to configuration errors, nil pointer dereferences, or dependency failures) are caught before old pods are terminated.

Observability

Metrics, Logs, and Traces Architecture

Observability Pipeline
========================

+-------------------+
| Application Pods  |
| (instrumented     |
|  with OTel SDK)   |
+---+------+-----+--+
    |      |     |
    |      |     +-------> Traces (OTLP/gRPC)
    |      |                    |
    |      +-----------> Logs (stdout/stderr)
    |                       |   |
    +-----------------> Metrics (Prometheus scrape)
    |                   |       |
    v                   v       v
+----------+    +----------+  +----------+
|Prometheus|    | Fluent   |  |  OTel    |
| (scrape) |    | Bit      |  | Collector|
|          |    | DaemonSet|  |          |
+-----+----+    +-----+----+  +-----+----+
      |               |             |
      v               v             v
+----------+    +----------+  +----------+
| Thanos   |    | Loki     |  | Tempo    |
| (long    |    | (log     |  | (trace   |
|  term)   |    |  store)  |  |  store)  |
+-----+----+    +-----+----+  +-----+----+
      |               |             |
      +-------+-------+-------+-----+
              |
              v
      +---------------+
      |   Grafana     |
      | Correlation:  |
      | trace_id ties |
      | metrics, logs,|
      | traces into a |
      | single view   |
      +---------------+

Metrics Cardinality

Prometheus stores one time series per unique label combination. Unbounded labels (user IDs, request IDs, IP addresses) cause cardinality explosion.

Cardinality Calculation
========================

Labels: {method, endpoint, status_bucket}

method:        4 values  (GET, POST, PUT, DELETE)
endpoint:     50 values  (/api/v1/users, /api/v1/orders, ...)
status_bucket: 5 values  (2xx, 3xx, 4xx, 5xx, other)

Total series: 4 x 50 x 5 = 1,000 series per metric
At 15s scrape interval: 1,000 x 4 samples/min = manageable

Adding user_id label (1,000,000 unique users):
Total series: 4 x 50 x 5 x 1,000,000 = 1,000,000,000 series
Prometheus memory: ~2 KB per series = 2 TB RAM
Result: OOM crash

Rules for label cardinality:

  • Labels should have bounded, low-cardinality values (typically fewer than 100 unique values per label).
  • High-cardinality identifiers (user_id, trace_id, request_id) belong in logs and traces, not metrics.
  • Use status_bucket (2xx, 3xx, 4xx, 5xx) instead of raw status_code (200, 201, 204, 301, ...).
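
A scrape-side safety net can also drop known high-cardinality labels before they are ingested. A minimal sketch of a Prometheus scrape_config fragment; the label and metric names are illustrative:

scrape_configs:
  - job_name: app
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # Strip identifiers that would explode cardinality if they leak into metrics
      - action: labeldrop
        regex: "user_id|request_id|session_id"
      # Drop entire debug-only metric families
      - action: drop
        source_labels: [__name__]
        regex: "app_debug_.*"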

Structured Logging

All log output should be structured JSON with a trace_id field for correlation:

{
  "level": "info",
  "ts": "2025-06-22T14:30:05.123Z",
  "msg": "request processed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "service": "api-gateway",
  "method": "POST",
  "path": "/api/v1/orders",
  "status": 201,
  "latency_ms": 45,
  "bytes_out": 892
}

The trace_id enables cross-system correlation: identify a latency spike in metrics, query logs filtered by trace_id from that time window, view the distributed trace to identify the slow component.

Trace Sampling Strategies

At high traffic volumes, sampling 100% of traces is neither practical nor necessary.

Strategy              Sample Rate    Use Case
Always sample errors  100% of 5xx    Root cause analysis for failures
Tail-based (latency)  100% of p95+   Performance regression detection
Probabilistic         1-5%           Baseline visibility, architecture mapping
Rate-limited          N traces/sec   Cost control with guaranteed minimum

# OpenTelemetry Collector sampling configuration
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

Debugging Tools and Commands

kubectl Commands for Incident Response

# Node resource allocation vs actual usage
kubectl describe node <node-name> | grep -A 20 "Allocated resources"
kubectl top node <node-name>
 
# Pods not in Running state, across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running
 
# Recent cluster events, sorted by timestamp
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
 
# Endpoints for a service (empty = selector mismatch or no ready pods)
kubectl get endpoints <service-name> -o yaml
 
# Resource usage ranked by CPU
kubectl top pods -A --sort-by=cpu | head -20
 
# Verify environment variables injected into a pod
kubectl exec -it <pod> -- env | sort
 
# Check DNS resolution from inside a pod
kubectl exec -it <pod> -- nslookup kubernetes.default.svc.cluster.local
 
# Inspect a pod's resolved DNS configuration
kubectl exec -it <pod> -- cat /etc/resolv.conf

stern: Multi-Pod Log Tailing

# Tail logs from all pods matching a name pattern
stern api-server --since 5m --namespace production
 
# Tail logs from all containers in pods with a specific label
stern --selector app=api-server --all-namespaces --since 10m
 
# Output as JSON for piping to jq
stern api-server -o json | jq 'select(.message | contains("error"))'

kubectl debug: Ephemeral Debug Containers

# Attach a debug container to a running pod (distroless/minimal images)
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=app
 
# Debug node-level issues
kubectl debug node/<node-name> -it --image=ubuntu
 
# Common debug operations inside netshoot container:
#   tcpdump -i eth0 port 8080           # Packet capture
#   curl -v http://service.namespace:80  # HTTP connectivity
#   nslookup api.stripe.com              # DNS resolution
#   ss -tlnp                             # Open TCP listeners
#   conntrack -L | wc -l                 # Conntrack entry count

k9s: Terminal UI

k9s provides a terminal-based interface for Kubernetes cluster management. Key bindings:

:pods        Navigate to pod list
:deploy      Navigate to deployments
:svc         Navigate to services
:events      Navigate to events
:ns          Switch namespace
/            Filter current view
l            View logs for selected pod
s            Shell into selected pod
d            Describe selected resource
ctrl-d       Delete selected resource
y            View YAML for selected resource
ctrl-a       Show all available resource types

Scheduler Behavior at Scale

Overcommit and QoS Classes

The scheduler evaluates only requests during the filtering phase. When limits exceed requests, the node can become overcommitted.

Overcommit Scenario
=====================

Node allocatable: 3.7 CPU (4 CPU capacity minus system reservations)

Pod configuration: requests=100m, limits=2000m

Scheduler perspective:
  3700m allocatable / 100m per pod = 37 pods can fit

Actual worst case:
  37 pods x 2000m limit = 74 CPU of demand on 4 CPU node
  CPU contention ratio: 18.5:1

QoS class determines eviction priority:

  Guaranteed (requests == limits):
    - Highest priority
    - Last to be evicted under node pressure
    - Predictable performance, no bursting

  Burstable (requests < limits, or requests set without limits):
    - Medium priority
    - Evicted after BestEffort pods

  BestEffort (no requests or limits):
    - Lowest priority
    - First to be evicted under any resource pressure
    - Eviction happens immediately, no grace period

For services where latency predictability is more important than cost efficiency, use Guaranteed QoS (set requests equal to limits for both CPU and memory). The tradeoff is higher resource reservation with no ability to burst beyond the allocated amount.
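
A minimal resources stanza that yields Guaranteed QoS (values illustrative; requests and limits must match exactly for every container in the pod):

resources:
  requests:
    cpu: 2000m
    memory: 4Gi
  limits:
    cpu: 2000m     # equal to the request: Guaranteed QoS, no burst headroom
    memory: 4Gi    # equal to the request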

Priority Classes

Priority classes influence both scheduling order and eviction behavior:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For production-critical services that must be scheduled"
 
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-workload
value: 100
globalDefault: false
preemptionPolicy: Never
description: "For batch jobs that should not preempt other pods"

When a high-priority pod cannot be scheduled due to resource constraints, the scheduler can preempt (evict) lower-priority pods to make room. Setting preemptionPolicy: Never on batch workloads prevents them from being scheduled at the expense of other workloads, while still allowing them to be evicted by higher-priority pods.
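
Workloads opt in by referencing the class in their pod template:

# Deployment pod template fragment (illustrative)
spec:
  template:
    spec:
      priorityClassName: critical-service   # PriorityClass defined above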

Summary of Critical Configurations

Area                       Configuration                    Impact
CPU limits                 Omit for latency-sensitive pods  Prevents CFS throttling
Memory limits              Always set, 1.3-2x working set   Prevents uncontrolled OOM
DNS ndots                  Set to 2                         Reduces DNS queries by 80%
DNS single-request-reopen  Enable                           Eliminates 5s timeout from conntrack race
NodeLocal DNS              Deploy DaemonSet                 Eliminates cross-node DNS latency
HPA minReplicas            Baseline + 50% headroom          Absorbs spike during scaling delay
HPA scale-down             300s stabilization               Prevents flapping
PDB                        maxUnavailable: 1                Protects during voluntary disruptions
Topology spread            hostname: DoNotSchedule          Survives single node failure
preStop hook               sleep 5                          Allows endpoint removal propagation
terminationGracePeriod     preStop + drain + buffer         Prevents SIGKILL during drain
maxUnavailable (rollout)   0                                Capacity never drops during deploy
minReadySeconds            30                               Catches pods that crash shortly after start
conntrack max              Scale with pod count             Prevents silent connection drops