Resource Requests and Limits
Resource requests are the foundation of Kubernetes scheduling. The scheduler uses requests, not actual utilization, to make placement decisions. Inaccurate requests cascade into scheduling failures, bin-packing inefficiency, and node-level resource contention.
Scheduling Decision Flow
========================
Pod enters scheduling queue
|
v
+------------------------------+
| Filtering Phase |
| "Can this node run the pod?"|
| |
| Node allocatable: 3.7 CPU |
| Already reserved: 2.5 CPU |
| Remaining: 1.2 CPU |
| |
| Pod requests 2.0 CPU? |
| 1.2 < 2.0 --> REJECT |
| |
| Pod requests 0.5 CPU? |
| 1.2 >= 0.5 --> ACCEPT |
+------------------------------+
|
v
+------------------------------+
| Scoring Phase |
| "Which node is best?" |
| |
| LeastRequested score |
| BalancedAllocation score |
| TopologySpread score |
| NodeAffinity score |
| |
| Weighted sum --> winner |
+------------------------------+
|
v
Pod bound to node
A basic resource specification:
resources:
  requests:
    cpu: 500m        # Scheduler reserves this amount on the node
    memory: 256Mi    # Scheduler reserves this amount on the node
  limits:
    cpu: 1000m       # CFS bandwidth control enforced here
    memory: 512Mi    # OOM killer enforced here

CFS Throttling and CPU Limits
The Linux Completely Fair Scheduler (CFS) bandwidth controller operates on a configurable period, defaulting to 100ms. When a container has a CPU limit of 1000m, it receives a quota of 100ms of CPU time per 100ms period. If the container exhausts this quota in a burst (garbage collection, request batching, JIT compilation), the kernel throttles all threads in that cgroup for the remainder of the period. This occurs regardless of available CPU capacity on the node.
CFS Throttling Mechanism (100ms default period)
================================================
CPU limit: 1000m = 100ms quota per 100ms period
Period 1 Period 2 Period 3
|---- 100ms ------| |---- 100ms ------| |---- 100ms ------|
[################] [########........] [##########......]
100ms consumed 80ms consumed 95ms consumed
THROTTLED 0ms OK, 20ms unused OK, 5ms unused
|
+-- All threads frozen until next period boundary.
Node may have 6 idle cores. Does not matter.
Quota is per-cgroup, not per-node.
Impact on request latency:
Request arrives at t=92ms into period
Remaining quota: 8ms
Request needs: 25ms of CPU
Result:
t=92ms: Start processing (8ms quota remains)
t=100ms: Quota exhausted, thread frozen
t=100ms: New period begins, 100ms quota refreshed
t=117ms: Processing completes (8ms + 17ms = 25ms CPU)
Total wall time: 25ms elapsed
Without throttling: 25ms elapsed
Visible penalty: 0ms (got lucky, small request)
Worse case: GC pause starts at t=5ms, burns 95ms
t=100ms: Quota exhausted, thread frozen
t=100ms: New period, another 40ms needed
t=140ms: GC completes
Application thread resumes, processes request
User-visible latency: extra 40ms+ added to any
request that overlapped with the GC pause
When CFS throttling kicks in on a latency-sensitive pod, standard monitoring typically shows low average CPU utilization. The throttling is invisible unless specifically measured.
# Direct cgroup inspection (cgroup v2)
kubectl exec -it <pod> -- cat /sys/fs/cgroup/cpu.stat
# Fields of interest:
# nr_throttled - number of times the cgroup was throttled
# throttled_usec - total time spent throttled in microseconds
# cgroup v1 (older kernels)
kubectl exec -it <pod> -- cat /sys/fs/cgroup/cpu/cpu.stat
# Fields: nr_throttled, throttled_time
# Prometheus query for throttle ratio
# Values above 0.05 (5%) indicate a problem
rate(container_cpu_cfs_throttled_periods_total{container="app"}[5m])
/
rate(container_cpu_cfs_periods_total{container="app"}[5m])

Resource Strategy Comparison
| Strategy | CPU Requests | CPU Limits | Memory Limits | QoS Class | Use Case |
|---|---|---|---|---|---|
| Guaranteed | 1000m | 1000m | 512Mi (= req) | Guaranteed | Latency-critical, databases |
| Burstable, no CPU limit | 500m | (none) | 512Mi | Burstable | API servers, web services |
| Burstable, with CPU limit | 500m | 1000m | 512Mi | Burstable | Background workers, batch |
| BestEffort | (none) | (none) | (none) | BestEffort | Development only, never production |
For latency-sensitive workloads, omitting CPU limits prevents CFS throttling while still providing scheduling guarantees through requests:
# Recommended for latency-sensitive services
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    # CPU limit intentionally omitted to prevent CFS throttling
    memory: 512Mi  # Memory limits should always be set

Memory limits must always be set. Exceeding a memory limit triggers the OOM killer, which sends SIGKILL (not SIGTERM). No graceful shutdown occurs. Set memory limits to 1.3x-2x the observed steady-state working set.
Determining Correct Request Values
Use Vertical Pod Autoscaler in recommendation-only mode or query Prometheus directly:
# CPU request target: P95 usage over 7 days
quantile_over_time(0.95,
rate(container_cpu_usage_seconds_total{
container="app", namespace="production"
}[5m])[7d:]
)
# Memory request target: P99 usage over 7 days
# Higher percentile because memory overshoot triggers OOM kill
quantile_over_time(0.99,
container_memory_working_set_bytes{
container="app", namespace="production"
}[7d:]
)

VPA in auto mode should not be used for production workloads. It restarts pods to apply new resource values, which can cause connection resets during in-flight requests.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Off"  # Recommendation only, no automatic resizing

Horizontal Pod Autoscaler
Scaling Latency
The time between a traffic increase and new capacity becoming available is typically 60-120 seconds. This delay is composed of multiple sequential stages:
HPA Scaling Timeline
=====================
t=0s Traffic spike arrives
Existing pods absorb load, CPU rises
|
t=15s cAdvisor scrapes container metrics (default 15s interval)
|
t=30s metrics-server aggregates data (15s window)
|
t=45s HPA controller evaluates metrics (default 15s sync period)
HPA calculates desired replica count
HPA issues scale request to Deployment
|
t=47s Scheduler assigns new pods to nodes
|
t=47-60s  Kubelet begins image pull
          Image size matters:
            50MB Alpine-based: ~3s pull
            200MB JDK-based: ~8s pull
            2GB full OS + deps: 30s+ pull
|
t=60s Container runtime starts container
Application initialization begins
Go binary: ~1s
Node.js: ~3s
JVM (Spring): 15-45s
|
t=65-80s  Readiness probe begins checking
          initialDelaySeconds + periodSeconds * successThreshold
            |
            v
t=80s     Pod passes readiness, added to Endpoints object
kube-proxy/IPVS rules updated on all nodes
|
t=85s New pod begins receiving traffic
=====================================
85 seconds minimum from spike to relief
HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 10
  maxReplicas: 100
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up without delay
      policies:
      - type: Percent
        value: 100                     # Allow doubling capacity per step
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300  # 5 minute cooldown before scale-down
      policies:
      - type: Percent
        value: 10                      # Remove at most 10% per period
        periodSeconds: 60
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60         # Scale at 60%, not 80%

Key parameters:
- minReplicas should be set to handle baseline traffic plus 50% headroom. These pods absorb the initial spike during the 60-120 second autoscaling delay.
- Scale-up should be fast (no stabilization window). Scale-down should be slow (5+ minute stabilization) to prevent flapping.
- Target utilization at 60% provides headroom. At 80% target utilization, a 30% traffic increase pushes pods to 104% before the HPA reacts.
Cascading Failure Under Autoscaling
When pods become overloaded, readiness probes can fail, causing Kubernetes to remove them from service endpoints. This increases load on remaining pods, triggering further readiness failures.
Cascading Failure Progression
==============================
State 0: Steady state
10 pods, 100 req/s each, 50% CPU
Total capacity: 1000 req/s
State 1: Traffic spike (t=0s)
1500 req/s total
Each pod: 150 req/s, ~75% CPU
HPA triggered, scaling in progress
State 2: First failure (t=30s)
Pod-3 fails readiness probe (overloaded)
Removed from endpoints
9 pods handling 1500 req/s
Each pod: 167 req/s, ~83% CPU
State 3: Cascade begins (t=45s)
Pod-7 fails readiness, removed
8 pods, 188 req/s each, ~94% CPU
Pod-1 OOMKilled (memory spike under load)
7 pods, 214 req/s each, over capacity
State 4: Collapse (t=60s)
3 more pods fail readiness or OOM
4 pods remain, completely saturated
Effectively zero successful responses
State 5: Recovery attempt (t=90s)
HPA scaled to 15 replicas
New pods starting, but surviving pods
cannot serve traffic including health checks
Recovery takes 3-5x longer than the spike duration
Mitigations:
# Tolerant readiness probes prevent cascade removal
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 6   # Must fail for 60s before removal
  successThreshold: 1
# Pre-scale for known traffic events using CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prescale-peak-hours
spec:
  schedule: "30 8 * * 1-5"  # 8:30 AM weekdays, 30 min before peak
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: prescaler
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - patch
            - hpa/api-server
            - --type=merge
            - -p
            - '{"spec":{"minReplicas":25}}'
          restartPolicy: OnFailure

Application-level circuit breakers should return 503 immediately under overload rather than queuing requests that will time out. Fast failure preserves capacity for requests that can succeed.
DNS Resolution
The ndots Problem
The default ndots value in Kubernetes is 5. Any hostname with fewer than 5 dots is treated as a relative name, and the resolver appends each search domain before trying the absolute name.
DNS Resolution with ndots:5 (default)
=======================================
Application resolves: api.stripe.com (2 dots, fewer than 5)
Query sequence:
1. api.stripe.com.default.svc.cluster.local A + AAAA --> NXDOMAIN
2. api.stripe.com.svc.cluster.local A + AAAA --> NXDOMAIN
3. api.stripe.com.cluster.local A + AAAA --> NXDOMAIN
4. api.stripe.com.us-west-2.compute.internal A + AAAA --> NXDOMAIN
5. api.stripe.com. A + AAAA --> RESOLVED
Total DNS packets: 10 (5 names x 2 record types)
Wasted queries: 8 out of 10
At scale:
10,000 pods x 50 external calls/sec x 10 packets/call
= 5,000,000 DNS packets/sec hitting CoreDNS
DNS Resolution with ndots:2 (recommended)
==========================================
Application resolves: api.stripe.com (2 dots, equal to ndots)
Query sequence:
1. api.stripe.com. A + AAAA --> RESOLVED
Total DNS packets: 2
Wasted queries: 0
Internal names still work:
my-service.my-namespace (1 dot, fewer than 2)
Search path appended: my-service.my-namespace.svc.cluster.local --> RESOLVED
Configuration:
# Pod-level DNS configuration
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"
    - name: single-request-reopen

The single-request-reopen option addresses a conntrack race condition. When the resolver sends A and AAAA queries simultaneously on the same UDP socket, both outgoing packets may receive the same conntrack entry source port. The kernel drops the second reply because it appears to be a duplicate. The resolver then waits for a 5-second timeout before retrying. single-request-reopen forces the resolver to use a new socket for the second query, avoiding the collision.
Conntrack Race Condition
=========================
Without single-request-reopen:
App socket (port 12345)
|
+-- Send A query --> conntrack entry: src=12345 dst=53
+-- Send AAAA query --> conntrack: SAME src=12345 dst=53
|
Reply to A: conntrack matches, delivered to app
Reply to AAAA: conntrack says "already seen reply", DROPS packet
|
App waits 5 seconds for AAAA timeout
Retries on new socket
Total resolution time: ~5000ms instead of ~1ms
With single-request-reopen:
App socket 1 (port 12345)
+-- Send A query --> conntrack entry: src=12345 dst=53
App socket 2 (port 12346)
+-- Send AAAA query --> conntrack entry: src=12346 dst=53
|
Both replies delivered correctly
Total resolution time: ~1ms
CoreDNS at Scale
CoreDNS Scaling Architecture
==============================
Tier 1: Per-node cache (NodeLocal DNSCache DaemonSet)
+----------+ +----------+ +----------+ +----------+
| Node 1 | | Node 2 | | Node 3 | | Node N |
| | | | | | | |
| Pod Pod | | Pod Pod | | Pod Pod | | Pod Pod |
| | | | | | | | | | | | | | | |
| v v | | v v | | v v | | v v |
| NodeLocal| | NodeLocal| | NodeLocal| | NodeLocal|
| DNS Cache| | DNS Cache| | DNS Cache| | DNS Cache|
+----+-----+ +----+-----+ +----+-----+ +----+-----+
| | | |
+--------------+--------------+--------------+
|
v
Tier 2: Cluster CoreDNS (scaled replicas)
+--------------------------------------+
| CoreDNS (6-10 replicas) |
| Plugins: autopath, cache, forward |
+--------------------------------------+
|
v
Tier 3: Upstream DNS (VPC resolver, cloud DNS)
NodeLocal DNSCache runs as a DaemonSet, providing a per-node caching resolver. Cache hits never leave the node, eliminating cross-node network latency and reducing CoreDNS query volume by approximately 80%.
The autopath CoreDNS plugin optimizes search domain resolution by returning the correct answer on the first query rather than requiring the client to iterate through all search domains.
CoreDNS replica count should scale with cluster size. A rough guideline: 1 CoreDNS pod per 500 application pods, with a minimum of 3 replicas for redundancy.
etcd Performance
etcd backs all Kubernetes state. Every API server operation translates to a read or write against etcd's Raft-replicated key-value store.
etcd Write Path
================
kubectl apply -f deployment.yaml
|
v
+------------------+
| API Server |
| 1. Authenticate |
| 2. Authorize |
| 3. Validate |
| 4. Admit |
+--------+---------+
|
v Write request
+------------------------------------------+
| etcd Cluster |
| |
| +--------+ +--------+ +--------+ |
| | Leader |--->|Follower|--->|Follower| |
| | |<---| 1 | | 2 | |
| +--------+ +--------+ +--------+ |
| |
| 1. Leader receives write |
| 2. Leader appends to local WAL |
| 3. Leader replicates to followers |
| 4. Majority (2/3) confirm |
| 5. Leader commits, responds to client |
| |
| Write latency = WAL fsync + network RTT |
| Target: < 10ms for p99 |
+------------------------------------------+
Performance-Critical Factors
Disk latency. etcd calls fdatasync on every write to the Write-Ahead Log. This is required by the Raft consensus protocol for durability guarantees. Shared or network-attached storage introduces latency that directly impacts every Kubernetes API operation. Dedicated NVMe SSDs are required for production etcd nodes. Target: p99 WAL fsync latency below 10ms.
# Check etcd disk fsync latency via Prometheus
# (available on self-managed clusters)
histogram_quantile(0.99,
rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)
# Values above 0.01 (10ms) indicate disk performance problems
# On managed Kubernetes, monitor API server latency as a proxy
histogram_quantile(0.99,
rate(apiserver_request_duration_seconds_bucket{verb="POST"}[5m])
)
# Sustained values above 0.5s suggest etcd pressure

Object count and watch connections. Each controller, kubelet, and operator maintains watch connections to the API server, which correspond to watches on etcd. A 500-node cluster with standard controllers and operators can sustain 50,000+ active watches. Each watch consumes memory on both the API server and etcd.
# Count objects by resource type, sorted by count
kubectl api-resources --verbs=list -o name | while read r; do
count=$(kubectl get "$r" -A --no-headers 2>/dev/null | wc -l)
[ "$count" -gt 0 ] && echo "$count $r"
done | sort -rn | head -20

Object size. etcd has a default maximum request size of 1.5MB. ConfigMaps and Secrets approaching this limit cause write failures with etcdserver: request is too large. Large data should be stored externally (S3, a database) with references in Kubernetes objects. Secrets are base64-encoded, which inflates size by approximately 33%.
Compaction and defragmentation. etcd maintains a revision history for all keys to support watch functionality. Without periodic compaction, the database grows continuously. On self-managed clusters, configure automatic compaction:
# etcd startup flags for compaction
--auto-compaction-retention=1h
--auto-compaction-mode=periodic
# After compaction, defragment to reclaim disk space
etcdctl defrag --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379
On managed Kubernetes (EKS, GKE, AKS), compaction is handled by the provider.
Networking
CNI Plugin Selection
The CNI plugin determines how pod networking is implemented. At scale, the choice has significant performance implications.
iptables-based Service Routing (kube-proxy default)
=====================================================
Service with 3 endpoints:
KUBE-SERVICES chain:
Rule 1: -d 10.96.0.1/32 -p tcp --dport 443 --> KUBE-SVC-xxx
Rule 2: -d 10.96.0.10/32 -p tcp --dport 53 --> KUBE-SVC-yyy
...
Rule N: one rule per Service ClusterIP
KUBE-SVC-xxx chain:
Rule 1: -m statistic --probability 0.333 --> KUBE-SEP-aaa
Rule 2: -m statistic --probability 0.500 --> KUBE-SEP-bbb
Rule 3: --> KUBE-SEP-ccc
KUBE-SEP-aaa: DNAT to 10.244.1.5:8080
KUBE-SEP-bbb: DNAT to 10.244.2.8:8080
KUBE-SEP-ccc: DNAT to 10.244.3.2:8080
Rule count growth:
Services iptables Rules iptables-save Time Rule Update Time
50 ~500 < 1s < 1s
500 ~5,000 ~2s ~2s
2,000 ~20,000 ~5s ~4s
5,000 ~50,000 ~8s ~5s
10,000 ~100,000 ~15s ~10s
During rule updates, packet processing stalls.
CNI Comparison
| Feature | Calico (iptables) | Calico (eBPF) | Cilium (eBPF) | AWS VPC CNI |
|---|---|---|---|---|
| Service routing | iptables | eBPF | eBPF | iptables/eBPF |
| Network Policy | Yes | Yes | Yes (extended) | Via Calico addon |
| Encryption | WireGuard | WireGuard | WireGuard/IPsec | None (VPC level) |
| Observability | Basic | Flow logs | Hubble (L7) | VPC Flow Logs |
| Max pods/node | Overlay dependent | Overlay dependent | Overlay dependent | ENI limited |
| Scale tested | 5,000+ nodes | 5,000+ nodes | 5,000+ nodes | 750 nodes (EKS) |
| Service routing latency (relative) | Baseline | 40% lower | 40% lower | Varies |
eBPF-based dataplanes (Cilium, Calico eBPF) attach programs directly to network interfaces, bypassing iptables entirely. Service routing operations become O(1) hash lookups instead of O(n) chain traversals. This eliminates the scaling problems with iptables at high service counts.
Conntrack Table Exhaustion
The conntrack table tracks active network connections for NAT and stateful firewalling. Each entry consumes approximately 288 bytes. The default maximum is typically 131,072 entries (nf_conntrack_max).
Conntrack Table Pressure
==========================
Each active connection = 1 conntrack entry
Each DNAT (Service routing) = 1 conntrack entry
Each DNS query (UDP) = 1 conntrack entry (with timeout)
10,000 pods, each maintaining:
5 persistent connections to databases
10 connections to other services
2 DNS queries/sec (conntrack timeout: 30s = 60 entries)
Per pod: ~75 conntrack entries
Total: 750,000 entries
Default nf_conntrack_max: 131,072
Result: conntrack table full, new connections DROPPED silently
Symptoms:
- Intermittent connection timeouts
- "nf_conntrack: table full, dropping packet" in dmesg
- No errors visible at application layer
# Check conntrack utilization on a node
kubectl debug node/<node-name> -it --image=busybox -- \
cat /proc/sys/net/netfilter/nf_conntrack_count
kubectl debug node/<node-name> -it --image=busybox -- \
cat /proc/sys/net/netfilter/nf_conntrack_max
# Prometheus alert for conntrack pressure
# Alert when conntrack usage exceeds 80% of maximum
node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8

Increase nf_conntrack_max via sysctl tuning on node pools, or migrate to an eBPF-based CNI that does not rely on conntrack for service routing.
Network Policies
By default, all pod-to-pod communication is allowed. A default-deny policy should be applied to every production namespace, with explicit allow rules for required communication paths.
# Default deny all ingress and egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# Explicit allow: api-server ingress from ingress controller,
# egress to database and DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - port: 8080
      protocol: TCP
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - port: 5432
      protocol: TCP
  - to:  # DNS must be explicitly allowed under default-deny
    - namespaceSelector: {}
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP

DNS egress must be explicitly allowed. Without it, all name resolution fails and every outbound connection attempt hangs until timeout.
Service Mesh Overhead
Sidecar-based service meshes (Istio, Linkerd) add a proxy container to every pod. The resource overhead scales linearly with pod count.
Service Mesh Resource Overhead at Scale
=========================================
Per-pod sidecar cost (Envoy/Istio):
Memory: 100-200 MB
CPU: 50-100m
Latency: 1-5ms per hop (both directions)
At 10,000 pods:
Memory: 10,000 x 150MB = 1.5 TB dedicated to proxies
CPU: 10,000 x 75m = 750 cores dedicated to proxies
Cost: Approximately $15,000-25,000/month in compute
Alternative: Cilium network policies + application-level retries
Memory overhead: 0 per pod (DaemonSet-based)
Latency overhead: < 0.1ms (kernel-level processing)
mTLS: Available via Cilium mutual authentication or WireGuard
Provides approximately 80% of service mesh functionality
at approximately 10% of the resource cost.
A service mesh is justified when L7 traffic management (header-based routing, traffic mirroring, fault injection) is a hard requirement. For mTLS and network policy alone, CNI-level solutions are more efficient.
Pod Disruption Budgets and Topology Spread
Pod Disruption Budgets
PDBs control how many pods of a workload can be simultaneously unavailable during voluntary disruptions (node drains, cluster upgrades, autoscaler scale-down).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server

With maxUnavailable: 1, kubectl drain will evict at most one matching pod at a time and block until a replacement is running before evicting the next.
PDBs do not protect against involuntary disruptions (node hardware failure, kernel panic, OOM kill). Topology spread constraints address this gap.
Topology Spread Constraints
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api-server
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api-server

Effect of Topology Spread on Failure Domains
===============================================
Without constraints: With constraints:
Node-1 (zone-a) Node-1 (zone-a)
+- api-pod-1 +- api-pod-1
+- api-pod-2
+- api-pod-3 Node-2 (zone-a)
+- api-pod-2
Node-2 (zone-a)
+- (other pods) Node-3 (zone-b)
+- api-pod-3
Node-3 (zone-b)
+- (other pods) Node-4 (zone-b)
+- api-pod-4
Node-1 failure: Node-1 failure:
3/3 api pods lost 1/4 api pods lost
100% outage 75% capacity maintained
Zone-a failure: Zone-a failure:
3/3 api pods lost 2/4 api pods lost
100% outage 50% capacity maintained
The DoNotSchedule policy for hostname spread is strict: the scheduler will leave a pod unschedulable rather than violate the constraint. The ScheduleAnyway policy for zone spread is best-effort: the scheduler prefers balanced placement but will place pods in an imbalanced configuration if no better option exists.
Graceful Shutdown
When Kubernetes terminates a pod, two concurrent processes begin: SIGTERM delivery to the container and removal from Service endpoints. These processes are not synchronized.
Pod Termination Sequence
=========================
API server sets pod status to Terminating
|
+----> Kubelet receives update
| |
| +-> Execute preStop hook (if defined)
| +-> Send SIGTERM to PID 1
| |
| +-> Start terminationGracePeriodSeconds countdown
|
+----> Endpoints controller removes pod from Endpoints
|
+-> kube-proxy updates iptables/IPVS on each node
| (propagation delay: 1-10 seconds across cluster)
|
+-> Ingress controllers remove from upstream list
(propagation delay: 1-15 seconds depending on
sync interval and controller implementation)
Timeline with preStop sleep:
t=0s Pod marked Terminating
preStop hook executes: sleep 5
Endpoints controller begins removal
t=0-5s Endpoint removal propagates across cluster
Pod still running, still accepting connections
In-flight requests continue processing
t=5s preStop hook completes
SIGTERM delivered to application
Application begins draining connections
(stop accepting new connections, finish existing ones)
t=5-25s Application drains in-flight requests
t=25s Application exits cleanly
t=30s terminationGracePeriodSeconds expires (default)
If process still running: SIGKILL sent
Required configuration for zero-downtime termination:
spec:
  terminationGracePeriodSeconds: 60  # Must exceed preStop + drain time
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]

Application-side SIGTERM handling (Go example):
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	server := &http.Server{Addr: ":8080"}
	go func() { _ = server.ListenAndServe() }()

	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
	<-quit
	log.Println("SIGTERM received, starting drain")

	// Close idle keep-alive connections; Shutdown stops accepting new ones
	server.SetKeepAlivesEnabled(false)

	// Allow in-flight requests to complete
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Second)
	defer cancel()
	if err := server.Shutdown(ctx); err != nil {
		log.Printf("Forced shutdown: %v", err)
	}
	log.Println("Drain complete, exiting")
}

The terminationGracePeriodSeconds value must be greater than preStop sleep + maximum expected drain time. If the application has not exited when the grace period expires, the container receives SIGKILL.
Rolling Updates
The default rolling update strategy (maxSurge: 25%, maxUnavailable: 25%) removes up to 25% of existing pods before replacements are confirmed healthy. For a 100-pod deployment, 25 pods can be simultaneously unavailable.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0  # Never reduce capacity during rollout
  minReadySeconds: 30    # Pod must be Ready for 30s before considered Available

Rolling Update with maxUnavailable: 0
=======================================
Initial: 10 pods v1, all Ready
Step 1: Create 1 new pod (v2)
v1: [R] [R] [R] [R] [R] [R] [R] [R] [R] [R]
v2: [Starting...]
Serving capacity: 10 pods (100%)
Step 2: v2 pod passes readiness, waits minReadySeconds
v1: [R] [R] [R] [R] [R] [R] [R] [R] [R] [R]
v2: [R] (waiting 30s)
Serving capacity: 11 pods (110%)
Step 3: v2 pod available (30s elapsed), terminate 1 v1 pod
v1: [R] [R] [R] [R] [R] [R] [R] [R] [R] [Terminating]
v2: [R]
Serving capacity: 10 pods (100%)
Step 4: Repeat until all pods are v2
Capacity never drops below 100%
Compare with maxUnavailable: 25%:
Step 1: Immediately terminate 2-3 v1 pods, start v2 pods
Capacity drops to 70-80% during transition
If v2 has a startup bug: 25% capacity lost to broken pods
minReadySeconds provides a safety window. Pods that start successfully but crash after 10-15 seconds (due to configuration errors, nil pointer dereferences, or dependency failures) are caught before old pods are terminated.
Observability
Metrics, Logs, and Traces Architecture
Observability Pipeline
========================
+-------------------+
| Application Pods |
| (instrumented |
| with OTel SDK) |
+---+------+-----+--+
| | |
| | +-------> Traces (OTLP/gRPC)
| | |
| +-----------> Logs (stdout/stderr)
| | |
+-----------------> Metrics (Prometheus scrape)
| | |
v v v
+----------+ +----------+ +----------+
|Prometheus | | Fluent | | OTel |
| (scrape) | | Bit | | Collector|
| | | DaemonSet| | |
+-----+----+ +-----+----+ +-----+----+
| | |
v v v
+----------+ +----------+ +----------+
| Thanos | | Loki | | Tempo |
| (long | | (log | | (trace |
| term) | | store) | | store) |
+-----+----+ +-----+----+ +-----+----+
| | |
+-------+-------+-------+-----+
|
v
+---------------+
| Grafana |
| Correlation: |
| trace_id ties |
| metrics, logs,|
| traces into a |
| single view |
+---------------+
Metrics Cardinality
Prometheus stores one time series per unique label combination. Unbounded labels (user IDs, request IDs, IP addresses) cause cardinality explosion.
Cardinality Calculation
========================
Labels: {method, endpoint, status_bucket}
method: 4 values (GET, POST, PUT, DELETE)
endpoint: 50 values (/api/v1/users, /api/v1/orders, ...)
status_bucket: 5 values (2xx, 3xx, 4xx, 5xx, other)
Total series: 4 x 50 x 5 = 1,000 series per metric
At 15s scrape interval: 1,000 x 4 samples/min = manageable
Adding user_id label (1,000,000 unique users):
Total series: 4 x 50 x 5 x 1,000,000 = 1,000,000,000 series
Prometheus memory: ~2 KB per series = 2 TB RAM
Result: OOM crash
Rules for label cardinality:
- Labels should have bounded, low-cardinality values (typically fewer than 100 unique values per label).
- High-cardinality identifiers (user_id, trace_id, request_id) belong in logs and traces, not metrics.
- Use status_bucket (2xx, 3xx, 4xx, 5xx) instead of raw status_code (200, 201, 204, 301, ...).
Structured Logging
All log output should be structured JSON with a trace_id field for correlation:
{
  "level": "info",
  "ts": "2025-06-22T14:30:05.123Z",
  "msg": "request processed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "service": "api-gateway",
  "method": "POST",
  "path": "/api/v1/orders",
  "status": 201,
  "latency_ms": 45,
  "bytes_out": 892
}

The trace_id enables cross-system correlation: identify a latency spike in metrics, query logs filtered by trace_id from that time window, view the distributed trace to identify the slow component.
Trace Sampling Strategies
At high traffic volumes, sampling 100% of traces is neither practical nor necessary.
| Strategy | Sample Rate | Use Case |
|---|---|---|
| Always sample errors | 100% of 5xx | Root cause analysis for failures |
| Tail-based (latency) | 100% of p95+ | Performance regression detection |
| Probabilistic | 1-5% | Baseline visibility, architecture mapping |
| Rate-limited | N traces/sec | Cost control with guaranteed minimum |
# OpenTelemetry Collector sampling configuration
processors:
  tail_sampling:
    policies:
    - name: errors
      type: status_code
      status_code: {status_codes: [ERROR]}
    - name: slow-requests
      type: latency
      latency: {threshold_ms: 500}
    - name: probabilistic
      type: probabilistic
      probabilistic: {sampling_percentage: 5}

Debugging Tools and Commands
kubectl Commands for Incident Response
# Node resource allocation vs actual usage
kubectl describe node <node-name> | grep -A 20 "Allocated resources"
kubectl top node <node-name>
# Pods not in Running state, across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running
# Recent cluster events, sorted by timestamp
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
# Endpoints for a service (empty = selector mismatch or no ready pods)
kubectl get endpoints <service-name> -o yaml
# Resource usage ranked by CPU
kubectl top pods -A --sort-by=cpu | head -20
# Verify environment variables injected into a pod
kubectl exec -it <pod> -- env | sort
# Check DNS resolution from inside a pod
kubectl exec -it <pod> -- nslookup kubernetes.default.svc.cluster.local
# Inspect a pod's resolved DNS configuration
kubectl exec -it <pod> -- cat /etc/resolv.conf
stern: Multi-Pod Log Tailing
# Tail logs from all pods matching a name pattern
stern api-server --since 5m --namespace production
# Tail logs from all containers in pods with a specific label
stern --selector app=api-server --all-namespaces --since 10m
# Output as JSON for piping to jq
stern api-server -o json | jq 'select(.message | contains("error"))'
kubectl debug: Ephemeral Debug Containers
# Attach a debug container to a running pod (distroless/minimal images)
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=app
# Debug node-level issues
kubectl debug node/<node-name> -it --image=ubuntu
# Common debug operations inside netshoot container:
# tcpdump -i eth0 port 8080 # Packet capture
# curl -v http://service.namespace:80 # HTTP connectivity
# nslookup api.stripe.com # DNS resolution
# ss -tlnp # Open TCP listeners
# conntrack -L | wc -l                    # Conntrack entry count
k9s: Terminal UI
k9s provides a terminal-based interface for Kubernetes cluster management. Key bindings:
:pods Navigate to pod list
:deploy Navigate to deployments
:svc Navigate to services
:events Navigate to events
:ns Switch namespace
/ Filter current view
l View logs for selected pod
s Shell into selected pod
d Describe selected resource
ctrl-d Delete selected resource
y View YAML for selected resource
ctrl-a Show all available resource types
Scheduler Behavior at Scale
Overcommit and QoS Classes
The scheduler evaluates only requests during the filtering phase. When limits exceed requests, the node can become overcommitted.
Overcommit Scenario
=====================
Node allocatable: 3700m (4 CPU node minus system reservations)
Pod configuration: requests=100m, limits=2000m
Scheduler perspective:
  3700m allocatable / 100m per pod = 37 pods can fit
Actual worst case:
  37 pods x 2000m limit = 74 CPU of demand on ~4 CPU of hardware
CPU contention ratio: 18.5:1
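The arithmetic in this scenario is simple enough to check directly; the values below are the ones from the scenario above:

```shell
# Overcommit arithmetic: how many pods pass the scheduler's filtering phase,
# and what their combined limits could demand in the worst case.
allocatable_m=3700   # node allocatable CPU, in millicores
request_m=100        # per-pod CPU request
limit_m=2000         # per-pod CPU limit

pods=$(( allocatable_m / request_m ))    # pods the scheduler will place
worst_case_m=$(( pods * limit_m ))       # demand if every pod runs at its limit

echo "pods=$pods worst_case_cpu_m=$worst_case_m"
# prints: pods=37 worst_case_cpu_m=74000
```

The same calculation is worth running against real workloads: sum the limits of all pods on a node and divide by allocatable CPU to get the node's contention ratio under worst-case load.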
QoS class determines eviction priority:
Guaranteed (requests == limits):
- Highest priority
- Last to be evicted under node pressure
- Predictable performance, no bursting
Burstable (requests < limits, or requests set without limits):
- Medium priority
- Evicted after BestEffort pods under node pressure
- Can consume beyond requests when the node has spare capacity
BestEffort (no requests or limits):
- Lowest priority
- First to be evicted under any resource pressure
- Eviction happens immediately, no grace period
For services where latency predictability is more important than cost efficiency, use Guaranteed QoS (set requests equal to limits for both CPU and memory). The tradeoff is higher resource reservation with no ability to burst beyond the allocated amount.
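As a minimal sketch of a Guaranteed-QoS resource block (the sizing values are illustrative, not a recommendation):

```yaml
resources:
  requests:
    cpu: 1000m
    memory: 512Mi
  limits:
    cpu: 1000m       # equal to requests for both resources
    memory: 512Mi    # -> Guaranteed QoS class
```

Note that every container in the pod must set requests equal to limits for both CPU and memory; one Burstable container downgrades the whole pod's QoS class.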
Priority Classes
Priority classes influence both scheduling order and eviction behavior:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For production-critical services that must be scheduled"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-workload
value: 100
globalDefault: false
preemptionPolicy: Never
description: "For batch jobs that should not preempt other pods"
When a high-priority pod cannot be scheduled due to resource constraints, the scheduler can preempt (evict) lower-priority pods to make room. Setting preemptionPolicy: Never on batch workloads prevents them from preempting other pods to get themselves scheduled, while still allowing them to be preempted by higher-priority pods.
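A workload opts into one of these classes by name in its pod spec. A minimal sketch, where the pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout-api                      # illustrative name
spec:
  priorityClassName: critical-service     # references the PriorityClass above
  containers:
    - name: app
      image: registry.example.com/app:1.0 # illustrative image
```

For Deployments and StatefulSets, the same field goes in the pod template (`spec.template.spec.priorityClassName`).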
Summary of Critical Configurations
| Area | Configuration | Impact |
|---|---|---|
| CPU limits | Omit for latency-sensitive pods | Prevents CFS throttling |
| Memory limits | Always set, 1.3-2x working set | Prevents uncontrolled OOM |
| DNS ndots | Set to 2 | Reduces DNS queries by 80% |
| DNS single-request-reopen | Enable | Eliminates 5s timeout from conntrack race |
| NodeLocal DNS | Deploy DaemonSet | Eliminates cross-node DNS latency |
| HPA minReplicas | Baseline + 50% headroom | Absorbs spike during scaling delay |
| HPA scale-down | 300s stabilization | Prevents flapping |
| PDB | maxUnavailable: 1 | Protects during voluntary disruptions |
| Topology spread | hostname: DoNotSchedule | Survives single node failure |
| preStop hook | sleep 5 | Allows endpoint removal propagation |
| terminationGracePeriod | preStop + drain + buffer | Prevents SIGKILL during drain |
| maxUnavailable (rollout) | 0 | Capacity never drops during deploy |
| minReadySeconds | 30 | Catches pods that crash shortly after start |
| conntrack max | Scale with pod count | Prevents silent connection drops |