The Cost of Downtime
Downtime during deployments is measurable. A 2023 Gartner study puts the average cost of IT downtime at $5,600 per minute for mid-size enterprises. For a deployment that takes 3 minutes of full unavailability, that is $16,800 per deploy. Organizations deploying daily accumulate over $6 million in annual downtime cost from deployments alone.
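The arithmetic behind these figures is worth making explicit; a quick sketch (the per-minute rate is the study's figure, the deploy duration and frequency are the illustrative assumptions from above):

```typescript
// Illustrative downtime-cost arithmetic. The $5,600/min rate is the
// cited study figure; duration and frequency are assumptions.
const COST_PER_MINUTE_USD = 5_600;

function costPerDeploy(downtimeMinutes: number): number {
  return downtimeMinutes * COST_PER_MINUTE_USD;
}

function annualCost(downtimeMinutes: number, deploysPerYear: number): number {
  return costPerDeploy(downtimeMinutes) * deploysPerYear;
}

// 3 minutes of unavailability per deploy, one deploy per day:
console.log(costPerDeploy(3));   // 16800
console.log(annualCost(3, 365)); // 6132000
```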
The technical causes are well understood:
- Process restart gaps where no instance is serving traffic
- Database schema locks blocking read/write operations
- Connection pool exhaustion during instance rotation
- DNS propagation delays when switching backends
- Health check failures during application warm-up
Each of these is solvable. The solutions fall into a small number of deployment strategies, each with distinct tradeoffs in complexity, resource cost, and rollback speed.
Deployment Strategy Overview
Strategy Comparison Matrix
                      Blue-Green   Canary        Rolling      Recreate
                      ──────────   ──────        ───────      ────────
Downtime              Zero         Zero          Zero         Yes
Resource overhead     2x           1x + canary   1x + 1 pod   1x
Rollback speed        Seconds      Seconds       Minutes      Minutes
Infrastructure        Moderate     High          Low          Low
  complexity
Database coupling     High         High          High         Low
Traffic control       Binary       Granular      None         None
Cost (relative)       1.8x         1.1x          1.0x         1.0x
Blue-Green Deployments
Blue-green deployment maintains two identical production environments. At any time, one environment (blue) serves all production traffic while the other (green) sits idle or serves as a staging target.
Blue-Green Traffic Flow
         Load Balancer / Ingress
                  │
                  │ 100% traffic
                  ▼
   ┌──────────────┐       ┌──────────────┐
   │    BLUE      │       │    GREEN     │
   │   (v1.2)     │       │   (v1.3)     │
   │   ACTIVE     │       │    IDLE      │
   │              │       │              │
   │   Pod 1      │       │   Pod 1      │
   │   Pod 2      │       │   Pod 2      │
   │   Pod 3      │       │   Pod 3      │
   └──────┬───────┘       └──────┬───────┘
          │                      │
          ▼                      ▼
   ┌─────────────────────────────────────┐
   │           Shared Database           │
   │         (must be compatible         │
   │          with both versions)        │
   └─────────────────────────────────────┘

After the switch:

         Load Balancer / Ingress
                  │
                  │ 100% traffic
                  ▼
   ┌──────────────┐       ┌──────────────┐
   │    BLUE      │       │    GREEN     │
   │   (v1.2)     │       │   (v1.3)     │
   │    IDLE      │       │   ACTIVE     │
   │              │       │              │
   │   Pod 1      │       │   Pod 1      │
   │   Pod 2      │       │   Pod 2      │
   │   Pod 3      │       │   Pod 3      │
   └──────────────┘       └──────────────┘
Kubernetes Implementation
A basic blue-green deployment in Kubernetes uses two Deployments and a Service selector switch.
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-blue
labels:
app: myapp
slot: blue
spec:
replicas: 3
selector:
matchLabels:
app: myapp
slot: blue
template:
metadata:
labels:
app: myapp
slot: blue
version: v1.2.0
spec:
containers:
- name: app
image: registry.example.com/myapp:v1.2.0
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 3
failureThreshold: 3
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi

# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-green
labels:
app: myapp
slot: green
spec:
replicas: 3
selector:
matchLabels:
app: myapp
slot: green
template:
metadata:
labels:
app: myapp
slot: green
version: v1.3.0
spec:
containers:
- name: app
image: registry.example.com/myapp:v1.3.0
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 3
failureThreshold: 3
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi

# service.yaml
apiVersion: v1
kind: Service
metadata:
name: myapp
spec:
selector:
app: myapp
slot: blue # Change to "green" to switch traffic
ports:
- port: 80
targetPort: 8080

The switch is a single kubectl patch command:
# Switch traffic from blue to green
kubectl patch service myapp \
-p '{"spec":{"selector":{"slot":"green"}}}'
# Rollback: switch back to blue
kubectl patch service myapp \
-p '{"spec":{"selector":{"slot":"blue"}}}'

Rollback time with this approach is under 2 seconds, limited only by kube-proxy propagation.
Blue-Green Limitations
The primary constraint is resource cost. Two full environments double compute expenses during the deployment window. For a service running 20 pods at 1 vCPU each, a blue-green deployment requires 40 vCPUs reserved. If deployments happen once per day and take 30 minutes of validation before the idle environment is scaled down, the overhead is (20 vCPUs * 0.5 hours * 30 days) = 300 vCPU-hours per month.
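That overhead calculation generalizes to any fleet size and validation window; a minimal sketch, using the figures from the example above:

```typescript
// Blue-green compute overhead: vCPUs reserved by the idle environment
// during post-deploy validation. All inputs match the example above.
function monthlyOverheadVcpuHours(
  idleVcpus: number,
  validationHours: number,
  deploysPerMonth: number
): number {
  return idleVcpus * validationHours * deploysPerMonth;
}

// 20 idle vCPUs, 30-minute validation window, daily deploys:
console.log(monthlyOverheadVcpuHours(20, 0.5, 30)); // 300 vCPU-hours/month
```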
The second constraint is database compatibility. Both environments share the same database, so schema changes must be backward-compatible across both versions simultaneously.
Canary Deployments
Canary deployments route a small percentage of traffic to the new version, then gradually increase that percentage while monitoring error rates, latency, and business metrics.
Canary Traffic Progression Timeline
Time     Traffic split         Error rate   Action
────     ─────────────         ──────────   ──────
T+0      v1: 100%   v2: 0%     0.01%        Deploy canary
T+5m     v1: 95%    v2: 5%     0.01%        Monitor
T+15m    v1: 90%    v2: 10%    0.02%        Within threshold
T+30m    v1: 75%    v2: 25%    0.02%        Promote
T+45m    v1: 50%    v2: 50%    0.01%        Promote
T+60m    v1: 25%    v2: 75%    0.01%        Promote
T+75m    v1: 0%     v2: 100%   0.01%        Complete
Canary Architecture
   ┌──────────────────────────────┐
   │      Ingress Controller      │
   │  (traffic splitting logic)   │
   └──────┬───────────────┬───────┘
          │               │
     95% traffic      5% traffic
          │               │
          ▼               ▼
   ┌────────────┐   ┌────────────┐
   │   Stable   │   │   Canary   │
   │   (v1.2)   │   │   (v1.3)   │
   │            │   │            │
   │  19 pods   │   │   1 pod    │
   └──────┬─────┘   └─────┬──────┘
          │               │
          ▼               ▼
   ┌──────────────────────────────┐
   │   Metrics / Observability    │
   │ (Prometheus, Datadog, etc.)  │
   └──────────────────────────────┘
Argo Rollouts Canary Configuration
Argo Rollouts provides a Kubernetes-native canary deployment controller with automated analysis and promotion.
# canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: myapp
spec:
replicas: 10
revisionHistoryLimit: 3
selector:
matchLabels:
app: myapp
strategy:
canary:
canaryService: myapp-canary
stableService: myapp-stable
trafficRouting:
istio:
virtualService:
name: myapp-vsvc
routes:
- primary
analysis:
templates:
- templateName: success-rate
- templateName: latency-check
startingStep: 2
args:
- name: service-name
value: myapp-canary.default.svc.cluster.local
steps:
- setWeight: 5
- pause: { duration: 5m }
- setWeight: 10
- pause: { duration: 5m }
- setWeight: 25
- pause: { duration: 10m }
- setWeight: 50
- pause: { duration: 10m }
- setWeight: 75
- pause: { duration: 5m }
- setWeight: 100
template:
metadata:
labels:
app: myapp
spec:
containers:
- name: app
image: registry.example.com/myapp:v1.3.0
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 3
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 15
periodSeconds: 10

Automated Canary Analysis
Argo Rollouts supports AnalysisTemplate resources that query Prometheus (or other backends) to determine whether the canary is healthy.
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate
spec:
args:
- name: service-name
metrics:
- name: success-rate
interval: 60s
successCondition: result[0] >= 0.995
failureLimit: 3
provider:
prometheus:
address: http://prometheus.monitoring:9090
query: |
sum(rate(
http_requests_total{
service="{{args.service-name}}",
status=~"2.."
}[2m]
)) /
sum(rate(
http_requests_total{
service="{{args.service-name}}"
}[2m]
))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: latency-check
spec:
args:
- name: service-name
metrics:
- name: p99-latency
interval: 60s
successCondition: result[0] <= 500
failureLimit: 3
provider:
prometheus:
address: http://prometheus.monitoring:9090
query: |
histogram_quantile(0.99,
sum(rate(
http_request_duration_milliseconds_bucket{
service="{{args.service-name}}"
}[2m]
)) by (le)
)

When the success rate drops below 99.5% or p99 latency exceeds 500ms for 3 failed measurements (the failureLimit), Argo Rollouts automatically aborts the rollout and scales the canary to zero.
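The analysis decision itself is simple arithmetic over the two Prometheus sums; a sketch of the same logic in TypeScript (the counter values are invented for illustration):

```typescript
// Mirrors the success-rate AnalysisTemplate: pass when the ratio of 2xx
// requests to all requests over the window meets the threshold.
interface RateWindow {
  total: number; // sum(rate(http_requests_total[2m]))
  ok: number;    // same, restricted to status=~"2.."
}

function successRate(w: RateWindow): number {
  return w.total === 0 ? 1 : w.ok / w.total;
}

// successCondition: result[0] >= 0.995
function canaryHealthy(w: RateWindow, threshold = 0.995): boolean {
  return successRate(w) >= threshold;
}

console.log(canaryHealthy({ total: 10_000, ok: 9_970 })); // 0.997 -> true
console.log(canaryHealthy({ total: 10_000, ok: 9_900 })); // 0.990 -> false
```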
Istio Traffic Splitting
For fine-grained traffic control, Istio VirtualService resources define exact traffic weights.
# istio-virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: myapp-vsvc
spec:
hosts:
- myapp.example.com
gateways:
- myapp-gateway
http:
- name: primary
route:
- destination:
host: myapp-stable
port:
number: 80
weight: 95
- destination:
host: myapp-canary
port:
number: 80
weight: 5

# istio-destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: myapp
spec:
host: myapp.example.com
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
h2UpgradePolicy: DEFAULT
http1MaxPendingRequests: 100
http2MaxRequests: 1000
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
subsets:
- name: stable
labels:
version: stable
- name: canary
labels:
version: canary

Header-Based Canary Routing
For internal testing before exposing the canary to real users, route traffic based on request headers:
# header-based-routing.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: myapp-vsvc
spec:
hosts:
- myapp.example.com
http:
- match:
- headers:
x-canary:
exact: "true"
route:
- destination:
host: myapp-canary
port:
number: 80
- route:
- destination:
host: myapp-stable
port:
number: 80

This allows QA teams to test the canary in production by adding x-canary: true to their requests while all other traffic continues to reach the stable version.
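The match rule reduces to a simple predicate; mirrored in TypeScript for illustration (the backend names are the Service hosts from the VirtualService above):

```typescript
// Route to the canary only when the x-canary header is exactly "true",
// matching the Istio match rule; everything else goes to stable.
function selectBackend(headers: Record<string, string>): string {
  return headers["x-canary"] === "true" ? "myapp-canary" : "myapp-stable";
}

console.log(selectBackend({ "x-canary": "true" })); // myapp-canary
console.log(selectBackend({}));                     // myapp-stable
console.log(selectBackend({ "x-canary": "1" }));    // myapp-stable (exact match only)
```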
Feature Flags
Feature flags decouple deployment from release. Code ships to production in a disabled state, then gets enabled for specific users, percentages, or conditions without redeployment.
Feature Flag Deployment Model
Deploy v1.3 (feature disabled)     Enable for 5% of users
──────────────────────────────     ──────────────────────
              │                              │
              ▼                              ▼
      ┌───────────────┐            ┌───────────────────┐
      │   All Pods    │            │   All Pods        │
      │   run v1.3    │            │   run v1.3        │
      │               │            │                   │
      │   Feature X:  │            │   Feature X:      │
      │      OFF      │            │   5% of users ON  │
      └───────────────┘            └───────────────────┘
              │                              │
       No user sees                    Flag service
        the feature                  controls rollout

Enable for 50%       Enable for 100%      Remove flag
──────────────       ───────────────      ───────────
       │                    │                  │
       ▼                    ▼                  ▼
┌───────────────┐   ┌───────────────┐  ┌───────────────┐
│  Feature X:   │   │  Feature X:   │  │  Feature X:   │
│    50% ON     │   │   100% ON     │  │  Code cleaned │
│               │   │               │  │  Flag removed │
└───────────────┘   └───────────────┘  └───────────────┘
Implementation Pattern
// feature-flag-service.ts
interface FeatureFlag {
name: string;
enabled: boolean;
percentage: number; // 0-100
allowlist: string[]; // user IDs always included
blocklist: string[]; // user IDs always excluded
metadata: Record<string, string>;
}
class FeatureFlagService {
private flags: Map<string, FeatureFlag>;
private cache: Map<string, { value: boolean; expiry: number }>;
private readonly cacheTTL = 30_000; // 30 seconds
constructor(private readonly provider: FlagProvider) {
this.flags = new Map();
this.cache = new Map();
}
async isEnabled(
flagName: string,
userId: string,
attributes: Record<string, string> = {}
): Promise<boolean> {
const cacheKey = `${flagName}:${userId}`;
const cached = this.cache.get(cacheKey);
if (cached && cached.expiry > Date.now()) {
return cached.value;
}
const flag = await this.provider.getFlag(flagName);
if (!flag || !flag.enabled) return false;
if (flag.blocklist.includes(userId)) return false;
if (flag.allowlist.includes(userId)) return true;
// Deterministic percentage based on user ID hash
const hash = this.hashUserId(userId, flagName);
const bucket = hash % 100;
const result = bucket < flag.percentage;
this.cache.set(cacheKey, {
value: result,
expiry: Date.now() + this.cacheTTL,
});
return result;
}
private hashUserId(userId: string, salt: string): number {
let hash = 0;
const input = `${userId}:${salt}`;
for (let i = 0; i < input.length; i++) {
const char = input.charCodeAt(i);
hash = ((hash << 5) - hash) + char;
hash = hash & hash; // Convert to 32-bit integer
}
return Math.abs(hash);
}
}

// usage in request handler
async function handleRequest(req: Request): Promise<Response> {
const userId = req.headers.get("x-user-id") ?? "anonymous";
const useNewCheckout = await featureFlags.isEnabled(
"new-checkout-flow",
userId,
{ region: req.headers.get("x-region") ?? "us-east-1" }
);
if (useNewCheckout) {
return newCheckoutHandler(req);
}
return legacyCheckoutHandler(req);
}

Feature Flag Lifecycle
Feature flags require active management. Stale flags accumulate as technical debt. A typical lifecycle:
Flag Lifecycle Timeline
Week 0      Week 1-2     Week 3-4     Week 5-6      Week 8
──────      ────────     ────────     ────────      ──────
Create      Ramp up      Full         Cleanup       Flag
flag        1% -> 50%    rollout      period        removed
  │            │            │            │             │
  ▼            ▼            ▼            ▼             ▼
Code        Monitor      100% of      Remove        Dead code
merged      metrics      users on     flag checks   removed
with flag   and errors   new path     from code     from codebase
Organizations with more than 200 active feature flags at any time typically report increased incident rates from flag interaction bugs. A reasonable target is fewer than 50 active flags, with a maximum lifespan of 90 days per flag.
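One property of the hashUserId implementation shown earlier is worth checking in isolation: the bucket is deterministic per (user, flag) pair, so a user's experience is stable across requests, and salting with the flag name keeps buckets uncorrelated across flags. A standalone sketch of the same hash:

```typescript
// Same 32-bit string hash as FeatureFlagService.hashUserId, extracted
// so the bucketing property can be demonstrated on its own.
function bucketFor(userId: string, flagName: string): number {
  const input = `${userId}:${flagName}`;
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash << 5) - hash + input.charCodeAt(i);
    hash = hash & hash; // keep it a 32-bit integer
  }
  return Math.abs(hash) % 100;
}

const a = bucketFor("user-42", "new-checkout-flow");
const b = bucketFor("user-42", "new-checkout-flow");
console.log(a === b);           // true: stable across calls
console.log(a >= 0 && a < 100); // true: always a valid percentage bucket
```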
Database Migrations in Zero-Downtime Deployments
Database schema changes are the most common source of deployment-related downtime. An ALTER TABLE that forces a full-table scan or rewrite under an ACCESS EXCLUSIVE lock — adding a column with a volatile default, or validating a NOT NULL constraint — can block all reads and writes on a 50-million-row table in PostgreSQL for tens of seconds.
The Expand-Contract Pattern
Safe schema changes follow a multi-phase approach across multiple deployments.
Expand-Contract Migration Timeline
Deploy 1 (Expand)         Deploy 2 (Migrate)        Deploy 3 (Contract)
─────────────────         ──────────────────        ───────────────────
Add new column            Backfill data             Drop old column
(nullable, no default)    Write to both columns     Remove old code paths
Deploy code that          Read from new column,
writes to BOTH columns,   fall back to old column
reads from OLD column

Database State:

Deploy 1 (users table):
┌────────┬───────────┬───────────────────┐
│ id     │ email     │ email_verified    │
│ (old)  │ (old)     │ (NEW, nullable)   │
├────────┼───────────┼───────────────────┤
│ 1      │ a@b.com   │ NULL              │
│ 2      │ c@d.com   │ NULL              │
└────────┴───────────┴───────────────────┘

Deploy 2 (users table):
┌────────┬───────────┬───────────────────┐
│ id     │ email     │ email_verified    │
│ (old)  │ (old)     │ (backfilled)      │
├────────┼───────────┼───────────────────┤
│ 1      │ a@b.com   │ true              │
│ 2      │ c@d.com   │ false             │
└────────┴───────────┴───────────────────┘

Deploy 3 (users table):
┌────────┬───────────┬───────────────────┐
│ id     │ email     │ email_verified    │
│        │           │ (NOT NULL)        │
├────────┼───────────┼───────────────────┤
│ 1      │ a@b.com   │ true              │
│ 2      │ c@d.com   │ false             │
└────────┴───────────┴───────────────────┘
Safe Migration Examples
Adding a column (PostgreSQL):
-- Deploy 1: Add nullable column (instant; takes only a brief lock)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- Deploy 2: Backfill in batches (no lock, runs in background)
UPDATE users
SET email_verified = false
WHERE email_verified IS NULL
AND id BETWEEN 1 AND 10000;
-- Repeat for each batch of 10,000 rows
-- Batch size depends on table write rate and acceptable replication lag
-- Deploy 3: Add NOT NULL constraint
-- A plain SET NOT NULL scans the table under an ACCESS EXCLUSIVE lock.
-- Instead, add a NOT VALID check constraint and validate it (a weaker
-- lock); PostgreSQL 12+ then skips the scan when setting NOT NULL.
ALTER TABLE users
ADD CONSTRAINT users_email_verified_not_null
CHECK (email_verified IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_email_verified_not_null;
ALTER TABLE users
ALTER COLUMN email_verified SET NOT NULL;

Renaming a column:
-- UNSAFE: This breaks all running application instances immediately
ALTER TABLE users RENAME COLUMN email TO email_address;
-- SAFE: Three-deploy approach
-- Deploy 1: Add new column, dual-write
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
CREATE OR REPLACE FUNCTION sync_email_columns()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
IF NEW.email_address IS NULL THEN
NEW.email_address := NEW.email;
END IF;
IF NEW.email IS NULL THEN
NEW.email := NEW.email_address;
END IF;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_sync_email
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_email_columns();
-- Deploy 2: Backfill, switch reads to new column
UPDATE users SET email_address = email WHERE email_address IS NULL;
-- Deploy 3: Drop old column and trigger
DROP TRIGGER trg_sync_email ON users;
DROP FUNCTION sync_email_columns();
ALTER TABLE users DROP COLUMN email;

Large Table Migration Benchmarks
Migration performance varies by table size and operation type. These benchmarks are from PostgreSQL 15 on a db.r6g.xlarge (4 vCPU, 32 GB RAM) RDS instance.
Operation                     10M rows   50M rows   200M rows
─────────────────────────     ────────   ────────   ─────────
ADD COLUMN (nullable)         < 10ms     < 10ms     < 10ms
ADD COLUMN (with default*)    < 10ms     < 10ms     < 10ms
ADD NOT NULL constraint       2.1s       9.8s       41.3s
CREATE INDEX CONCURRENTLY     45s        3m 20s     14m 10s
Backfill (batch of 10k)       120ms      120ms      120ms
Full backfill (all batches)   2m 10s     10m 30s    43m 15s

* PostgreSQL 11+ does not rewrite the table for ADD COLUMN with a constant (non-volatile) DEFAULT
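The batched backfill above is typically driven by a small loop that walks the key space; a sketch of the range computation (batch size and max id are illustrative):

```typescript
// Split [1, maxId] into inclusive ranges of at most batchSize rows,
// mirroring the "id BETWEEN lo AND hi" batches in the SQL above.
function backfillRanges(maxId: number, batchSize: number): [number, number][] {
  const ranges: [number, number][] = [];
  for (let lo = 1; lo <= maxId; lo += batchSize) {
    ranges.push([lo, Math.min(lo + batchSize - 1, maxId)]);
  }
  return ranges;
}

console.log(backfillRanges(25_000, 10_000));
// [ [1, 10000], [10001, 20000], [20001, 25000] ]
```

In practice each range becomes one UPDATE, with a short sleep between batches so replication lag stays within bounds.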
Connection Draining
When removing an old instance from a load balancer, in-flight requests must complete before the instance shuts down. Dropping active connections results in HTTP 502 errors visible to users.
Connection Draining Timeline
Load balancer marks instance as "draining"
        │
        ▼
Time ───────────────────────────────────────────────▶

New requests:         ──X (stopped immediately)

In-flight req 1:      ────────┐
                              └─ completes (200 OK)

In-flight req 2:      ──────────────┐
                                    └─ completes

In-flight req 3:      ───────────────────────┐
                                             └─ done

Drain timeout (30s):  ───────────────────────────────X
                                               SIGTERM
Kubernetes Graceful Shutdown
Kubernetes sends SIGTERM to a pod, then waits terminationGracePeriodSeconds before sending SIGKILL. The application must handle SIGTERM by stopping acceptance of new connections and completing in-flight work.
# pod-with-graceful-shutdown.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
replicas: 5
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
terminationGracePeriodSeconds: 60
containers:
- name: app
image: registry.example.com/myapp:v1.3.0
ports:
- containerPort: 8080
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- "sleep 5"
# The preStop sleep gives the Endpoints controller
# time to remove this pod from the Service before
# the kubelet sends SIGTERM to the application.

The 5-second preStop sleep is important. Without it, there is a race condition: the kubelet sends SIGTERM to the container at the same time as the Endpoints controller updates the Service. If the application stops accepting connections before kube-proxy removes it from the Service, new requests route to a closed port and return 502.
Application-Level Drain Handling
// graceful shutdown in Go
package main
import (
"context"
"log"
"net/http"
"os"
"os/signal"
"syscall"
"time"
)
func main() {
srv := &http.Server{
Addr: ":8080",
Handler: routes(),
ReadTimeout: 10 * time.Second,
WriteTimeout: 30 * time.Second,
IdleTimeout: 60 * time.Second,
}
// Channel to listen for OS signals
quit := make(chan os.Signal, 1)
signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
go func() {
log.Printf("server starting on %s", srv.Addr)
if err := srv.ListenAndServe(); err != http.ErrServerClosed {
log.Fatalf("server error: %v", err)
}
}()
// Block until signal received
sig := <-quit
log.Printf("received signal %s, starting graceful shutdown", sig)
// Create deadline context for shutdown
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
// Shutdown stops accepting new connections and waits
// for in-flight requests to complete
if err := srv.Shutdown(ctx); err != nil {
log.Printf("forced shutdown: %v", err)
}
log.Println("server stopped")
}

Rolling Deployments
Rolling deployments replace pods one at a time (or in small batches). This is the default Kubernetes deployment strategy.
Rolling Update Progression (5 replicas, maxSurge=1, maxUnavailable=0)
Step 1:  [v1] [v1] [v1] [v1] [v1] [v2 starting]
Step 2:  [v1] [v1] [v1] [v1] [v2]               ← v2 ready, terminate one v1
Step 3:  [v1] [v1] [v1] [v2] [v2 starting]
Step 4:  [v1] [v1] [v1] [v2] [v2]               ← terminate next v1
Step 5:  [v1] [v1] [v2] [v2] [v2 starting]
...
Step 9:  [v1] [v2] [v2] [v2] [v2]               ← terminate last v1
Step 10: [v2] [v2] [v2] [v2] [v2]               ← rollout complete
# rolling-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
replicas: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # At most 1 extra pod during update
maxUnavailable: 0 # All existing pods must stay available
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
containers:
- name: app
image: registry.example.com/myapp:v1.3.0
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3
successThreshold: 2
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30
periodSeconds: 2
# Allows up to 60 seconds for application startup

Rolling Update Tradeoffs
The rolling strategy has lower resource overhead (only 1 extra pod) but creates a period where both v1 and v2 serve traffic simultaneously. For APIs with breaking changes between versions, this mixed state can cause client errors. The total rollout time scales linearly with replica count.
Rollout Duration Estimates (maxSurge=1, maxUnavailable=0)
Replicas   Startup time   Total rollout
────────   ────────────   ─────────────
3          10s            ~45s
5          10s            ~75s
10         10s            ~2m 30s
20         10s            ~5m
50         10s            ~12m 30s
100        10s            ~25m
Setting maxSurge: 25% and maxUnavailable: 25% reduces rollout time by approximately 4x at the cost of temporary capacity reduction.
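The duration table follows from a simple model: with maxUnavailable=0, the rollout proceeds in batches of maxSurge pods, and each batch costs roughly the startup time plus some readiness-probe settling. A sketch (the 5-second settling figure is an assumption chosen to match the table above):

```typescript
// Rough rolling-update duration model: ceil(replicas / maxSurge) batches,
// each taking startup time plus ~5s of readiness-probe settling (assumed).
function estimateRolloutSeconds(
  replicas: number,
  maxSurge: number,
  startupSeconds: number,
  settleSeconds = 5
): number {
  const batches = Math.ceil(replicas / maxSurge);
  return batches * (startupSeconds + settleSeconds);
}

console.log(estimateRolloutSeconds(5, 1, 10));   // 75   (~75s in the table)
console.log(estimateRolloutSeconds(20, 1, 10));  // 300  (~5m)
console.log(estimateRolloutSeconds(100, 1, 10)); // 1500 (~25m)
```

Real rollouts also pay image-pull and scheduling latency per batch, so treat this as a lower bound.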
Health Checks and Readiness Gates
Proper health checks prevent traffic from reaching instances that are not ready to serve. Kubernetes distinguishes three probe types:
Probe Types and Their Effects
Probe       Failure action           Use case
─────       ──────────────           ────────
startup     Delay other probes       Slow-starting apps (JVM warm-up,
                                     cache loading, model loading)
readiness   Remove from Service      Temporarily unable to serve
            endpoints                (DB connection lost, downstream
                                     dependency down)
liveness    Restart the container    Deadlocked process, corrupted
                                     state, unrecoverable error
A common misconfiguration: using liveness probes for transient failures. If a database is temporarily unreachable, a liveness probe failure restarts the container. But restarting does not fix the database. The result is a crash loop that amplifies the outage.
Correct approach: the readiness probe returns unhealthy when the database is unreachable (removing the pod from the Service), while the liveness probe only fails when the process itself is unresponsive.
# health-check-configuration.yaml
containers:
- name: app
image: registry.example.com/myapp:v1.3.0
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30
periodSeconds: 2
readinessProbe:
httpGet:
path: /ready
port: 8080
periodSeconds: 5
failureThreshold: 2
successThreshold: 2
livenessProbe:
httpGet:
path: /alive
port: 8080
periodSeconds: 15
failureThreshold: 3
initialDelaySeconds: 0
# initialDelaySeconds is 0 because the startup probe
# gates liveness checks until the app is ready

// Health check endpoints
func healthzHandler(w http.ResponseWriter, r *http.Request) {
// Basic process health. Returns 200 if the process is running.
w.WriteHeader(http.StatusOK)
w.Write([]byte("ok"))
}
func readyHandler(w http.ResponseWriter, r *http.Request) {
// Check all dependencies required to serve traffic
if err := db.PingContext(r.Context()); err != nil {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("database unreachable"))
return
}
if err := cache.Ping(r.Context()); err != nil {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("cache unreachable"))
return
}
w.WriteHeader(http.StatusOK)
w.Write([]byte("ready"))
}
func aliveHandler(w http.ResponseWriter, r *http.Request) {
// Only fail if the process is fundamentally broken.
// Do NOT check external dependencies here.
select {
case <-deadlockDetector:
w.WriteHeader(http.StatusInternalServerError)
w.Write([]byte("deadlock detected"))
default:
w.WriteHeader(http.StatusOK)
w.Write([]byte("alive"))
}
}

Deployment Strategy Decision Matrix
Choosing a strategy depends on the constraints of the system: traffic volume, acceptable risk, infrastructure budget, and team operational maturity.
Decision Criteria Matrix
Constraint                     Recommended strategy
──────────────────────         ────────────────────
Budget-constrained             Rolling update
Need instant rollback          Blue-green
Gradual risk reduction         Canary
Decouple deploy from release   Feature flags
Database schema changes        Expand-contract + any of the above
Stateful services              Blue-green with connection draining
Multiple coupled services      Feature flags + canary
Regulatory/compliance needs    Blue-green (clear audit trail)
Risk vs. Complexity Comparison
               Low complexity ────────────────▶ High complexity
              ┌──────────────────────────────────────────────┐
Low risk      │                                              │
   ▲          │  Feature flags          Canary + automated   │
   │          │  + canary               analysis + flags     │
   │          │                                              │
   │          │  Canary with            Blue-green + canary  │
   │          │  manual gates           + feature flags      │
   │          │                                              │
   │          │  Blue-green             Rolling + blue-green │
   │          │                         (multi-region)       │
   │          │                                              │
   │          │  Rolling update         Recreate (scheduled  │
High risk     │                         maintenance window)  │
   ▼          │                                              │
              └──────────────────────────────────────────────┘
Production Deployment Checklist
A minimal set of verification steps before, during, and after deployment:
Pre-deployment:
- Database migration is backward-compatible with current running version
- Health check endpoints are implemented and tested
- Rollback procedure is documented and tested within the last 30 days
- Feature flags are configured for any user-facing changes
- Alerting thresholds are set for error rate, latency, and saturation
During deployment:
- Monitor error rate delta between canary and stable (threshold: < 0.1% difference)
- Monitor p99 latency delta (threshold: < 50ms increase)
- Monitor CPU and memory utilization on new pods (threshold: < 80%)
- Verify database connection pool is not exhausted
- Confirm no increase in 5xx responses from downstream dependencies
Post-deployment:
- Verify all pods are running the expected version
- Confirm old ReplicaSet has scaled to zero
- Run smoke tests against production endpoints
- Check that async job processors are consuming from the correct queues
- Verify CDN cache invalidation completed if static assets changed
Metrics to Track
The following metrics provide observability into deployment health:
Metric                            Source               Alert threshold
──────                            ──────               ───────────────
Deployment frequency              CI/CD pipeline       N/A (track trend)
Lead time for changes             Git + CI/CD          > 24 hours
Change failure rate               Incident tracker     > 5%
Mean time to recovery (MTTR)      Incident tracker     > 30 minutes
Rollback count (weekly)           CI/CD pipeline       > 2 per week
Deployment duration               Argo / k8s events    > 15 minutes
Pod restart count (post-deploy)   Kubernetes metrics   > 0
Error rate delta (deploy)         Prometheus           > 0.1%
p99 latency delta (deploy)        Prometheus           > 50ms
These align with the DORA metrics framework. Organizations performing in the "elite" category deploy multiple times per day with a change failure rate below 5% and MTTR under 1 hour.
Summary of Key Numbers
- Blue-green rollback time: < 2 seconds
- Canary minimum observation window: 5 minutes per weight step
- Feature flag cache TTL: 30 seconds (balances freshness vs. load)
- Pre-stop sleep for Kubernetes pods: 5 seconds
- Graceful shutdown timeout: 30 to 60 seconds
- Database backfill batch size: 5,000 to 10,000 rows
- Maximum concurrent feature flags: fewer than 50
- Feature flag maximum lifespan: 90 days
- Health check readiness probe interval: 5 seconds
- Liveness probe failure threshold: 3 (avoid restart loops)