Zero-Downtime Deployments: Blue-Green, Canary, and Feature Flags

April 12, 2024

The Cost of Downtime

Downtime during deployments is measurable. A widely cited Gartner estimate puts the average cost of IT downtime at $5,600 per minute for mid-size enterprises. For a deployment that causes 3 minutes of full unavailability, that is $16,800 per deploy. Organizations deploying daily accumulate over $6 million in annual downtime cost from deployments alone.

The technical causes are well understood:

  • Process restart gaps where no instance is serving traffic
  • Database schema locks blocking read/write operations
  • Connection pool exhaustion during instance rotation
  • DNS propagation delays when switching backends
  • Health check failures during application warm-up

Each of these is solvable. The solutions fall into a small number of deployment strategies, each with distinct tradeoffs in complexity, resource cost, and rollback speed.

Deployment Strategy Overview

Strategy Comparison Matrix

                    Blue-Green    Canary        Rolling       Recreate
                    ──────────    ──────        ───────       ────────
Downtime            Zero          Zero          Zero          Yes
Resource overhead   2x            1x + canary   1x + 1 pod    1x
Rollback speed      Seconds       Seconds       Minutes       Minutes
Infrastructure      Moderate      High          Low           Low
complexity
Database coupling   High          High          High          Low
Traffic control     Binary        Granular      None          None
Cost (relative)     1.8x          1.1x          1.0x          1.0x

Blue-Green Deployments

Blue-green deployment maintains two identical production environments. At any time, one environment (blue) serves all production traffic while the other (green) sits idle or serves as a staging target.

Blue-Green Traffic Flow

    Load Balancer / Ingress
           β”‚
           β”‚  100% traffic
           β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚   BLUE      β”‚         β”‚   GREEN     β”‚
    β”‚   (v1.2)    β”‚         β”‚   (v1.3)    β”‚
    β”‚   ACTIVE    β”‚         β”‚   IDLE      β”‚
    β”‚             β”‚         β”‚             β”‚
    β”‚  Pod 1      β”‚         β”‚  Pod 1      β”‚
    β”‚  Pod 2      β”‚         β”‚  Pod 2      β”‚
    β”‚  Pod 3      β”‚         β”‚  Pod 3      β”‚
    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
           β”‚                       β”‚
           β–Ό                       β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚         Shared Database             β”‚
    β”‚         (must be compatible         β”‚
    β”‚          with both versions)        β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜


After switch:

    Load Balancer / Ingress
           β”‚
           β”‚  100% traffic
           β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚   BLUE      β”‚         β”‚   GREEN     β”‚
    β”‚   (v1.2)    β”‚         β”‚   (v1.3)    β”‚
    β”‚   IDLE      β”‚         β”‚   ACTIVE    β”‚
    β”‚             β”‚         β”‚             β”‚
    β”‚  Pod 1      β”‚         β”‚  Pod 1      β”‚
    β”‚  Pod 2      β”‚         β”‚  Pod 2      β”‚
    β”‚  Pod 3      β”‚         β”‚  Pod 3      β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Kubernetes Implementation

A basic blue-green deployment in Kubernetes uses two Deployments and a Service selector switch.

# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    app: myapp
    slot: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      slot: blue
  template:
    metadata:
      labels:
        app: myapp
        slot: blue
        version: v1.2.0
    spec:
      containers:
        - name: app
          image: registry.example.com/myapp:v1.2.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 3
            failureThreshold: 3
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi

# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    app: myapp
    slot: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      slot: green
  template:
    metadata:
      labels:
        app: myapp
        slot: green
        version: v1.3.0
    spec:
      containers:
        - name: app
          image: registry.example.com/myapp:v1.3.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 3
            failureThreshold: 3
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    slot: blue    # Change to "green" to switch traffic
  ports:
    - port: 80
      targetPort: 8080

The switch is a single kubectl patch command:

# Switch traffic from blue to green
kubectl patch service myapp \
  -p '{"spec":{"selector":{"slot":"green"}}}'
 
# Rollback: switch back to blue
kubectl patch service myapp \
  -p '{"spec":{"selector":{"slot":"blue"}}}'

Rollback time with this approach is under 2 seconds, limited only by kube-proxy propagation.

Blue-Green Limitations

The primary constraint is resource cost. Two full environments double compute expenses during the deployment window. For a service running 20 pods at 1 vCPU each, a blue-green deployment requires 40 vCPUs reserved. If deployments happen once per day and take 30 minutes of validation before the idle environment is scaled down, the overhead is (20 vCPUs * 0.5 hours * 30 days) = 300 vCPU-hours per month.

The second constraint is database compatibility. Both environments share the same database, so schema changes must be backward-compatible across both versions simultaneously.

Canary Deployments

Canary deployments route a small percentage of traffic to the new version, then gradually increase that percentage while monitoring error rates, latency, and business metrics.

Canary Traffic Progression Timeline

Time     Traffic Split         Error Rate    Action
─────    ─────────────         ──────────    ──────
T+0      v1: 100%  v2: 0%     0.01%         Deploy canary
T+5m     v1: 95%   v2: 5%     0.01%         Monitor
T+15m    v1: 90%   v2: 10%    0.02%         Within threshold
T+30m    v1: 75%   v2: 25%    0.02%         Promote
T+45m    v1: 50%   v2: 50%    0.01%         Promote
T+60m    v1: 25%   v2: 75%    0.01%         Promote
T+75m    v1: 0%    v2: 100%   0.01%         Complete


Canary Architecture

                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚       Ingress Controller      β”‚
                  β”‚    (traffic splitting logic)   β”‚
                  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚              β”‚
                    95% traffic     5% traffic
                         β”‚              β”‚
                         β–Ό              β–Ό
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚  Stable    β”‚  β”‚  Canary    β”‚
                  β”‚  (v1.2)    β”‚  β”‚  (v1.3)    β”‚
                  β”‚            β”‚  β”‚            β”‚
                  β”‚  19 pods   β”‚  β”‚  1 pod     β”‚
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚              β”‚
                         β–Ό              β–Ό
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚    Metrics / Observability    β”‚
                  β”‚  (Prometheus, Datadog, etc.)  β”‚
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Argo Rollouts Canary Configuration

Argo Rollouts provides a Kubernetes-native canary deployment controller with automated analysis and promotion.

# canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: myapp
  strategy:
    canary:
      canaryService: myapp-canary
      stableService: myapp-stable
      trafficRouting:
        istio:
          virtualService:
            name: myapp-vsvc
            routes:
              - primary
      analysis:
        templates:
          - templateName: success-rate
          - templateName: latency-check
        startingStep: 2
        args:
          - name: service-name
            value: myapp-canary.default.svc.cluster.local
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 75
        - pause: { duration: 5m }
        - setWeight: 100
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: registry.example.com/myapp:v1.3.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 3
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10

Automated Canary Analysis

Argo Rollouts supports AnalysisTemplate resources that query Prometheus (or other backends) to determine whether the canary is healthy.

# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.995
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}",
                status=~"2.."
              }[2m]
            )) /
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}"
              }[2m]
            ))
 
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  args:
    - name: service-name
  metrics:
    - name: p99-latency
      interval: 60s
      successCondition: result[0] <= 500
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(
                http_request_duration_milliseconds_bucket{
                  service="{{args.service-name}}"
                }[2m]
              )) by (le)
            )

When the success rate drops below 99.5% or p99 latency exceeds 500ms across 3 consecutive checks, Argo Rollouts automatically aborts the rollout and scales the canary to zero.

Istio Traffic Splitting

For fine-grained traffic control, Istio VirtualService resources define exact traffic weights.

# istio-virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vsvc
spec:
  hosts:
    - myapp.example.com
  gateways:
    - myapp-gateway
  http:
    - name: primary
      route:
        - destination:
            host: myapp-stable
            port:
              number: 80
          weight: 95
        - destination:
            host: myapp-canary
            port:
              number: 80
          weight: 5
# istio-destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp.example.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary

Header-Based Canary Routing

For internal testing before exposing the canary to real users, route traffic based on request headers:

# header-based-routing.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vsvc
spec:
  hosts:
    - myapp.example.com
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: myapp-canary
            port:
              number: 80
    - route:
        - destination:
            host: myapp-stable
            port:
              number: 80

This allows QA teams to test the canary in production by adding x-canary: true to their requests while all other traffic continues to reach the stable version.

Feature Flags

Feature flags decouple deployment from release. Code ships to production in a disabled state, then gets enabled for specific users, percentages, or conditions without redeployment.

Feature Flag Deployment Model

    Deploy v1.3 (feature disabled)     Enable for 5% of users
    ─────────────────────────────      ─────────────────────
           β”‚                                  β”‚
           β–Ό                                  β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  All Pods   β”‚                   β”‚  All Pods       β”‚
    β”‚  run v1.3   β”‚                   β”‚  run v1.3       β”‚
    β”‚             β”‚                   β”‚                 β”‚
    β”‚  Feature X: β”‚                   β”‚  Feature X:     β”‚
    β”‚  OFF        β”‚                   β”‚  5% of users ON β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚                                  β”‚
    No user sees                       Flag service
    the feature                        controls rollout

    Enable for 50%         Enable for 100%       Remove flag
    ───────────────        ────────────────       ───────────
           β”‚                      β”‚                     β”‚
           β–Ό                      β–Ό                     β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Feature X:  β”‚       β”‚  Feature X:  β”‚      β”‚  Feature X:  β”‚
    β”‚  50% ON      β”‚       β”‚  100% ON     β”‚      β”‚  Code cleanedβ”‚
    β”‚              β”‚       β”‚              β”‚      β”‚  Flag removedβ”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Implementation Pattern

// feature-flag-service.ts
interface FeatureFlag {
  name: string;
  enabled: boolean;
  percentage: number;      // 0-100
  allowlist: string[];     // user IDs always included
  blocklist: string[];     // user IDs always excluded
  metadata: Record<string, string>;
}

interface FlagProvider {
  getFlag(name: string): Promise<FeatureFlag | null>;
}

class FeatureFlagService {
  private cache: Map<string, { value: boolean; expiry: number }>;
  private readonly cacheTTL = 30_000; // 30 seconds

  constructor(private readonly provider: FlagProvider) {
    this.cache = new Map();
  }
 
  async isEnabled(
    flagName: string,
    userId: string,
    attributes: Record<string, string> = {}
  ): Promise<boolean> {
    const cacheKey = `${flagName}:${userId}`;
    const cached = this.cache.get(cacheKey);
 
    if (cached && cached.expiry > Date.now()) {
      return cached.value;
    }
 
    const flag = await this.provider.getFlag(flagName);
 
    if (!flag || !flag.enabled) return false;
    if (flag.blocklist.includes(userId)) return false;
    if (flag.allowlist.includes(userId)) return true;
 
    // Deterministic percentage based on user ID hash
    const hash = this.hashUserId(userId, flagName);
    const bucket = hash % 100;
    const result = bucket < flag.percentage;
 
    this.cache.set(cacheKey, {
      value: result,
      expiry: Date.now() + this.cacheTTL,
    });
 
    return result;
  }
 
  private hashUserId(userId: string, salt: string): number {
    let hash = 0;
    const input = `${userId}:${salt}`;
    for (let i = 0; i < input.length; i++) {
      const char = input.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash);
  }
}

// usage in request handler
async function handleRequest(req: Request): Promise<Response> {
  const userId = req.headers.get("x-user-id") ?? "anonymous";
 
  const useNewCheckout = await featureFlags.isEnabled(
    "new-checkout-flow",
    userId,
    { region: req.headers.get("x-region") ?? "us-east-1" }
  );
 
  if (useNewCheckout) {
    return newCheckoutHandler(req);
  }
 
  return legacyCheckoutHandler(req);
}

Feature Flag Lifecycle

Feature flags require active management. Stale flags accumulate as technical debt. A typical lifecycle:

Flag Lifecycle Timeline

Week 0      Week 1-2     Week 3-4     Week 5-6     Week 8
──────      ────────     ────────     ────────     ──────
Create      Ramp up      Full         Cleanup      Flag
flag        1% -> 50%    rollout      period       removed
  β”‚            β”‚           β”‚            β”‚            β”‚
  β–Ό            β–Ό           β–Ό            β–Ό            β–Ό
Code        Monitor      100% of      Remove       Dead code
merged      metrics      users on     flag checks  removed
with flag   and errors   new path     from code    from codebase

Organizations with more than 200 active feature flags at any time typically report increased incident rates from flag interaction bugs. A reasonable target is fewer than 50 active flags, with a maximum lifespan of 90 days per flag.

Database Migrations in Zero-Downtime Deployments

Database schema changes are the most common source of deployment-related downtime. An ALTER TABLE ... ADD COLUMN that forces a table rewrite (for example, adding a column with a volatile default, or any non-null default before PostgreSQL 11) can hold an ACCESS EXCLUSIVE lock for 30+ seconds on a table with 50 million rows, blocking all reads and writes.

The Expand-Contract Pattern

Safe schema changes follow a multi-phase approach across multiple deployments.

Expand-Contract Migration Timeline

Deploy 1 (Expand)         Deploy 2 (Migrate)       Deploy 3 (Contract)
─────────────────         ──────────────────        ───────────────────

Add new column            Backfill data             Drop old column
(nullable, no default)    Write to both columns     Remove old code paths
Deploy code that          Read from new column
writes to BOTH columns   Fall back to old column
Read from OLD column

Database State:

Deploy 1:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  users table                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  id    β”‚  email    β”‚  email_verified  β”‚   β”‚
β”‚  β”‚  (old) β”‚  (old)    β”‚  (NEW, nullable) β”‚   β”‚
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€   β”‚
β”‚  β”‚  1     β”‚  a@b.com  β”‚  NULL            β”‚   β”‚
β”‚  β”‚  2     β”‚  c@d.com  β”‚  NULL            β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Deploy 2:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  users table                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  id    β”‚  email    β”‚  email_verified  β”‚   β”‚
β”‚  β”‚  (old) β”‚  (old)    β”‚  (backfilled)    β”‚   β”‚
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€   β”‚
β”‚  β”‚  1     β”‚  a@b.com  β”‚  true            β”‚   β”‚
β”‚  β”‚  2     β”‚  c@d.com  β”‚  false           β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Deploy 3:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  users table                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  id    β”‚  email    β”‚  email_verified  β”‚   β”‚
β”‚  β”‚        β”‚           β”‚  (NOT NULL)      β”‚   β”‚
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€   β”‚
β”‚  β”‚  1     β”‚  a@b.com  β”‚  true            β”‚   β”‚
β”‚  β”‚  2     β”‚  c@d.com  β”‚  false           β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Safe Migration Examples

Adding a column (PostgreSQL):

-- Deploy 1: Add nullable column (brief lock, no table rewrite)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

-- Deploy 2: Backfill in batches (short row locks, runs in background)
UPDATE users
SET email_verified = false
WHERE email_verified IS NULL
  AND id BETWEEN 1 AND 10000;

-- Repeat for each batch of 10,000 rows
-- Batch size depends on table write rate and acceptable replication lag

-- Deploy 3: Add NOT NULL constraint
-- PostgreSQL 12+ skips the full-table scan during SET NOT NULL when a
-- validated CHECK constraint already guarantees non-null values
ALTER TABLE users
  ADD CONSTRAINT users_email_verified_not_null
  CHECK (email_verified IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_email_verified_not_null;
ALTER TABLE users
  ALTER COLUMN email_verified SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_verified_not_null;

Renaming a column:

-- UNSAFE: This breaks all running application instances immediately
ALTER TABLE users RENAME COLUMN email TO email_address;
 
-- SAFE: Three-deploy approach
-- Deploy 1: Add new column, dual-write
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
CREATE OR REPLACE FUNCTION sync_email_columns()
  RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
    IF NEW.email_address IS NULL THEN
      NEW.email_address := NEW.email;
    END IF;
    IF NEW.email IS NULL THEN
      NEW.email := NEW.email_address;
    END IF;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER trg_sync_email
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION sync_email_columns();
 
-- Deploy 2: Backfill (batched in production, as above), switch reads to new column
UPDATE users SET email_address = email WHERE email_address IS NULL;
 
-- Deploy 3: Drop old column and trigger
DROP TRIGGER trg_sync_email ON users;
DROP FUNCTION sync_email_columns();
ALTER TABLE users DROP COLUMN email;

Large Table Migration Benchmarks

Migration performance varies by table size and operation type. These benchmarks are from PostgreSQL 15 on a db.r6g.xlarge (4 vCPU, 32 GB RAM) RDS instance.

Operation                      10M rows    50M rows    200M rows
─────────────────────────────  ────────    ────────    ─────────
ADD COLUMN (nullable)          < 10ms      < 10ms      < 10ms
ADD COLUMN (with default*)     < 10ms      < 10ms      < 10ms
ADD NOT NULL constraint        2.1s        9.8s        41.3s
CREATE INDEX CONCURRENTLY      45s         3m 20s      14m 10s
Backfill (batch of 10k)        120ms       120ms       120ms
Full backfill (all batches)    2m 10s      10m 30s     43m 15s

* PostgreSQL 11+ does not rewrite the table for ADD COLUMN with a non-volatile DEFAULT

Connection Draining

When removing an old instance from a load balancer, in-flight requests must complete before the instance shuts down. Dropping active connections results in HTTP 502 errors visible to users.

Connection Draining Timeline

  Load Balancer marks instance as "draining"
          β”‚
          β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ Time ──────────────────────────────────────────▢  β”‚
  β”‚                                                   β”‚
  β”‚ New requests:  ───X  (stopped immediately)        β”‚
  β”‚                                                   β”‚
  β”‚ In-flight req 1: ════════╗                        β”‚
  β”‚                          β•šβ•β• completes (200 OK)   β”‚
  β”‚                                                   β”‚
  β”‚ In-flight req 2: ══════════════╗                  β”‚
  β”‚                                β•šβ•β• completes      β”‚
  β”‚                                                   β”‚
  β”‚ In-flight req 3: ═══════════════════════╗         β”‚
  β”‚                                         β•šβ•β• done  β”‚
  β”‚                                                   β”‚
  β”‚ Drain timeout (30s):  ─────────────────────────X  β”‚
  β”‚                                          SIGTERM  β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Kubernetes Graceful Shutdown

Kubernetes sends SIGTERM to a pod, then waits terminationGracePeriodSeconds before sending SIGKILL. The application must handle SIGTERM by stopping acceptance of new connections and completing in-flight work.

# pod-with-graceful-shutdown.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: registry.example.com/myapp:v1.3.0
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
- "sleep 5"
          # The preStop sleep gives the Endpoints controller
          # time to remove this pod from the Service before
          # the application starts rejecting connections.

The 5-second preStop sleep is important. Without it, there is a race condition: the kubelet sends SIGTERM to the container at the same time as the Endpoints controller updates the Service. If the application stops accepting connections before kube-proxy removes it from the Service, new requests route to a closed port and return 502.

Application-Level Drain Handling

// graceful shutdown in Go
package main
 
import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)
 
func main() {
    srv := &http.Server{
        Addr:         ":8080",
        Handler:      routes(),
        ReadTimeout:  10 * time.Second,
        WriteTimeout: 30 * time.Second,
        IdleTimeout:  60 * time.Second,
    }
 
    // Channel to listen for OS signals
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
 
    go func() {
        log.Printf("server starting on %s", srv.Addr)
        if err := srv.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatalf("server error: %v", err)
        }
    }()
 
    // Block until signal received
    sig := <-quit
    log.Printf("received signal %s, starting graceful shutdown", sig)
 
    // Create deadline context for shutdown
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
 
    // Shutdown stops accepting new connections and waits
    // for in-flight requests to complete
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("forced shutdown: %v", err)
    }
 
    log.Println("server stopped")
}

Rolling Deployments

Rolling deployments replace pods one at a time (or in small batches). This is the default Kubernetes deployment strategy.

Rolling Update Progression (5 replicas, maxSurge=1, maxUnavailable=0)

Step 1:  [v1] [v1] [v1] [v1] [v1] [v2 starting]
Step 2:  [v1] [v1] [v1] [v1] [v2]  ← v2 ready, terminate one v1
Step 3:  [v1] [v1] [v1] [v2] [v2 starting]
Step 4:  [v1] [v1] [v1] [v2] [v2]  ← terminate next v1
Step 5:  [v1] [v1] [v2] [v2] [v2 starting]
  ...
Step 9:  [v1] [v2] [v2] [v2] [v2]  ← terminate last v1
Step 10: [v2] [v2] [v2] [v2] [v2]  ← rollout complete

# rolling-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # At most 1 extra pod during update
      maxUnavailable: 0    # All existing pods must stay available
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: registry.example.com/myapp:v1.3.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
            successThreshold: 2
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
            # Allows up to 60 seconds for application startup

Rolling Update Tradeoffs

The rolling strategy has lower resource overhead (only 1 extra pod) but creates a period where both v1 and v2 serve traffic simultaneously. For APIs with breaking changes between versions, this mixed state can cause client errors. The total rollout time scales linearly with replica count.

Rollout Duration Estimates (maxSurge=1, maxUnavailable=0)

Replicas    Startup Time    Total Rollout
────────    ────────────    ─────────────
3           10s             ~45s
5           10s             ~75s
10          10s             ~2m 30s
20          10s             ~5m
50          10s             ~12m 30s
100         10s             ~25m

Setting maxSurge: 25% and maxUnavailable: 25% reduces rollout time by approximately 4x at the cost of temporary capacity reduction.

Health Checks and Readiness Gates

Proper health checks prevent traffic from reaching instances that are not ready to serve. Kubernetes distinguishes three probe types:

Probe Types and Their Effects

Probe         Failure Action              Use Case
─────         ──────────────              ────────
startup       Delay other probes          Slow-starting apps (JVM warm-up,
                                          cache loading, model loading)

readiness     Remove from Service         Temporarily unable to serve
              endpoints                   (DB connection lost, downstream
                                          dependency down)

liveness      Restart the container       Deadlocked process, corrupted
                                          state, unrecoverable error

A common misconfiguration: using liveness probes for transient failures. If a database is temporarily unreachable, a liveness probe failure restarts the container. But restarting does not fix the database. The result is a crash loop that amplifies the outage.

Correct approach: the readiness probe returns unhealthy when the database is unreachable (removing the pod from the Service), while the liveness probe only fails when the process itself is unresponsive.

# health-check-configuration.yaml
containers:
  - name: app
    image: registry.example.com/myapp:v1.3.0
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 2
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
      successThreshold: 2
    livenessProbe:
      httpGet:
        path: /alive
        port: 8080
      periodSeconds: 15
      failureThreshold: 3
      initialDelaySeconds: 0
      # initialDelaySeconds is 0 because the startup probe
      # gates liveness checks until the app is ready

// health-checks.go — handlers backing the three probe endpoints.
// Assumes package-level db (*sql.DB), a cache client with a Ping
// method, and a deadlockDetector channel are defined elsewhere.
package main

import "net/http"

// Health check endpoints
func healthzHandler(w http.ResponseWriter, r *http.Request) {
    // Basic process health. Returns 200 if the process is running.
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("ok"))
}
 
func readyHandler(w http.ResponseWriter, r *http.Request) {
    // Check all dependencies required to serve traffic
    if err := db.PingContext(r.Context()); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte("database unreachable"))
        return
    }
    if err := cache.Ping(r.Context()); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte("cache unreachable"))
        return
    }
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("ready"))
}
 
func aliveHandler(w http.ResponseWriter, r *http.Request) {
    // Only fail if the process is fundamentally broken.
    // Do NOT check external dependencies here.
    select {
    case <-deadlockDetector:
        w.WriteHeader(http.StatusInternalServerError)
        w.Write([]byte("deadlock detected"))
    default:
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("alive"))
    }
}
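The readiness behavior can be exercised without a cluster. This self-contained sketch (a simplified version of the handlers above, with the database check stubbed out as a boolean) shows the pod returning 503 while a dependency is down and recovering without a restart:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// dbHealthy stands in for a real dependency check such as db.PingContext.
var dbHealthy = true

func readyHandler(w http.ResponseWriter, r *http.Request) {
	if !dbHealthy {
		// 503 removes the pod from Service endpoints; the
		// container itself is NOT restarted.
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// probe simulates one readiness check and returns the status code.
func probe() int {
	rec := httptest.NewRecorder()
	readyHandler(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe()) // 200: pod is in rotation
	dbHealthy = false
	fmt.Println(probe()) // 503: pod leaves rotation, no restart
	dbHealthy = true
	fmt.Println(probe()) // 200: pod rejoins automatically
}
```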

Deployment Strategy Decision Matrix

Choosing a strategy depends on the constraints of the system: traffic volume, acceptable risk, infrastructure budget, and team operational maturity.

Decision Criteria Matrix

Constraint                    Recommended Strategy
──────────────────────        ────────────────────
Budget-constrained            Rolling update
Need instant rollback         Blue-green
Gradual risk reduction        Canary
Decouple deploy from release  Feature flags
Database schema changes       Expand-contract + any of the above
Stateful services             Blue-green with connection draining
Multiple coupled services     Feature flags + canary
Regulatory/compliance needs   Blue-green (clear audit trail)


Risk vs. Complexity Comparison

                      Low complexity ◄──────────────► High complexity
                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  Low risk            β”‚                                              β”‚
    β–²                 β”‚  Feature flags    Canary + automated         β”‚
    β”‚                 β”‚  + canary         analysis + feature flags   β”‚
    β”‚                 β”‚                                              β”‚
    β”‚                 β”‚  Canary with      Blue-green + canary        β”‚
    β”‚                 β”‚  manual gates     + feature flags            β”‚
    β”‚                 β”‚                                              β”‚
    β”‚                 β”‚  Blue-green       Rolling + blue-green       β”‚
    β”‚                 β”‚                   (multi-region)             β”‚
    β”‚                 β”‚                                              β”‚
    β”‚                 β”‚  Rolling update   Recreate (scheduled        β”‚
  High risk           β”‚                   maintenance window)        β”‚
    β–Ό                 β”‚                                              β”‚
                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Production Deployment Checklist

A minimal set of verification steps before, during, and after deployment:

Pre-deployment:

  • Database migration is backward-compatible with current running version
  • Health check endpoints are implemented and tested
  • Rollback procedure is documented and tested within the last 30 days
  • Feature flags are configured for any user-facing changes
  • Alerting thresholds are set for error rate, latency, and saturation

During deployment:

  • Monitor error rate delta between canary and stable (threshold: < 0.1% difference)
  • Monitor p99 latency delta (threshold: < 50ms increase)
  • Monitor CPU and memory utilization on new pods (threshold: < 80%)
  • Verify database connection pool is not exhausted
  • Confirm no increase in 5xx responses from downstream dependencies
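The first two gates above are mechanical enough to automate. A minimal sketch, assuming rates as fractions and latencies in milliseconds; the function name and signature are illustrative, not a real API:

```go
package main

import "fmt"

// canaryHealthy applies the two gate thresholds from the checklist:
// error-rate delta under 0.1 percentage points and p99 latency delta
// under 50ms.
func canaryHealthy(canaryErrRate, stableErrRate, canaryP99Ms, stableP99Ms float64) bool {
	errDelta := canaryErrRate - stableErrRate
	latDelta := canaryP99Ms - stableP99Ms
	return errDelta < 0.001 && latDelta < 50.0
}

func main() {
	// 0.15% vs 0.10% errors, 210ms vs 180ms p99: within both thresholds
	fmt.Println(canaryHealthy(0.0015, 0.0010, 210, 180)) // true
	// 0.30% vs 0.10% errors: the error-rate gate trips
	fmt.Println(canaryHealthy(0.0030, 0.0010, 210, 180)) // false
}
```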

Post-deployment:

  • Verify all pods are running the expected version
  • Confirm old ReplicaSet has scaled to zero
  • Run smoke tests against production endpoints
  • Check that async job processors are consuming from the correct queues
  • Verify CDN cache invalidation completed if static assets changed

Metrics to Track

The following metrics provide observability into deployment health:

Metric                          Source              Alert Threshold
──────                          ──────              ───────────────
Deployment frequency            CI/CD pipeline      N/A (track trend)
Lead time for changes           Git + CI/CD         > 24 hours
Change failure rate             Incident tracker    > 5%
Mean time to recovery (MTTR)    Incident tracker    > 30 minutes
Rollback count (weekly)         CI/CD pipeline      > 2 per week
Deployment duration             Argo / k8s events   > 15 minutes
Pod restart count (post-deploy) Kubernetes metrics  > 0
Error rate delta (deploy)       Prometheus          > 0.1%
p99 latency delta (deploy)      Prometheus          > 50ms

These align with the DORA metrics framework. Organizations performing in the "elite" category deploy multiple times per day with a change failure rate below 5% and MTTR under 1 hour.
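Change failure rate is simple to derive from pipeline data: failed deployments (those causing an incident or rollback) over total deployments. A hypothetical helper, for illustration:

```go
package main

import "fmt"

// changeFailureRate returns the fraction of deployments that caused
// an incident or required a rollback (the DORA definition).
func changeFailureRate(failedDeploys, totalDeploys int) float64 {
	if totalDeploys == 0 {
		return 0
	}
	return float64(failedDeploys) / float64(totalDeploys)
}

func main() {
	// 3 failures across 80 deploys: 3.75%, under the 5% alert threshold
	fmt.Printf("%.2f%%\n", changeFailureRate(3, 80)*100)
}
```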

Summary of Key Numbers

  • Blue-green rollback time: < 2 seconds
  • Canary minimum observation window: 5 minutes per weight step
  • Feature flag cache TTL: 30 seconds (balances freshness vs. load)
  • Pre-stop sleep for Kubernetes pods: 5 seconds
  • Graceful shutdown timeout: 30 to 60 seconds
  • Database backfill batch size: 5,000 to 10,000 rows
  • Maximum concurrent feature flags: fewer than 50
  • Feature flag maximum lifespan: 90 days
  • Health check readiness probe interval: 5 seconds
  • Liveness probe failure threshold: 3 (avoid restart loops)
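Several of these numbers map directly onto pod spec fields. A sketch of the shutdown-related settings (pre-stop sleep plus the graceful termination window), assuming the container image ships a shell:

```yaml
spec:
  terminationGracePeriodSeconds: 60   # graceful shutdown timeout
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # Sleep 5s so the endpoint controller removes the pod
            # from Service endpoints before SIGTERM is delivered
            command: ["sh", "-c", "sleep 5"]
```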