Observability Engineering: From Logs to Traces to Understanding

July 29, 2023

Monitoring answers whether a system is functioning. Observability answers why it is not. The distinction matters in distributed architectures where failures cascade across service boundaries and root causes are rarely local to the component that surfaces the error.

This document covers the core engineering practices for building observable systems: structured logging, distributed tracing, metrics with controlled cardinality, and SLO-based alerting. The tooling references center on the Grafana/Prometheus/OpenTelemetry ecosystem, though the principles apply regardless of vendor.

Signal Hierarchy

Logs, metrics, and traces are commonly described as the "three pillars of observability." In practice, their utility during incident response is not equal.

  • Traces reveal the execution path of a specific request across services. They are the most useful signal for diagnosing latency and failure propagation.
  • Metrics provide aggregate behavior over time. They are the primary mechanism for detecting anomalies and triggering alerts.
  • Logs provide arbitrary detail for cases where traces and metrics lack sufficient context. At scale, log-based debugging without trace correlation is operationally expensive.

The debugging workflow differs substantially depending on which signals are available.

  Debugging with logs only:

  +---------+     +---------+     +----------+     +---------+
  | Alert   | --> | Search  | --> | Find one | --> | Attempt |
  | fires   |     | through |     | relevant |     | to      |
  |         |     | 500 GB  |     | log line |     | identify|
  |         |     | of logs |     | (maybe)  |     | service |
  +---------+     +---------+     +----------+     +---------+
                                                   Time: ~2 hours

  Debugging with traces + metrics + logs:

  +---------+     +-----------+     +---------+     +---------+
  | Alert   | --> | Open      | --> | Click   | --> | See the |
  | fires   |     | trace for |     | into    |     | exact   |
  |         |     | failing   |     | slow    |     | log     |
  |         |     | request   |     | span    |     | line    |
  +---------+     +-----------+     +---------+     +---------+
                                                   Time: ~5 minutes

The difference in mean time to resolution (MTTR) between these two workflows is typically an order of magnitude. Organizations that deploy distributed tracing commonly report MTTR reductions from hours to minutes.

Structured Logging

Unstructured log messages are effectively opaque to automated processing. When services number in the dozens or hundreds, each with different log formats, programmatic correlation becomes impractical.

// Unstructured: not machine-parseable, not correlatable
console.log(`User ${userId} placed order ${orderId} for $${amount}`);
 
// Structured: every field is queryable, filterable, and correlatable
logger.info({
  event: 'order_placed',
  userId,
  orderId,
  amount,
  currency: 'USD',
  paymentMethod: 'stripe',
  traceId: context.traceId,
  spanId: context.spanId,
  latencyMs: Date.now() - startTime,
});

Structured Logging Requirements

  1. Every log entry must include a trace ID. Without trace-to-log correlation, there is no way to associate a log entry with the distributed trace that produced it. Grafana Loki and Tempo support native linking between log entries and traces via the trace ID field.

  2. Field names must be consistent across all services. If one service uses userId, another uses user_id, and a third uses uid, cross-service queries in Loki require unions across all variants. Standardize field names in a shared schema document and enforce compliance through linting or code review.

  3. Log at service boundaries, not inside tight loops. Placing logger.debug() inside a loop iterating over 100,000 items generates log volume that can saturate the log aggregation pipeline. Log on request entry, request exit, and error conditions.

  4. Include business context. A log entry reading "Connection timeout after 30s" is less actionable than "Connection timeout after 30s, order_id=ord_8x7k2m, user_id=usr_4n9p1q, region=us-east-1."
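
These requirements can be enforced with a thin wrapper around the logger. The sketch below is illustrative: the logEvent name, the TraceContext shape, and the timestamp field are assumptions, not a specific library's API.

```typescript
// Minimal structured-log wrapper enforcing the requirements above.
interface TraceContext {
  traceId: string;
  spanId: string;
}

type LogFields = Record<string, string | number | boolean>;

function logEvent(
  ctx: TraceContext,
  event: string,
  fields: LogFields = {},
): string {
  // Requirement 1: every entry carries the trace and span IDs.
  // Requirement 2: field names come from one shared vocabulary.
  const entry = {
    ts: new Date().toISOString(),
    event,
    traceId: ctx.traceId,
    spanId: ctx.spanId,
    ...fields,
  };
  const line = JSON.stringify(entry);
  console.log(line);
  return line;
}
```

Because every entry is JSON with a guaranteed traceId field, Loki's derived-field regex (shown later in the Grafana datasource configuration) can link each line to its trace.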

Logging Levels and Their Intended Use

  Level   Purpose                                        Production sampling
  ------  ---------------------------------------------  -----------------------
  error   Unrecoverable failures requiring attention     100%, always retained
  warn    Degraded operation, recoverable conditions     100%, always retained
  info    Request lifecycle events, state transitions    100% or sampled at 50%
  debug   Detailed internal state for development use    Sampled at 1-10%
  trace   Extremely verbose, per-iteration output        Disabled in production
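
The sampling column translates directly into a per-level emit decision. A minimal sketch, with the table's example rates hard-coded (a real deployment would load them from configuration):

```typescript
// Per-level sampling decision, using the example rates from the table above.
type Level = 'error' | 'warn' | 'info' | 'debug' | 'trace';

const SAMPLE_RATE: Record<Level, number> = {
  error: 1.0,  // always retained
  warn: 1.0,   // always retained
  info: 0.5,   // sampled at 50%
  debug: 0.05, // sampled at 1-10%
  trace: 0.0,  // disabled in production
};

// random is injectable so the decision is testable; defaults to Math.random.
function shouldEmit(level: Level, random: () => number = Math.random): boolean {
  return random() < SAMPLE_RATE[level];
}
```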

Log Aggregation Pipeline Architecture

  Log Aggregation Flow:

  +-------------+     +-------------+     +-----------+
  | Service A   |---->|             |     |           |
  | (stdout)    |     |             |     |           |
  +-------------+     |  Fluent Bit |     |   Loki    |
                      |  or         |---->|   (or     |
  +-------------+     |  Fluentd    |     |  Elastic) |
  | Service B   |---->|             |     |           |
  | (stdout)    |     | (shipper)   |     | (store)   |
  +-------------+     +------+------+     +-----+-----+
                             |                  |
  +-------------+            |            +-----+-----+
  | Service C   |------+-----+            |  Grafana  |
  | (stdout)    |                         |  (query)  |
  +-------------+                         +-----------+

  Failure modes at each stage:

  1. Service:  Excessive log volume     --> disk pressure, OOM
  2. Shipper:  Buffer overflow          --> dropped log entries
  3. Network:  Backpressure             --> shipper blocks, app slows
  4. Store:    Ingestion rate exceeded  --> HTTP 429, data loss
  5. Query:    Expensive query          --> timeout, no results

A known failure pattern: the log shipper (Fluent Bit) buffers to disk when the storage backend (Loki) returns HTTP 429 rate-limit responses. If disk buffer limits are not configured, the buffer fills the node disk, triggering kubelet eviction of pods, which generates additional log volume, increasing backpressure further. This is a positive feedback loop. The mitigation is to configure hard disk buffer limits on the shipper so it drops logs rather than filling the disk. Dropping debug-level logs under backpressure is preferable to node-level disruption.
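
The mitigation can be sketched as a hard-bounded in-memory buffer that sheds debug entries first. The class below is illustrative of the policy, not how Fluent Bit implements it (Fluent Bit uses its storage.* limits); the shapes and drop order are assumptions.

```typescript
// Hard-bounded buffer: never exceeds maxEntries, preferring to drop
// debug-level entries rather than block or grow without limit.
interface BufferedLog {
  level: 'error' | 'warn' | 'info' | 'debug';
  line: string;
}

class BoundedLogBuffer {
  private entries: BufferedLog[] = [];
  public dropped = 0;

  constructor(private readonly maxEntries: number) {}

  push(entry: BufferedLog): void {
    if (this.entries.length < this.maxEntries) {
      this.entries.push(entry);
      return;
    }
    // Buffer full: evict an existing debug entry to make room;
    // if none exists, discard the incoming entry instead.
    const idx = this.entries.findIndex((e) => e.level === 'debug');
    if (idx >= 0) {
      this.entries.splice(idx, 1);
      this.entries.push(entry);
    }
    this.dropped++;
  }

  size(): number {
    return this.entries.length;
  }
}
```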

Fluent Bit Configuration for Kubernetes

# fluent-bit configmap
[SERVICE]
    Flush         5
    Log_Level     info
    Daemon        off
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_Port     2020
    storage.path  /var/log/flb-storage/
    storage.sync  normal
    storage.backlog.mem_limit 50M
 
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Parser            cri
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     10MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    storage.type      filesystem
 
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_Tag_Prefix     kube.var.log.containers.
    Merge_Log           On
    K8S-Logging.Parser  On
 
[FILTER]
    Name    grep
    Match   kube.*
    Exclude log ^$
 
[OUTPUT]
    Name          loki
    Match         kube.*
    Host          loki-gateway.monitoring.svc
    Port          80
    Labels        job=fluent-bit, namespace=$kubernetes['namespace_name'], pod=$kubernetes['pod_name']
    Label_Keys    $level
    BatchWait     1
    BatchSize     1048576
    LineFormat    json
    AutoKubernetesLabels off

Distributed Tracing Architecture

A trace represents a single request's path through a distributed system. Each unit of work is represented as a span. Spans have parent-child relationships that form a directed acyclic graph, typically visualized as a waterfall.

  Trace Waterfall: POST /api/orders
  Trace ID: 7f3a8b2c-d4e5-4f6a-b7c8-9d0e1f2a3b4c

  |<---------- 312ms total ---------->|
  |                                    |
  [API Gateway] POST /orders ============================== 312ms
    |
    |- [Auth Service] validateToken === 12ms
    |   |
    |   '- [Redis] GET session:tok_9x2 = 1ms
    |
    |- [Order Service] createOrder ======================= 280ms
    |   |
    |   |- [PostgreSQL] INSERT orders ======= 28ms
    |   |
    |   |- [Inventory Service] reserveItems ========= 62ms
    |   |   |
    |   |   |- [PostgreSQL] SELECT inventory ==== 18ms
    |   |   |
    |   |   '- [Redis] DECR inventory:sku_841 = 2ms
    |   |
    |   '- [Payment Service] charge ================= 180ms
    |       |
    |       '- [Stripe API] POST /v1/charges ===== 172ms
    |                                     External API call.
    |                                     Dominates total latency.
    |                                     Options: caching, async
    |                                     processing, timeout tuning.
    |
    '- [Notification Service] sendConfirmation = 5ms (async)
        |
        '- [SES] SendEmail = 45ms (async, not on critical path)

This waterfall immediately identifies that 172ms of the 312ms total is spent in the Stripe API call. The payment service itself adds only 8ms of overhead. This distinction, whether the bottleneck is internal or external, is visible in seconds rather than requiring log correlation across multiple services.

OpenTelemetry SDK Setup

OpenTelemetry is the CNCF standard for instrumentation. It provides a vendor-neutral API and SDK with exporters for Jaeger, Tempo, Zipkin, and commercial backends.

// tracing.ts - OpenTelemetry initialization
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
  SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
} from '@opentelemetry/semantic-conventions';
 
const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'order-service',
    [SEMRESATTRS_SERVICE_VERSION]: '1.4.2',
    [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'grpc://otel-collector.monitoring.svc:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'grpc://otel-collector.monitoring.svc:4317',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingPaths: ['/healthz', '/readyz', '/metrics'],
      },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
      '@opentelemetry/instrumentation-grpc': { enabled: true },
    }),
  ],
});
 
sdk.start();
 
process.on('SIGTERM', () => {
  sdk.shutdown().then(() => process.exit(0));
});

Manual Span Creation

Auto-instrumentation covers HTTP, gRPC, and database calls. Business logic operations require manual spans.

import { trace, SpanStatusCode } from '@opentelemetry/api';
 
const tracer = trace.getTracer('order-service', '1.4.2');
 
async function createOrder(req: OrderRequest): Promise<Order> {
  return tracer.startActiveSpan('createOrder', async (span) => {
    try {
      span.setAttribute('order.items_count', req.items.length);
      span.setAttribute('order.total_cents', req.totalCents);
      span.setAttribute('order.currency', req.currency);
      span.setAttribute('user.id', req.userId);
      span.setAttribute('user.tier', req.userTier);
 
      const order = await db.insert('orders', req);
      span.setAttribute('order.id', order.id);
 
      await tracer.startActiveSpan('reserveInventory', async (invSpan) => {
        try {
          invSpan.setAttribute('inventory.sku_count', req.items.length);
          for (const item of req.items) {
            invSpan.addEvent('reserving_item', {
              'item.sku': item.sku,
              'item.quantity': item.quantity,
            });
          }
          await inventoryService.reserve(req.items);
        } finally {
          // End the span even when reserve() throws; otherwise it leaks.
          invSpan.end();
        }
      });
 
      await tracer.startActiveSpan('processPayment', async (paySpan) => {
        try {
          paySpan.setAttribute('payment.method', req.paymentMethod);
          paySpan.setAttribute('payment.provider', 'stripe');
          const charge = await paymentService.charge(req);
          paySpan.setAttribute('payment.charge_id', charge.id);
        } finally {
          paySpan.end();
        }
      });
 
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message,
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Context Propagation

Tracing requires trace context to propagate across service boundaries. Without propagation, traces fragment into disconnected spans. OpenTelemetry propagates context automatically for HTTP calls via the W3C traceparent header. Other transport mechanisms require explicit handling.

  Context Propagation Across Transports:

  Service A                    Service B                    Service C
  +--------+                   +--------+                   +--------+
  | Span 1 |--- HTTP req ----->| Span 2 |--- gRPC req ----->| Span 3 |
  |        |  traceparent:     |        |  traceparent:     |        |
  |        |  00-7f3a8b2c-     |        |  00-7f3a8b2c-     |        |
  |        |  span1id-01       |        |  span2id-01       |        |
  +--------+                   +--------+                   +--------+
      |
      |         Kafka message
      |  +----------------------+
      |  | headers:             |
      |  |   traceparent: ...   |
      |  | body: { ... }        |
      |  +----------------------+
      |            |
      |            v
      |       +---------+
      |       | Worker  |
      |       | Span 4  |  <-- same trace ID, new span ID
      |       +---------+
      |
      All spans share trace ID 7f3a8b2c.

Transport-specific propagation requirements:

  • Kafka: Embed trace context in Kafka message headers. The @opentelemetry/instrumentation-kafkajs package handles this automatically for KafkaJS.
  • SQS/SNS: Place trace context in message attributes. AWS X-Ray headers can coexist with W3C traceparent.
  • Cron jobs and async workers: No incoming request context exists. Create a new root span. If the job was triggered by a prior request, use span links to associate the new trace with the originating trace.
  • Thread/goroutine pools: Context does not transfer across threads automatically. In Node.js, AsyncLocalStorage handles this when the OTel SDK is initialized before application code.
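
For transports with no auto-instrumentation, the traceparent header can be encoded and decoded by hand. A minimal sketch following the W3C Trace Context format (version-traceId-parentSpanId-flags); in production code the OpenTelemetry propagation API should be used instead of hand-rolled parsing.

```typescript
// Encode/parse the W3C traceparent header:
//   00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
interface TraceParent {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

function formatTraceparent(tp: TraceParent): string {
  const flags = tp.sampled ? '01' : '00';
  return `00-${tp.traceId}-${tp.spanId}-${flags}`;
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return {
    traceId: m[1],
    spanId: m[2],
    sampled: (parseInt(m[3], 16) & 0x01) === 1, // bit 0 = sampled flag
  };
}
```

A producer writes formatTraceparent(...) into a Kafka header or SQS message attribute; the consumer parses it and starts its span with the same trace ID and a new span ID.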

Sampling Strategies

At non-trivial request volumes, storing 100% of traces is neither economical nor necessary. A service processing 10,000 requests per second at an average of 5 spans per trace with 1 KB per span generates approximately 50 MB/s of trace data, or 4.3 TB per day.
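
The volume arithmetic above, as a function (taking 1 KB = 1,000 bytes):

```typescript
// Estimate trace data volume from request rate, spans per trace,
// and average span size.
function traceVolume(reqPerSec: number, spansPerTrace: number, bytesPerSpan: number) {
  const bytesPerSec = reqPerSec * spansPerTrace * bytesPerSpan;
  return {
    mbPerSec: bytesPerSec / 1e6,            // megabytes per second
    tbPerDay: (bytesPerSec * 86_400) / 1e12, // terabytes per day
  };
}
```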

Sampling strategies, ordered by sophistication:

Head-based (probabilistic) sampling. The sampling decision is made at the root span and propagated to all child spans. Simple to implement. The drawback is that rare events (errors, high-latency outliers) are sampled out at the same rate as normal traffic.

Tail-based sampling. The OpenTelemetry Collector buffers complete traces and applies sampling rules after the trace finishes. This enables policies such as "keep 100% of error traces, 100% of traces exceeding p99 latency, and 5% of successful traces." The tradeoff is operational complexity: the collector must buffer traces in memory, which requires sizing for peak throughput.

Always-sample errors. Regardless of the base sampling rate, retain 100% of traces containing error spans. This is the minimum viable sampling policy for production systems.

A practical configuration: 5% head-based sampling for normal traffic, 100% sampling for errors and requests exceeding the p99 latency threshold.

# OpenTelemetry Collector configuration with tail-based sampling
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
 
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 5000
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: keep-slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
 
  batch:
    timeout: 5s
    send_batch_size: 8192
    send_batch_max_size: 16384
 
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
 
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.monitoring.svc:4317
    tls:
      insecure: true
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]

Metrics and Cardinality

Metrics provide aggregate views of system behavior over time. They are the primary input for dashboards and alerts. The critical constraint is cardinality: every unique combination of label values creates a distinct time series.

Cardinality Math

Consider a metric with labels {method, endpoint, status_bucket, region}:

  Label           Cardinality   Example values
  --------------  -----------   ------------------------------------------------
  method          5             GET, POST, PUT, DELETE, PATCH
  endpoint        50            /api/orders, /api/users, ...
  status_bucket   5             2xx, 3xx, 4xx, 5xx, timeout
  region          4             us-east-1, us-west-2, eu-west-1, ap-southeast-1

Total time series: 5 x 50 x 5 x 4 = 5,000. This is manageable for Prometheus.

Adding userId with 6 million unique values: 5 x 50 x 5 x 4 x 6,000,000 = 30 billion time series. Prometheus will be OOM-killed. This is the single most common way to take down a Prometheus instance.

// DANGEROUS: unbounded cardinality
// userId has millions of unique values
counter.add(1, {
  endpoint: '/api/orders',
  userId: req.userId,  // 6M unique values
});
// Result: millions of time series, Prometheus OOM
 
// SAFE: bounded cardinality
counter.add(1, {
  endpoint: '/api/orders',
  method: 'POST',
  status: `${Math.floor(res.statusCode / 100)}xx`,
  region: 'us-east-1',
});
// Result: ~1000 time series total

General guidance: keep total time series per metric below 10,000. Keep total active time series per Prometheus instance below 5 million for reliable performance. Above that, consider sharding with Thanos, Cortex, or Mimir.

High-cardinality identifiers (user ID, request ID, order ID) belong in trace attributes and log fields, not in metric labels.
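
The cardinality arithmetic reduces to a product over per-label distinct-value counts:

```typescript
// Total time series = product of each label's distinct-value count.
function totalSeries(labelCardinalities: number[]): number {
  return labelCardinalities.reduce((acc, c) => acc * c, 1);
}
```

Running this against a proposed label set before shipping it is a cheap way to catch the userId mistake in code review.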

The RED Method

For service-level metrics, the RED method provides the essential signals:

  • Rate: requests per second
  • Errors: error rate as a percentage of total requests
  • Duration: latency distribution (histogram, not average)

Averages hide tail latency. An average latency of 100ms could mean uniform 100ms responses, or it could mean 99% of responses at 10ms and 1% at 9,100ms. The 1% tail is invisible in the average but represents a significant user experience degradation. Always use histograms and query percentiles (p50, p95, p99).
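
The example can be reproduced numerically. mean() and the nearest-rank percentile() below are illustrative helpers, not a library API:

```typescript
// Mean vs percentile on a skewed latency sample.
function mean(values: number[]): number {
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// Nearest-rank percentile: value at rank ceil(p * n).
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[idx];
}

// 99 requests at 10ms, 1 request at 9,100ms: the average reads ~100ms
// while the median is 10ms and the extreme tail is 9,100ms.
const latencies = [...Array(99).fill(10), 9100];
```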

import { metrics } from '@opentelemetry/api';
 
const meter = metrics.getMeter('order-service');
 
const requestDuration = meter.createHistogram('http_request_duration_ms', {
  description: 'HTTP request duration in milliseconds',
  unit: 'ms',
  advice: {
    explicitBucketBoundaries: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
  },
});
 
const requestCounter = meter.createCounter('http_requests_total', {
  description: 'Total HTTP requests',
});
 
const activeRequests = meter.createUpDownCounter('http_active_requests', {
  description: 'Currently active HTTP requests',
});
 
// Express middleware
app.use((req, res, next) => {
  const start = performance.now();
  activeRequests.add(1, { method: req.method });
 
  res.on('finish', () => {
    const duration = performance.now() - start;
    const labels = {
      method: req.method,
      route: req.route?.path || 'unknown',
      status: `${Math.floor(res.statusCode / 100)}xx`,
    };
 
    requestDuration.record(duration, labels);
    requestCounter.add(1, labels);
    activeRequests.add(-1, { method: req.method });
  });
 
  next();
});

Histogram Bucket Selection

Bucket boundaries should align with SLO thresholds. If the SLO specifies "99% of requests complete in under 500ms," buckets around that boundary need sufficient resolution.

  Default buckets:     [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]

  Bucket count tradeoff:
  +------------------+--------------------+-----------------------+
  | Fewer buckets    | Lower cardinality  | Less accurate         |
  | (e.g., 5)        | ~5x time series    | quantile estimates    |
  +------------------+--------------------+-----------------------+
  | More buckets     | Higher cardinality | More accurate         |
  | (e.g., 20)       | ~20x time series   | quantile estimates    |
  +------------------+--------------------+-----------------------+

  For a histogram with 10 buckets and 100 label combinations:
  Time series = 100 * (10 buckets + 2 for _sum and _count) = 1,200

  For native histograms (Prometheus 2.40+):
  Time series = 100 * 1 = 100 (buckets stored within the series)

Prometheus native histograms (available from version 2.40) store bucket data within a single time series rather than creating one series per bucket. This dramatically reduces cardinality for histogram metrics. Grafana Mimir and Tempo also support native histograms.
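
The series-count arithmetic above as two small functions:

```typescript
// Classic histograms emit one series per bucket plus _sum and _count,
// multiplied by every label combination.
function classicHistogramSeries(labelCombos: number, buckets: number): number {
  return labelCombos * (buckets + 2);
}

// Native histograms store all buckets inside a single series.
function nativeHistogramSeries(labelCombos: number): number {
  return labelCombos;
}
```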

PromQL Reference

Common patterns and their correct usage:

rate() vs irate(): rate() computes the per-second average rate over the entire range window. irate() computes the instantaneous rate using only the last two data points. Use rate() for alerting (smoother signal, fewer false positives). Use irate() for dashboards where short-lived spikes should be visible.
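
The distinction can be illustrated on raw counter samples. The sketch below simplifies Prometheus semantics (it omits extrapolation to the window boundaries) but shows the core behavior, including counter-reset handling:

```typescript
// Simplified rate()/irate() over raw counter samples.
interface Sample { t: number; v: number; } // t in seconds, v = counter value

// Total increase over the samples; a decrease means the counter reset,
// so the post-reset value counts as increase from zero.
function increase(samples: Sample[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    total += delta >= 0 ? delta : samples[i].v;
  }
  return total;
}

// rate(): average per-second increase over the whole window.
function rateOf(samples: Sample[]): number {
  const window = samples[samples.length - 1].t - samples[0].t;
  return increase(samples) / window;
}

// irate(): per-second increase from only the last two samples.
function irateOf(samples: Sample[]): number {
  return rateOf(samples.slice(-2));
}
```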

Aggregation order: Apply rate() before sum(). The expression sum(rate(requests_total[5m])) is correct. The reversed form, rate(sum(requests_total)[5m]), is wrong twice over: as written it is a subquery syntax error, and even a valid subquery form misbehaves because sum() produces a series that rate() cannot treat as a monotonic counter. If a single instance restarts, the aggregated sum dips, and rate() misreads the dip as a full counter reset.

Histogram quantiles: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) produces an estimate whose accuracy depends on bucket boundary placement. If the true p99 falls between widely spaced buckets, the estimate will be imprecise due to linear interpolation.
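
The interpolation can be sketched directly. The function below mirrors histogram_quantile's approach on cumulative bucket counts (simplified: no +Inf bucket handling). With buckets at 100ms and 1000ms, a true p99 that lies anywhere between them gets smeared across the whole range:

```typescript
// Linear interpolation over cumulative histogram buckets,
// in the style of PromQL's histogram_quantile.
interface Bucket { le: number; count: number; } // cumulative counts, le ascending

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Interpolate linearly within the bucket containing the rank.
      const within = (rank - prevCount) / (b.count - prevCount);
      return prevLe + within * (b.le - prevLe);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return buckets[buckets.length - 1].le;
}
```

With 90 observations under 100ms and 10 more under 1000ms, the p99 estimate is 910ms regardless of where in the 100-1000ms range those 10 observations actually fell.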

# Recording rules: pre-compute expensive PromQL expressions
groups:
  - name: service_slo_rules
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p50
        expr: >
          histogram_quantile(0.50,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
 
      - record: job:http_request_duration_seconds:p95
        expr: >
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
 
      - record: job:http_request_duration_seconds:p99
        expr: >
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
 
      - record: job:http_request_errors:ratio_rate5m
        expr: >
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
 
      - record: job:http_requests:rate5m
        expr: >
          sum(rate(http_requests_total[5m])) by (job)

Prometheus Scaling Topology

A single Prometheus instance handles approximately 1-2 million active time series with default resource allocations. Beyond that, or when multi-region aggregation is required, a federation or remote-write architecture is necessary.

  Prometheus Scaling with Thanos:

  Region: us-east-1              Region: eu-west-1
  +-------------------+          +-------------------+
  | Prometheus        |          | Prometheus        |
  | (local scrape)    |          | (local scrape)    |
  |                   |          |                   |
  | Thanos Sidecar    |          | Thanos Sidecar    |
  +--------+----------+          +--------+----------+
           |                              |
           | remote write / StoreAPI      |
           v                              v
  +--------+------------------------------+----------+
  |              Thanos Querier                       |
  |              (global query layer)                 |
  |                                                   |
  |  +-------------+    +-----------+                 |
  |  | Thanos      |    | Thanos    |                 |
  |  | Store       |    | Compactor |                 |
  |  | Gateway     |    |           |                 |
  |  +------+------+    +-----+-----+                 |
  |         |                 |                        |
  +---------+-----------------+------------------------+
            |                 |
            v                 v
  +---------+-----------------+---------+
  |         Object Storage (S3/GCS)     |
  |                                     |
  |  - Long-term retention (years)      |
  |  - Downsampled data for old ranges  |
  |  - Cost: ~$0.023/GB/month (S3)      |
  +-------------------------------------+
            |
            v
  +---------+---------+
  |      Grafana      |
  |  (query Thanos    |
  |   Querier)        |
  +-------------------+

Storage cost estimates for metrics:

  Scale        Active series   Ingestion rate     90-day S3 cost
  -----------  --------------  -----------------  --------------
  Small        500K            ~50K samples/s     ~$15/month
  Medium       2M              ~200K samples/s    ~$60/month
  Large        10M             ~1M samples/s      ~$300/month
  Very large   50M+            ~5M samples/s      ~$1,500/month

These estimates assume Thanos compaction with downsampling enabled (5-minute resolution after 30 days, 1-hour resolution after 90 days).

Exemplars: Bridging Metrics and Traces

Exemplars attach a trace ID to a specific metric observation. When viewing a metric in Grafana, individual data points can link directly to the trace that produced them.

  Exemplar Flow:

  Grafana Dashboard                   Trace Backend (Tempo)
  +-----------------------------+
  | p99 Latency                 |
  |                        *    |     Click the       +------------------+
  |                       / \   | --- exemplar -----> | Trace 7f3a8b2c   |
  |     *    *           /   \  |                     | Waterfall shows  |
  |    / \  / \    *    /     * |                     | exact request    |
  |   /   \/   \  / \  /        |
  |  /          \/   \/         |
  +-----------------------------+                     +------------------+

  Requirements:
  1. Metrics SDK records exemplars (OpenTelemetry does this)
  2. Storage supports exemplars (Prometheus 2.26+)
  3. Visualization renders exemplars (Grafana 8.0+)

When all three components support exemplars, the workflow from aggregate metric anomaly to specific request trace requires a single click. This is the defining capability that separates correlated observability from disconnected signal collection.

Correlation: The Complete Signal Path

The full observability workflow connects all three signal types in a single investigation path.

  The signal correlation path:

  +----------+     +-----------+     +-----------+     +--------+
  | Alert:   |     | Grafana   |     | Tempo     |     | Loki   |
  | "p99 >   |---->| Dashboard |---->| Trace     |---->| Logs   |
  |  500ms"  |     | (click    |     | Waterfall |     | for    |
  |          |     |  exemplar)|     | (click    |     | trace  |
  |          |     |           |     |  span)    |     | 7f3a.. |
  +----------+     +-----------+     +-----------+     +--------+

  Time from alert to root cause: ~2 minutes

  Requirements for this workflow:
  1. Trace IDs in log entries (structured logging)
  2. Exemplars on metrics (OpenTelemetry SDK)
  3. Data source linking in Grafana (Tempo <-> Loki, Tempo <-> Prometheus)
  4. Consistent label mapping (service_name, namespace) across all signals

Grafana data source configuration for correlation:

# Grafana provisioning: datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://thanos-querier.monitoring.svc:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
          urlDisplayLabel: View Trace
 
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo-query-frontend.monitoring.svc:3100
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: false
        mapTagNamesEnabled: true
        mappedTags:
          - key: service.name
            value: service_name
      tracesToMetrics:
        datasourceUid: prometheus
        queries:
          - name: Request rate
            query: sum(rate(http_requests_total{service_name="${__span.tags.service.name}"}[5m]))
          - name: Error rate
            query: sum(rate(http_requests_total{service_name="${__span.tags.service.name}", status=~"5.."}[5m]))
 
  - name: Loki
    type: loki
    uid: loki
    url: http://loki-gateway.monitoring.svc:80
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceId":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo

SLO-Based Alerting

Infrastructure-centric alerts (CPU utilization, memory usage, disk space) are symptoms, not problems. A system can operate at 95% CPU utilization with zero customer impact. Conversely, a deadlocked process uses minimal CPU while causing a complete outage.

SLO-based alerting measures customer-facing impact directly and alerts when the system is consuming its error budget faster than the budget period allows.

Error Budget Calculation

For a 99.9% availability SLO over a 30-day window:

  • Error budget: 0.1% of total requests
  • At 1,000 requests/minute: 1,440,000 requests/day, 43,200,000 requests/month
  • Budget: 43,200 failed requests per month
  • Equivalent downtime: approximately 43 minutes per month
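
The same arithmetic as a function:

```typescript
// Error budget for an availability SLO over a fixed window.
function errorBudget(sloTarget: number, requestsPerMinute: number, windowDays: number) {
  const totalRequests = requestsPerMinute * 60 * 24 * windowDays;
  const budgetFraction = 1 - sloTarget; // e.g. 0.001 for 99.9%
  return {
    failedRequestsAllowed: totalRequests * budgetFraction,
    downtimeMinutesAllowed: windowDays * 24 * 60 * budgetFraction,
  };
}
```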

Multi-Window, Multi-Burn-Rate Alerts

The multi-burn-rate approach, described in the Google SRE Workbook, alerts at different severity levels based on how fast the error budget is being consumed.
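
The burn-rate arithmetic behind the severity tiers: at N times the sustainable consumption rate, a 30-day budget lasts 30/N days.

```typescript
// Hours until the error budget is exhausted at a given burn-rate multiple.
function hoursToBudgetExhaustion(burnRateMultiple: number, windowDays = 30): number {
  return (windowDays * 24) / burnRateMultiple;
}
```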

# SLO: 99.9% of requests succeed (non-5xx) within 500ms
# Error budget: 0.1%
 
groups:
  - name: slo_alerts
    rules:
      # FAST BURN: 14.4x budget consumption rate
      # At this rate, monthly budget exhausted in ~50 hours
      # Severity: page (wake someone up)
      - alert: HighErrorBudgetBurn_Critical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (0.001 * 14.4)
          AND
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (0.001 * 14.4)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burn rate 14.4x sustainable rate"
          impact: "Monthly error budget will be exhausted in ~50 hours"
          runbook_url: "https://runbooks.internal/slo-burn-critical"
 
      # MEDIUM BURN: 6x budget consumption rate
      # At this rate, monthly budget exhausted in ~5 days
      # Severity: ticket (fix during business hours)
      - alert: HighErrorBudgetBurn_Warning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (0.001 * 6)
          AND
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > (0.001 * 6)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Elevated error budget burn rate (6x)"
          impact: "Monthly error budget will be exhausted in ~5 days"
          runbook_url: "https://runbooks.internal/slo-burn-warning"
 
      # SLOW BURN: 3x budget consumption rate
      # At this rate, monthly budget exhausted in ~10 days
      # Severity: ticket (fix this week)
      - alert: HighErrorBudgetBurn_Slow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1d]))
            / sum(rate(http_requests_total[1d]))
          ) > (0.001 * 3)
          AND
          (
            sum(rate(http_requests_total{status=~"5.."}[2h]))
            / sum(rate(http_requests_total[2h]))
          ) > (0.001 * 3)
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "Slow error budget burn detected (3x)"
          impact: "Monthly error budget will be exhausted in ~10 days"
          runbook_url: "https://runbooks.internal/slo-burn-slow"
 
      # Latency SLO: 99% of requests under 500ms
      - alert: LatencySLOBreach
        expr: |
          (
            1 - (
              sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
              / sum(rate(http_request_duration_seconds_count[5m]))
            )
          ) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Latency SLO breach: >1% of requests exceeding 500ms"

The dual-window approach (long window AND short window) reduces false positives. A brief spike that resolves within the short window does not trigger the alert. A sustained problem that appears in both windows does.
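
The burn-rate multipliers and exhaustion times in the rules above are related by a simple identity: at burn rate N, a budget sized for a 30-day window lasts 30/N days. A quick sanity check (helper name is illustrative):

```typescript
// Time until the error budget is exhausted at a given burn rate.
// A burn rate of 1 exhausts the budget exactly at the end of the window.
function hoursToExhaustion(burnRate: number, windowDays: number = 30): number {
  return (windowDays * 24) / burnRate;
}

console.log(hoursToExhaustion(14.4));     // ~50 hours -> page
console.log(hoursToExhaustion(6) / 24);   // ~5 days   -> ticket
console.log(hoursToExhaustion(3) / 24);   // ~10 days  -> ticket
```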

Alerting Decision Framework

  Should this condition generate an alert?

                    +---------------------+
                    | Does it affect      |
                    | customer experience?|
                    +----------+----------+
                               |
                    +----------+----------+
                    |                     |
                   YES                    NO
                    |                     |
              +------+------+       +------+------+
              | Is it       |       | Will it     |
              | happening   |       | become      |
              | now?        |       | customer-   |
              +------+------+       | facing?     |
                     |              +------+------+
               +-----+-----+               |
               |           |         +-----+-----+
              YES          NO        |           |
               |           |        YES          NO
          +----+----+ +----+----+    |      +----+------+
          | PAGE    | | Will    |    |      | Dashboard |
          | (wake   | | it get  |    |      | only.     |
          | on-call)| | worse?  |    |      | No alert. |
          +---------+ +----+----+    |      +-----------+
                           |         |
                      +----+----+    |
                      |         |    |
                     YES        NO   |
                      |         |    |
                 +----+----+ +--+----+---+
                 | TICKET  | | TICKET    |
                 | (fix    | | (fix this |
                 | today)  | | week)     |
                 +---------+ +-----------+

Each alert rule should have a corresponding runbook entry. The runbook should specify:

  1. What the alert means in terms of customer impact
  2. The first diagnostic step (a specific dashboard link, not "check the logs")
  3. Common root causes ranked by frequency
  4. Escalation path if the on-call engineer cannot resolve within the expected timeframe

Dashboard Design

Required Dashboards Per Service

SLO Dashboard. Error budget remaining, burn rate trend, SLI measurements over 7-day and 28-day windows. Audience: engineering leadership, sprint planning.

RED Dashboard. Request rate, error rate, latency percentiles (p50, p95, p99) in real time. Audience: on-call engineers during incidents.

Infrastructure Dashboard. CPU, memory, disk I/O, network throughput for the service's compute resources. Audience: capacity planning.

Dependency Dashboard. Latency and error rates for every downstream service and data store. When a service degrades, this dashboard distinguishes between internal causes and dependency failures.

Dashboard Anti-Patterns

  Anti-pattern                        | Problem                                     | Fix
  ------------------------------------+---------------------------------------------+-------------------------------------------
  Single dashboard with 60+ panels    | Takes 30-45s to load, nobody uses it        | Split into focused dashboards
  Using avg() for latency             | Hides tail latency                          | Use histogram_quantile() for p50/p95/p99
  Auto-scaled Y-axis with no baseline | Normal variation looks alarming             | Set Y-axis min/max based on expected range
  No deploy annotations               | Cannot correlate regressions with releases  | Add annotation source from CI/CD pipeline
  Stale dashboards nobody views       | Maintenance burden, misleading data         | Audit quarterly, archive unused dashboards
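
For the latency anti-pattern specifically, the fix is a bucketed-histogram query rather than an average. A PromQL fragment, assuming the http_request_duration_seconds histogram used in the alert rules earlier:

```promql
# p95 latency over 5-minute windows, aggregated across instances
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```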

Deploy Annotations

Annotating dashboards with deployment events is the single most effective way to correlate service regressions with code changes. Most CI/CD systems can send a webhook to the Grafana annotations API.

// Post-deploy annotation to Grafana
async function annotateDeployment(service: string, version: string): Promise<void> {
  const res = await fetch(`${GRAFANA_URL}/api/annotations`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${GRAFANA_API_KEY}`,
    },
    body: JSON.stringify({
      dashboardUID: SERVICE_DASHBOARD_UID,
      time: Date.now(),
      tags: ['deploy', service, version],
      text: `Deployed ${service} v${version}`,
    }),
  });
  if (!res.ok) {
    // Log and continue: a failed annotation should not fail the deploy.
    console.error(`Grafana annotation failed: ${res.status}`);
  }
}

Observability Maturity Levels

  Level | Capability                                                    | Tooling example                     | MTTR impact
  ------+---------------------------------------------------------------+-------------------------------------+--------------
  0     | console.log, SSH into production                              | None                                | Hours to days
  1     | Centralized logging                                           | ELK, Loki, CloudWatch               | 1-2 hours
  2     | Metrics dashboards and basic alerts                           | Grafana + Prometheus                | 30-60 minutes
  3     | Distributed tracing                                           | Jaeger, Tempo, Datadog APM          | 10-30 minutes
  4     | Correlated signals (logs, traces, metrics linked)             | Grafana + Tempo + Loki + Prometheus | 2-10 minutes
  5     | SLO-driven alerting, error budgets, automated canary analysis | Full stack + SLO tooling            | 2-5 minutes

Most organizations operate at Level 2. The transition from Level 2 to Level 4 produces the largest improvement in incident response capability relative to implementation effort.

Operational Considerations

Storage Cost Management

Observability data is high-volume. Storage costs require active management.

  Signal  | Typical volume (medium scale) | 90-day retention cost (cloud storage)
  --------+-------------------------------+--------------------------------------
  Logs    | 500 GB/day                    | $1,000-3,000/month (Loki/S3)
  Metrics | 2M active series              | $50-150/month (Thanos/S3)
  Traces  | 50 GB/day (5% sampling)       | $100-300/month (Tempo/S3)

Cost reduction strategies:

  • Log sampling: Retain 100% of error/warn, sample debug at 1-10%. A 10% debug sampling rate can reduce log storage by 40-60% with minimal impact on debugging capability.
  • Trace sampling: Tail-based sampling at 5% with 100% error retention.
  • Metrics downsampling: 15-second resolution for recent data, 5-minute for >30 days, 1-hour for >90 days.
  • Retention tiering: Hot storage (SSD) for 7 days, warm (HDD/S3) for 90 days, archive (Glacier/Coldline) for compliance retention.
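
The log-sampling strategy can be implemented as a small decision function at the head of the log pipeline. A sketch, assuming level-based sampling as described above (names are illustrative):

```typescript
// Level-based log sampling: keep all warn/error entries,
// sample debug/info at a configurable rate.
type Level = 'debug' | 'info' | 'warn' | 'error';

function shouldKeepLog(
  level: Level,
  sampleRate: number,              // e.g. 0.1 keeps ~10% of debug/info
  rng: () => number = Math.random, // injectable for deterministic tests
): boolean {
  if (level === 'warn' || level === 'error') return true; // retain 100%
  return rng() < sampleRate;
}
```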

Prometheus Restart Behavior

When Prometheus restarts, a gap appears in time series data. If an alert rule uses a 5-minute rate() window and Prometheus was unavailable for 3 minutes, the first evaluation after restart may produce inaccurate results or no data. This can generate false "resolved" signals during active incidents.

Mitigations: use for clauses on alert rules to require sustained threshold breaches, run redundant Prometheus instances with Thanos deduplication, and configure Alertmanager to require explicit resolution rather than auto-resolving on data gaps.

Instrumentation Timing

Instrumentation should be deployed during normal operations, not during incidents. OpenTelemetry auto-instrumentation for HTTP, gRPC, and common database clients can be deployed in a single work session. The cost of deploying instrumentation proactively is a fraction of the cost of lacking it during an incident.

Validating Observability

Periodically inject known failures (increased latency, error responses, resource exhaustion) and verify that:

  1. Alerts fire within the expected timeframe
  2. Dashboards display the anomaly clearly
  3. Traces capture the failure path
  4. An on-call engineer can identify the root cause using only observability tooling, without SSH access or source code inspection

If any of these checks fail, the observability system has gaps that will surface during real incidents.
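
The injection side of such a drill can be as simple as a wrapper around an existing handler. A sketch under stated assumptions (the config shape and names here are illustrative, not from any fault-injection library):

```typescript
// Fault-injection wrapper for observability validation drills.
// Wraps an async operation and, when configured, adds latency or throws.
interface FaultConfig {
  latencyMs?: number; // artificial delay to inject
  errorRate?: number; // probability in [0, 1] of throwing
}

function withFaults<T>(op: () => Promise<T>, cfg: FaultConfig): () => Promise<T> {
  return async () => {
    if (cfg.latencyMs) {
      await new Promise((resolve) => setTimeout(resolve, cfg.latencyMs));
    }
    if (cfg.errorRate && Math.random() < cfg.errorRate) {
      throw new Error('injected fault (validation drill)');
    }
    return op();
  };
}
```

Running a drill is then a matter of wrapping one handler with, say, `{ latencyMs: 800 }` and confirming the latency alert fires and the slow spans appear in traces.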

OpenTelemetry Collector Deployment

The OpenTelemetry Collector acts as a local aggregation and routing layer between application SDKs and backend storage.

  OTel Collector Deployment (Kubernetes):

  +------------------+     +------------------+     +------------------+
  | Pod: order-svc   |     | Pod: user-svc    |     | Pod: pay-svc     |
  | +------+  +----+ |     | +------+  +----+ |     | +------+  +----+ |
  | | App  |->|OTel| |     | | App  |->|OTel| |     | | App  |->|OTel| |
  | |      |  |SDK | |     | |      |  |SDK | |     | |      |  |SDK | |
  | +------+  +--+-+ |     | +------+  +--+-+ |     | +------+  +--+-+ |
  +---------------+--+     +---------------+--+     +---------------+--+
                  |                        |                        |
                  | OTLP/gRPC             | OTLP/gRPC             | OTLP/gRPC
                  v                        v                        v
  +-----------------------------------------------------------------------+
  |                      OTel Collector (DaemonSet)                       |
  |                                                                       |
  |  Receivers:  OTLP (gRPC:4317, HTTP:4318)                              |
  |  Processors: memory_limiter, batch, tail_sampling                     |
  |  Exporters:  otlp/tempo (traces), prometheusremotewrite (metrics),    |
  |              loki (logs)                                              |
  +--+------------------+------------------+------------------------------+
     |                  |                  |
     v                  v                  v
  +----------+      +------------+      +--------+
  | Tempo    |      | Prometheus |      | Loki   |
  | (traces) |      | / Mimir    |      | (logs) |
  +----------+      | (metrics)  |      +--------+
                    +------------+

# OTel Collector Kubernetes DaemonSet (abbreviated)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.92.0
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
            - containerPort: 8888  # Collector metrics
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 2Gi
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
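
The DaemonSet references an otel-collector-config ConfigMap. A sketch of its contents, matching the receivers, processors, and exporters named in the diagram; the backend endpoints and sampling policies are illustrative placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500          # stay under the 2Gi container limit
  batch:
    timeout: 5s
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors    # 100% error retention
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: baseline       # 5% of everything else
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
  prometheusremotewrite:
    endpoint: http://mimir.monitoring/api/v1/push
  loki:
    endpoint: http://loki.monitoring:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
```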

Summary of Recommendations

  Area        | Recommendation
  ------------+------------------------------------------------------------------------------------
  Logging     | Structured JSON, trace ID in every entry, consistent field names
  Tracing     | OpenTelemetry SDK, tail-based sampling at 5%, 100% error retention
  Metrics     | RED method, bounded cardinality (<10K series per metric), histograms over averages
  Alerting    | SLO-based with multi-window burn rates, runbooks for every alert
  Dashboards  | Four per service (SLO, RED, infrastructure, dependencies), deploy annotations
  Correlation | Exemplars on metrics, trace ID in logs, Grafana data source linking
  Storage     | Tiered retention, log sampling, metric downsampling
  Validation  | Periodic failure injection to verify alerting and diagnostic workflows