Monitoring answers whether a system is functioning. Observability answers why it is not. The distinction matters in distributed architectures where failures cascade across service boundaries and root causes are rarely local to the component that surfaces the error.
This document covers the core engineering practices for building observable systems: structured logging, distributed tracing, metrics with controlled cardinality, and SLO-based alerting. The tooling references center on the Grafana/Prometheus/OpenTelemetry ecosystem, though the principles apply regardless of vendor.
Signal Hierarchy
Logs, metrics, and traces are commonly described as the "three pillars of observability." In practice, their utility during incident response is not equal.
- Traces reveal the execution path of a specific request across services. They are the most useful signal for diagnosing latency and failure propagation.
- Metrics provide aggregate behavior over time. They are the primary mechanism for detecting anomalies and triggering alerts.
- Logs provide arbitrary detail for cases where traces and metrics lack sufficient context. At scale, log-based debugging without trace correlation is operationally expensive.
The debugging workflow differs substantially depending on which signals are available.
Debugging with logs only:
+---------+ +---------+ +----------+ +---------+
| Alert | --> | Search | --> | Find one | --> | Attempt |
| fires | | through | | relevant | | to |
| | | 500 GB | | log line | | identify|
| | | of logs | | (maybe) | | service |
+---------+ +---------+ +----------+ +---------+
Time: ~2 hours
Debugging with traces + metrics + logs:
+---------+ +-----------+ +---------+ +---------+
| Alert | --> | Open | --> | Click | --> | See the |
| fires | | trace for | | into | | exact |
| | | failing | | slow | | log |
| | | request | | span | | line |
+---------+ +-----------+ +---------+ +---------+
Time: ~5 minutes
The difference in mean time to resolution (MTTR) between these two workflows is typically an order of magnitude. Organizations that deploy distributed tracing commonly report MTTR reductions from hours to minutes.
Structured Logging
Unstructured log messages are effectively opaque to automated processing. When services number in the dozens or hundreds, each with different log formats, programmatic correlation becomes impractical.
// Unstructured: not machine-parseable, not correlatable
console.log(`User ${userId} placed order ${orderId} for $${amount}`);
// Structured: every field is queryable, filterable, and correlatable
logger.info({
event: 'order_placed',
userId,
orderId,
amount,
currency: 'USD',
paymentMethod: 'stripe',
traceId: context.traceId,
spanId: context.spanId,
latencyMs: Date.now() - startTime,
});
Structured Logging Requirements
- Every log entry must include a trace ID. Without trace-to-log correlation, there is no way to associate a log entry with the distributed trace that produced it. Grafana Loki and Tempo support native linking between log entries and traces via the trace ID field.
- Field names must be consistent across all services. If one service uses userId, another uses user_id, and a third uses uid, cross-service queries in Loki require unions across all variants. Standardize field names in a shared schema document and enforce compliance through linting or code review.
- Log at service boundaries, not inside tight loops. Placing logger.debug() inside a loop iterating over 100,000 items generates log volume that can saturate the log aggregation pipeline. Log on request entry, request exit, and error conditions.
- Include business context. A log entry reading "Connection timeout after 30s" is less actionable than "Connection timeout after 30s, order_id=ord_8x7k2m, user_id=usr_4n9p1q, region=us-east-1."
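These requirements are easiest to satisfy mechanically rather than by convention. The sketch below is a hypothetical wrapper (makeLogger and getContext are illustrative names, not from any library) that stamps every entry with the active trace and span IDs, so entries are correlatable by construction:

```typescript
// Hypothetical logger wrapper: every entry carries the active trace context
// plus a named event, emitted as one JSON object per line (shipper-friendly).
type LogContext = { traceId: string; spanId: string };

interface LogEntry {
  level: string;
  event: string;
  traceId: string;
  spanId: string;
  [field: string]: unknown;
}

function makeLogger(getContext: () => LogContext) {
  const emit = (level: string, event: string, fields: Record<string, unknown>): LogEntry => {
    const { traceId, spanId } = getContext();
    const entry: LogEntry = { level, event, traceId, spanId, ...fields };
    console.log(JSON.stringify(entry)); // structured output for the shipper
    return entry;
  };
  return {
    info: (event: string, fields: Record<string, unknown> = {}) => emit('info', event, fields),
    error: (event: string, fields: Record<string, unknown> = {}) => emit('error', event, fields),
  };
}
```

A shared schema module exporting the allowed field names can be layered on top, turning the consistency requirement into a compile-time check rather than a code-review item.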
Logging Levels and Their Intended Use
| Level | Purpose | Production sampling |
|---|---|---|
| error | Unrecoverable failures requiring attention | 100%, always retained |
| warn | Degraded operation, recoverable conditions | 100%, always retained |
| info | Request lifecycle events, state transitions | 100% or sampled at 50% |
| debug | Detailed internal state for development use | Sampled at 1-10% |
| trace | Extremely verbose, per-iteration output | Disabled in production |
Log Aggregation Pipeline Architecture
Log Aggregation Flow:
+-------------+ +-------------+ +-----------+
| Service A |---->| | | |
| (stdout) | | | | |
+-------------+ | Fluent Bit | | Loki |
| or |---->| (or |
+-------------+ | Fluentd | | Elastic) |
| Service B |---->| | | |
| (stdout) | | (shipper) | | (store) |
+-------------+ +------+------+ +-----+-----+
| |
+-------------+ | +-----+-----+
| Service C |------+-----+ | Grafana |
| (stdout) | | (query) |
+-------------+ +-----------+
Failure modes at each stage:
1. Service: Excessive log volume --> disk pressure, OOM
2. Shipper: Buffer overflow --> dropped log entries
3. Network: Backpressure --> shipper blocks, app slows
4. Store: Ingestion rate exceeded --> HTTP 429, data loss
5. Query: Expensive query --> timeout, no results
A known failure pattern: the log shipper (Fluent Bit) buffers to disk when the storage backend (Loki) returns HTTP 429 rate-limit responses. If disk buffer limits are not configured, the buffer fills the node disk, triggering kubelet eviction of pods, which generates additional log volume, increasing backpressure further. This is a positive feedback loop. The mitigation is to configure hard disk buffer limits on the shipper so it drops logs rather than filling the disk. Dropping debug-level logs under backpressure is preferable to node-level disruption.
Fluent Bit Configuration for Kubernetes
# fluent-bit configmap
[SERVICE]
Flush 5
Log_Level info
Daemon off
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.path /var/log/flb-storage/
storage.sync normal
storage.backlog.mem_limit 50M
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser cri
DB /var/log/flb_kube.db
Mem_Buf_Limit 10MB
Skip_Long_Lines On
Refresh_Interval 10
storage.type filesystem
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
K8S-Logging.Parser On
[FILTER]
Name grep
Match kube.*
Exclude log ^$
[OUTPUT]
Name loki
Match kube.*
Host loki-gateway.monitoring.svc
Port 80
Labels job=fluent-bit, namespace=$kubernetes['namespace_name'], pod=$kubernetes['pod_name']
Label_Keys $level
BatchWait 1
BatchSize 1048576
LineFormat json
AutoKubernetesLabels off
Distributed Tracing Architecture
A trace represents a single request's path through a distributed system. Each unit of work is represented as a span. Spans have parent-child relationships that form a directed acyclic graph, typically visualized as a waterfall.
Trace Waterfall: POST /api/orders
Trace ID: 7f3a8b2c-d4e5-4f6a-b7c8-9d0e1f2a3b4c
|<---------- 312ms total ---------->|
| |
[API Gateway] POST /orders ============================== 312ms
|
|- [Auth Service] validateToken === 12ms
| |
| '- [Redis] GET session:tok_9x2 = 1ms
|
|- [Order Service] createOrder ======================= 280ms
| |
| |- [PostgreSQL] INSERT orders ======= 28ms
| |
| |- [Inventory Service] reserveItems ========= 62ms
| | |
| | |- [PostgreSQL] SELECT inventory ==== 18ms
| | |
| | '- [Redis] DECR inventory:sku_841 = 2ms
| |
| '- [Payment Service] charge ================= 180ms
| |
| '- [Stripe API] POST /v1/charges ===== 172ms
| External API call.
| Dominates total latency.
| Options: caching, async
| processing, timeout tuning.
|
'- [Notification Service] sendConfirmation = 5ms (async)
|
'- [SES] SendEmail = 45ms (async, not on critical path)
This waterfall immediately identifies that 172ms of the 312ms total is spent in the Stripe API call. The payment service itself adds only 8ms of overhead. This distinction, whether the bottleneck is internal or external, is visible in seconds rather than requiring log correlation across multiple services.
OpenTelemetry SDK Setup
OpenTelemetry is the CNCF standard for instrumentation. It provides a vendor-neutral API and SDK with exporters for Jaeger, Tempo, Zipkin, and commercial backends.
// tracing.ts - OpenTelemetry initialization
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import {
SEMRESATTRS_SERVICE_NAME,
SEMRESATTRS_SERVICE_VERSION,
SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
} from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: 'order-service',
[SEMRESATTRS_SERVICE_VERSION]: '1.4.2',
[SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
}),
traceExporter: new OTLPTraceExporter({
url: 'grpc://otel-collector.monitoring.svc:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'grpc://otel-collector.monitoring.svc:4317',
}),
exportIntervalMillis: 15000,
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': {
ignoreIncomingPaths: ['/healthz', '/readyz', '/metrics'],
},
'@opentelemetry/instrumentation-express': { enabled: true },
'@opentelemetry/instrumentation-pg': { enabled: true },
'@opentelemetry/instrumentation-redis': { enabled: true },
'@opentelemetry/instrumentation-grpc': { enabled: true },
}),
],
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown().then(() => process.exit(0));
});
Manual Span Creation
Auto-instrumentation covers HTTP, gRPC, and database calls. Business logic operations require manual spans.
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service', '1.4.2');
async function createOrder(req: OrderRequest): Promise<Order> {
return tracer.startActiveSpan('createOrder', async (span) => {
try {
span.setAttribute('order.items_count', req.items.length);
span.setAttribute('order.total_cents', req.totalCents);
span.setAttribute('order.currency', req.currency);
span.setAttribute('user.id', req.userId);
span.setAttribute('user.tier', req.userTier);
const order = await db.insert('orders', req);
span.setAttribute('order.id', order.id);
await tracer.startActiveSpan('reserveInventory', async (invSpan) => {
try {
invSpan.setAttribute('inventory.sku_count', req.items.length);
for (const item of req.items) {
invSpan.addEvent('reserving_item', {
'item.sku': item.sku,
'item.quantity': item.quantity,
});
}
await inventoryService.reserve(req.items);
} finally {
invSpan.end(); // end the span even when reserve() throws
}
});
await tracer.startActiveSpan('processPayment', async (paySpan) => {
try {
paySpan.setAttribute('payment.method', req.paymentMethod);
paySpan.setAttribute('payment.provider', 'stripe');
const charge = await paymentService.charge(req);
paySpan.setAttribute('payment.charge_id', charge.id);
} finally {
paySpan.end(); // end the span even when charge() throws
}
});
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
span.recordException(error);
throw error;
} finally {
span.end();
}
});
}
Context Propagation
Tracing requires trace context to propagate across service boundaries. Without propagation, traces fragment into disconnected spans. OpenTelemetry propagates context automatically for HTTP calls via the W3C traceparent header. Other transport mechanisms require explicit handling.
Context Propagation Across Transports:
Service A Service B Service C
+--------+ +--------+ +--------+
| Span 1 |--- HTTP req ---->| Span 2 |--- gRPC req ---->| Span 3 |
| | traceparent: | | traceparent: | |
| | 00-7f3a8b2c- | | 00-7f3a8b2c- | |
| | span1id-01 | | span2id-01 | |
+--------+ +--------+ +--------+
|
| Kafka message
| +----------------------+
| | headers: |
| | traceparent: ... |
| | body: { ... } |
| +----------------------+
| |
| v
| +--------+
| | Worker |
| | Span 4 | <-- same trace ID, new span ID
| +--------+
|
All spans share trace ID 7f3a8b2c.
Transport-specific propagation requirements:
- Kafka: Embed trace context in Kafka message headers. The @opentelemetry/instrumentation-kafkajs package handles this automatically for KafkaJS.
- SQS/SNS: Place trace context in message attributes. AWS X-Ray headers can coexist with W3C traceparent.
- Cron jobs and async workers: No incoming request context exists. Create a new root span. If the job was triggered by a prior request, use span links to associate the new trace with the originating trace.
- Thread/goroutine pools: Context does not transfer across threads automatically. In Node.js, AsyncLocalStorage handles this when the OTel SDK is initialized before application code.
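For a transport with no auto-instrumentation (a homegrown queue, for example), the producer must serialize the context itself. The W3C traceparent value has a fixed shape: a 2-hex-digit version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and a 2-hex-digit flags byte. A minimal encode/parse sketch (helper names are illustrative; spec edge cases such as all-zero IDs are omitted):

```typescript
// W3C traceparent: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>"
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars
  spanId: string;   // 16 lowercase hex chars
  sampled: boolean; // bit 0 of the flags byte
}

function encodeTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null; // malformed or missing: start a new root span instead
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 1) === 1 };
}
```

The consumer keeps the parsed trace ID, generates a fresh span ID for its own span, and records the incoming span ID as the parent, which is what the Kafka instrumentation does under the hood. In production code, prefer the propagation API from @opentelemetry/api over hand-rolling this format.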
Sampling Strategies
At non-trivial request volumes, storing 100% of traces is neither economical nor necessary. A service processing 10,000 requests per second at an average of 5 spans per trace with 1 KB per span generates approximately 50 MB/s of trace data, or 4.3 TB per day.
Sampling strategies, ordered by sophistication:
Head-based (probabilistic) sampling. The sampling decision is made at the root span and propagated to all child spans. Simple to implement. The drawback is that rare events (errors, high-latency outliers) are sampled out at the same rate as normal traffic.
Tail-based sampling. The OpenTelemetry Collector buffers complete traces and applies sampling rules after the trace finishes. This enables policies such as "keep 100% of error traces, 100% of traces exceeding p99 latency, and 5% of successful traces." The tradeoff is operational complexity: the collector must buffer traces in memory, which requires sizing for peak throughput.
Always-sample errors. Regardless of the base sampling rate, retain 100% of traces containing error spans. This is the minimum viable sampling policy for production systems.
A practical policy: keep 100% of error traces, 100% of requests exceeding a high latency threshold, and 5% of the remainder. Because the error and latency decisions require the finished trace, this is implemented as tail-based sampling in the collector.
# OpenTelemetry Collector configuration with tail-based sampling
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
tail_sampling:
decision_wait: 10s
num_traces: 100000
expected_new_traces_per_sec: 5000
policies:
- name: keep-all-errors
type: status_code
status_code:
status_codes:
- ERROR
- name: keep-slow-traces
type: latency
latency:
threshold_ms: 2000
- name: probabilistic-sample
type: probabilistic
probabilistic:
sampling_percentage: 5
batch:
timeout: 5s
send_batch_size: 8192
send_batch_max_size: 16384
memory_limiter:
check_interval: 1s
limit_mib: 2048
spike_limit_mib: 512
exporters:
otlp/tempo:
endpoint: tempo-distributor.monitoring.svc:4317
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, tail_sampling, batch]
exporters: [otlp/tempo]
Metrics and Cardinality
Metrics provide aggregate views of system behavior over time. They are the primary input for dashboards and alerts. The critical constraint is cardinality: every unique combination of label values creates a distinct time series.
Cardinality Math
Consider a metric with labels {method, endpoint, status_bucket, region}:
| Label | Cardinality | Example values |
|---|---|---|
| method | 5 | GET, POST, PUT, DELETE, PATCH |
| endpoint | 50 | /api/orders, /api/users, ... |
| status_bucket | 5 | 2xx, 3xx, 4xx, 5xx, timeout |
| region | 4 | us-east-1, us-west-2, eu-west-1, ap-southeast-1 |
Total time series: 5 x 50 x 5 x 4 = 5,000. This is manageable for Prometheus.
Adding userId with 6 million unique values: 5 x 50 x 5 x 4 x 6,000,000 = 30 billion time series. Prometheus will be OOM-killed. This is the single most common way to take down a Prometheus instance.
// DANGEROUS: unbounded cardinality
// userId has millions of unique values
counter.add(1, {
endpoint: '/api/orders',
userId: req.userId, // 6M unique values
});
// Result: millions of time series, Prometheus OOM
// SAFE: bounded cardinality
counter.add(1, {
endpoint: '/api/orders',
method: 'POST',
status: `${Math.floor(res.statusCode / 100)}xx`,
region: 'us-east-1',
});
// Result: ~1000 time series total
General guidance: keep total time series per metric below 10,000. Keep total active time series per Prometheus instance below 5 million for reliable performance. Above that, consider sharding with Thanos, Cortex, or Mimir.
High-cardinality identifiers (user ID, request ID, order ID) belong in trace attributes and log fields, not in metric labels.
The RED Method
For service-level metrics, the RED method provides the essential signals:
- Rate: requests per second
- Errors: error rate as a percentage of total requests
- Duration: latency distribution (histogram, not average)
Averages hide tail latency. An average latency of 100ms could mean uniform 100ms responses, or it could mean 99% of responses at 10ms and 1% at 9,010ms. The 1% tail is invisible in the average but represents a significant user experience degradation. Always use histograms and query percentiles (p50, p95, p99).
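A synthetic distribution makes the point concrete: 98% of 1,000 samples at 10ms and 2% at 4,510ms also average exactly 100ms, yet p99 sits at 4,510ms (nearest-rank percentile; the data is made up for illustration):

```typescript
// Synthetic latency distribution: 980 fast responses, 20 pathological ones.
const latencies: number[] = [
  ...Array(980).fill(10),   // 98% at 10ms
  ...Array(20).fill(4510),  // 2% at 4,510ms
];

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Nearest-rank percentile: value at rank ceil(q * n) in the sorted data.
function percentile(xs: number[], q: number): number {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.ceil(q * sorted.length) - 1];
}
```

Here mean(latencies) is 100 while percentile(latencies, 0.99) is 4510: a dashboard plotting the average would show a perfectly healthy service.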
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('order-service');
const requestDuration = meter.createHistogram('http_request_duration_ms', {
description: 'HTTP request duration in milliseconds',
unit: 'ms',
advice: {
explicitBucketBoundaries: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
},
});
const requestCounter = meter.createCounter('http_requests_total', {
description: 'Total HTTP requests',
});
const activeRequests = meter.createUpDownCounter('http_active_requests', {
description: 'Currently active HTTP requests',
});
// Express middleware
app.use((req, res, next) => {
const start = performance.now();
activeRequests.add(1, { method: req.method });
res.on('finish', () => {
const duration = performance.now() - start;
const labels = {
method: req.method,
route: req.route?.path || 'unknown',
status: `${Math.floor(res.statusCode / 100)}xx`,
};
requestDuration.record(duration, labels);
requestCounter.add(1, labels);
activeRequests.add(-1, { method: req.method });
});
next();
});
Histogram Bucket Selection
Bucket boundaries should align with SLO thresholds. If the SLO specifies "99% of requests complete in under 500ms," buckets around that boundary need sufficient resolution.
Default buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
Bucket count tradeoff:
+------------------+--------------------+-----------------------+
| Fewer buckets | Lower cardinality | Less accurate |
| (e.g., 5) | ~5x time series | quantile estimates |
+------------------+--------------------+-----------------------+
| More buckets | Higher cardinality | More accurate |
| (e.g., 20) | ~20x time series | quantile estimates |
+------------------+--------------------+-----------------------+
For a histogram with 10 buckets and 100 label combinations:
Time series = 100 * (10 buckets + 2 for _sum and _count) = 1,200
For native histograms (Prometheus 2.40+):
Time series = 100 * 1 = 100 (buckets stored within the series)
Prometheus native histograms (available from version 2.40) store bucket data within a single time series rather than creating one series per bucket. This dramatically reduces cardinality for histogram metrics. Grafana Mimir and Tempo also support native histograms.
PromQL Reference
Common patterns and their correct usage:
rate() vs irate(): rate() computes the per-second average rate over the entire range window. irate() computes the instantaneous rate using only the last two data points. Use rate() for alerting (smoother signal, fewer false positives). Use irate() for dashboards where short-lived spikes should be visible.
Aggregation order: Apply rate() before sum(). The expression sum(rate(requests_total[5m])) is correct. The reverse order, rate(sum(requests_total)[5m:]) (note the subquery syntax required to make it parse at all), is wrong: summing counters first discards per-series reset information, so a restart of any single instance makes the sum drop, which rate() misinterprets as a counter reset of the entire aggregate and turns into a large spurious spike.
Histogram quantiles: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) produces an estimate whose accuracy depends on bucket boundary placement. If the true p99 falls between widely spaced buckets, the estimate will be imprecise due to linear interpolation.
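The interpolation can be sketched in a few lines (simplified: classic cumulative buckets with finite le bounds, no +Inf or edge-case handling):

```typescript
// Simplified sketch of the linear interpolation histogram_quantile() performs.
interface Bucket { le: number; count: number } // cumulative count of obs <= le

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // All observations inside a bucket are assumed uniformly distributed,
      // so the estimate is a straight line between the bucket's bounds.
      return prevLe + ((b.le - prevLe) * (rank - prevCount)) / (b.count - prevCount);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return buckets[buckets.length - 1].le;
}
```

With buckets at le=0.5 and le=1.0, any true p99 between 500ms and 1s is reported somewhere on that straight line; the only fixes for a consistently misleading estimate are better-placed buckets or native histograms.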
# Recording rules: pre-compute expensive PromQL expressions
groups:
- name: service_slo_rules
interval: 30s
rules:
- record: job:http_request_duration_seconds:p50
expr: >
histogram_quantile(0.50,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
)
- record: job:http_request_duration_seconds:p95
expr: >
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
)
- record: job:http_request_duration_seconds:p99
expr: >
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
)
- record: job:http_request_errors:ratio_rate5m
expr: >
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
- record: job:http_requests:rate5m
expr: >
sum(rate(http_requests_total[5m])) by (job)
Prometheus Scaling Topology
A single Prometheus instance handles approximately 1-2 million active time series with default resource allocations. Beyond that, or when multi-region aggregation is required, a federation or remote-write architecture is necessary.
Prometheus Scaling with Thanos:
Region: us-east-1 Region: eu-west-1
+-------------------+ +-------------------+
| Prometheus | | Prometheus |
| (local scrape) | | (local scrape) |
| | | |
| Thanos Sidecar | | Thanos Sidecar |
+--------+----------+ +--------+----------+
| |
| remote write / StoreAPI |
v v
+--------+------------------------------+----------+
| Thanos Querier |
| (global query layer) |
| |
| +-------------+ +-----------+ |
| | Thanos | | Thanos | |
| | Store | | Compactor | |
| | Gateway | | | |
| +------+------+ +-----+-----+ |
| | | |
+---------+-----------------+------------------------+
| |
v v
+---------+-----------------+---------+
| Object Storage (S3/GCS) |
| |
| - Long-term retention (years) |
| - Downsampled data for old ranges |
| - Cost: ~$0.023/GB/month (S3) |
+-------------------------------------+
|
v
+---------+---------+
| Grafana |
| (query Thanos |
| Querier) |
+-------------------+
Storage cost estimates for metrics:
| Scale | Active series | Ingestion rate | 90-day S3 cost |
|---|---|---|---|
| Small | 500K | ~50K samples/s | ~$15/month |
| Medium | 2M | ~200K samples/s | ~$60/month |
| Large | 10M | ~1M samples/s | ~$300/month |
| Very large | 50M+ | ~5M samples/s | ~$1,500/month |
These estimates assume Thanos compaction with downsampling enabled (5-minute resolution after 30 days, 1-hour resolution after 90 days).
Exemplars: Bridging Metrics and Traces
Exemplars attach a trace ID to a specific metric observation. When viewing a metric in Grafana, individual data points can link directly to the trace that produced them.
Exemplar Flow:
Grafana Dashboard Trace Backend (Tempo)
+-----------------------------+
| p99 Latency |
| * | Click the +------------------+
| / \ | --- exemplar -----> | Trace 7f3a8b2c |
| * * / \ | | Waterfall shows |
| / \ / \ * / * | | exact request |
| / \/ \ / \ / | | that caused the |
| / \/ \/ | | latency spike |
+-----------------------------+ +------------------+
Requirements:
1. Metrics SDK records exemplars (OpenTelemetry does this)
2. Storage supports exemplars (Prometheus 2.26+)
3. Visualization renders exemplars (Grafana 8.0+)
When all three components support exemplars, the workflow from aggregate metric anomaly to specific request trace requires a single click. This is the defining capability that separates correlated observability from disconnected signal collection.
Correlation: The Complete Signal Path
The full observability workflow connects all three signal types in a single investigation path.
The signal correlation path:
+----------+ +-----------+ +-----------+ +--------+
| Alert: | | Grafana | | Tempo | | Loki |
| "p99 > |---->| Dashboard |---->| Trace |---->| Logs |
| 500ms" | | (click | | Waterfall | | for |
| | | exemplar)| | (click | | trace |
| | | | | span) | | 7f3a.. |
+----------+ +-----------+ +-----------+ +--------+
Time from alert to root cause: ~2 minutes
Requirements for this workflow:
1. Trace IDs in log entries (structured logging)
2. Exemplars on metrics (OpenTelemetry SDK)
3. Data source linking in Grafana (Tempo <-> Loki, Tempo <-> Prometheus)
4. Consistent label mapping (service_name, namespace) across all signals
Grafana data source configuration for correlation:
# Grafana provisioning: datasources.yaml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
url: http://thanos-querier.monitoring.svc:9090
jsonData:
exemplarTraceIdDestinations:
- name: traceID
datasourceUid: tempo
urlDisplayLabel: View Trace
- name: Tempo
type: tempo
uid: tempo
url: http://tempo-query-frontend.monitoring.svc:3100
jsonData:
tracesToLogs:
datasourceUid: loki
filterByTraceID: true
filterBySpanID: false
mapTagNamesEnabled: true
mappedTags:
- key: service.name
value: service_name
tracesToMetrics:
datasourceUid: prometheus
queries:
- name: Request rate
query: sum(rate(http_requests_total{service_name="${__span.tags.service.name}"}[5m]))
- name: Error rate
query: sum(rate(http_requests_total{service_name="${__span.tags.service.name}", status=~"5.."}[5m]))
- name: Loki
type: loki
uid: loki
url: http://loki-gateway.monitoring.svc:80
jsonData:
derivedFields:
- name: TraceID
matcherRegex: '"traceId":"(\w+)"'
url: '$${__value.raw}'
datasourceUid: tempo
SLO-Based Alerting
Infrastructure-centric alerts (CPU utilization, memory usage, disk space) are symptoms, not problems. A system can operate at 95% CPU utilization with zero customer impact. Conversely, a deadlocked process uses minimal CPU while causing a complete outage.
SLO-based alerting measures customer-facing impact directly and alerts when the system is consuming its error budget faster than the budget period allows.
Error Budget Calculation
For a 99.9% availability SLO over a 30-day window:
- Error budget: 0.1% of total requests
- At 1,000 requests/minute: 1,440,000 requests/day, 43,200,000 requests/month
- Budget: 43,200 failed requests per month
- Equivalent downtime: approximately 43 minutes per month
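The arithmetic generalizes to any SLO target, traffic level, and window; a small helper (hypothetical name) reproduces the numbers above:

```typescript
// Error budget for an availability SLO over a rolling window.
function errorBudget(sloTarget: number, requestsPerMinute: number, windowDays: number) {
  const totalRequests = requestsPerMinute * 60 * 24 * windowDays;
  const budgetFraction = 1 - sloTarget; // 0.001 for a 99.9% target
  return {
    totalRequests,
    budgetRequests: totalRequests * budgetFraction,        // failures allowed in the window
    budgetMinutes: windowDays * 24 * 60 * budgetFraction,  // full-outage equivalent
  };
}
```

errorBudget(0.999, 1000, 30) yields 43,200,000 total requests, a budget of 43,200 failed requests, and roughly 43 minutes of full-outage equivalent downtime per month.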
Multi-Window, Multi-Burn-Rate Alerts
The multi-burn-rate approach, described in the Google SRE Workbook, alerts at different severity levels based on how fast the error budget is being consumed.
# SLO: 99.9% of requests succeed (non-5xx) within 500ms
# Error budget: 0.1%
groups:
- name: slo_alerts
rules:
# FAST BURN: 14.4x budget consumption rate
# At this rate, monthly budget exhausted in ~50 hours
# Severity: page (wake someone up)
- alert: HighErrorBudgetBurn_Critical
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/ sum(rate(http_requests_total[1h]))
) > (0.001 * 14.4)
AND
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
) > (0.001 * 14.4)
for: 2m
labels:
severity: page
annotations:
summary: "Error budget burn rate 14.4x sustainable rate"
impact: "Monthly error budget will be exhausted in ~50 hours"
runbook_url: "https://runbooks.internal/slo-burn-critical"
# MEDIUM BURN: 6x budget consumption rate
# At this rate, monthly budget exhausted in ~5 days
# Severity: ticket (fix during business hours)
- alert: HighErrorBudgetBurn_Warning
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/ sum(rate(http_requests_total[6h]))
) > (0.001 * 6)
AND
(
sum(rate(http_requests_total{status=~"5.."}[30m]))
/ sum(rate(http_requests_total[30m]))
) > (0.001 * 6)
for: 15m
labels:
severity: ticket
annotations:
summary: "Elevated error budget burn rate (6x)"
impact: "Monthly error budget will be exhausted in ~5 days"
runbook_url: "https://runbooks.internal/slo-burn-warning"
# SLOW BURN: 3x budget consumption rate
# At this rate, monthly budget exhausted in ~10 days
# Severity: ticket (fix this week)
- alert: HighErrorBudgetBurn_Slow
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1d]))
/ sum(rate(http_requests_total[1d]))
) > (0.001 * 3)
AND
(
sum(rate(http_requests_total{status=~"5.."}[2h]))
/ sum(rate(http_requests_total[2h]))
) > (0.001 * 3)
for: 30m
labels:
severity: ticket
annotations:
summary: "Slow error budget burn detected (3x)"
impact: "Monthly error budget will be exhausted in ~10 days"
runbook_url: "https://runbooks.internal/slo-burn-slow"
# Latency SLO: 99% of requests under 500ms
- alert: LatencySLOBreach
expr: |
(
1 - (
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/ sum(rate(http_request_duration_seconds_count[5m]))
)
) > 0.01
for: 5m
labels:
severity: page
annotations:
summary: "Latency SLO breach: >1% of requests exceeding 500ms"
The dual-window approach (long window AND short window) reduces false positives. A brief spike that resolves within the short window does not trigger the alert. A sustained problem that appears in both windows does.
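The burn-rate multipliers are not arbitrary. A burn rate is the observed error rate divided by the budget rate (0.001 for a 99.9% SLO), and the time to exhaust the budget is the window divided by the burn rate; 14.4x corresponds to spending 2% of a 30-day budget in a single hour. A quick check of the numbers used in the comments above:

```typescript
// Burn rate = observed error rate / SLO budget rate (0.001 for 99.9%).
// Time to exhaust the budget shrinks linearly with the burn rate.
function hoursToExhaustion(burnRate: number, windowDays = 30): number {
  return (windowDays * 24) / burnRate;
}

// Fraction of the window's budget consumed while the burn is sustained.
function budgetConsumed(burnRate: number, hoursSustained: number, windowDays = 30): number {
  return (burnRate * hoursSustained) / (windowDays * 24);
}
```

hoursToExhaustion(14.4) is 50 hours, hoursToExhaustion(6) is 120 hours (5 days), and hoursToExhaustion(3) is 240 hours (10 days), matching the severity tiers; budgetConsumed(14.4, 1) confirms the 2%-per-hour interpretation.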
Alerting Decision Framework
Should this condition generate an alert?
+---------------------+
| Does it affect |
| customer experience?|
+----------+----------+
|
+----------+----------+
| |
YES NO
| |
+-----+------+ +-----+------+
| Is it | | Will it |
| happening | | become |
| now? | | customer- |
+-----+-------+ | facing? |
| +-----+-------+
+-----+-----+ |
| | +-----+-----+
YES NO | |
| | YES NO
+----+----+ +----+----+ | +----+-----+
| PAGE | | Will | | | Dashboard |
| (wake | | it get | | | only. |
| on-call)| | worse? | | | No alert. |
+---------+ +---+----+ | +----------+
| |
+----+---+ |
| | |
YES NO |
| | |
+----+---+ +--+----++
| TICKET | | TICKET |
| (fix | | (fix |
| today) | | this |
+--------+ | week) |
+-------+
Each alert rule should have a corresponding runbook entry. The runbook should specify:
- What the alert means in terms of customer impact
- The first diagnostic step (a specific dashboard link, not "check the logs")
- Common root causes ranked by frequency
- Escalation path if the on-call engineer cannot resolve within the expected timeframe
Dashboard Design
Required Dashboards Per Service
SLO Dashboard. Error budget remaining, burn rate trend, SLI measurements over 7-day and 28-day windows. Audience: engineering leadership, sprint planning.
RED Dashboard. Request rate, error rate, latency percentiles (p50, p95, p99) in real time. Audience: on-call engineers during incidents.
Infrastructure Dashboard. CPU, memory, disk I/O, network throughput for the service's compute resources. Audience: capacity planning.
Dependency Dashboard. Latency and error rates for every downstream service and data store. When a service degrades, this dashboard distinguishes between internal causes and dependency failures.
Dashboard Anti-Patterns
| Anti-pattern | Problem | Fix |
|---|---|---|
| Single dashboard with 60+ panels | Takes 30-45s to load, nobody uses it | Split into focused dashboards |
| Using avg() for latency | Hides tail latency | Use histogram_quantile() for p50/p95/p99 |
| Auto-scaled Y-axis with no baseline | Normal variation looks alarming | Set Y-axis min/max based on expected range |
| No deploy annotations | Cannot correlate regressions with releases | Add annotation source from CI/CD pipeline |
| Stale dashboards nobody views | Maintenance burden, misleading data | Audit quarterly, archive unused dashboards |
Deploy Annotations
Annotating dashboards with deployment events is the single most effective way to correlate service regressions with code changes. Most CI/CD systems can send a webhook to the Grafana annotations API.
// Post-deploy annotation to Grafana
async function annotateDeployment(service: string, version: string): Promise<void> {
await fetch(`${GRAFANA_URL}/api/annotations`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
Authorization: `Bearer ${GRAFANA_API_KEY}`,
},
body: JSON.stringify({
dashboardUID: SERVICE_DASHBOARD_UID,
time: Date.now(),
tags: ['deploy', service, version],
text: `Deployed ${service} v${version}`,
}),
});
}
Observability Maturity Levels
| Level | Capability | Tooling example | MTTR impact |
|---|---|---|---|
| 0 | console.log, SSH into production | None | Hours to days |
| 1 | Centralized logging | ELK, Loki, CloudWatch | 1-2 hours |
| 2 | Metrics dashboards and basic alerts | Grafana + Prometheus | 30-60 minutes |
| 3 | Distributed tracing | Jaeger, Tempo, Datadog APM | 10-30 minutes |
| 4 | Correlated signals (logs, traces, metrics linked) | Grafana + Tempo + Loki + Prometheus | 2-10 minutes |
| 5 | SLO-driven alerting, error budgets, automated canary analysis | Full stack + SLO tooling | 2-5 minutes |
Most organizations operate at Level 2. The transition from Level 2 to Level 4 produces the largest improvement in incident response capability relative to implementation effort.
Operational Considerations
Storage Cost Management
Observability data is high-volume. Storage costs require active management.
| Signal | Typical volume (medium scale) | 90-day retention cost (cloud storage) |
|---|---|---|
| Logs | 500 GB/day | $1,000-3,000/month (Loki/S3) |
| Metrics | 2M active series | $50-150/month (Thanos/S3) |
| Traces | 50 GB/day (5% sampling) | $100-300/month (Tempo/S3) |
Cost reduction strategies:
- Log sampling: Retain 100% of error/warn, sample debug at 1-10%. A 10% debug sampling rate can reduce log storage by 40-60% with minimal impact on debugging capability.
- Trace sampling: Tail-based sampling at 5% with 100% error retention.
- Metrics downsampling: 15-second resolution for recent data, 5-minute for >30 days, 1-hour for >90 days.
- Retention tiering: Hot storage (SSD) for 7 days, warm (HDD/S3) for 90 days, archive (Glacier/Coldline) for compliance retention.
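The log-sampling strategy above amounts to a keep/drop decision at the logger level. A minimal sketch (the level names and rates are illustrative; the RNG is injectable only to make the decision testable):

```typescript
type Level = 'debug' | 'info' | 'warn' | 'error';

// Per-level sampling rates: retain all warn/error, sample debug at 10%.
const SAMPLE_RATES: Record<Level, number> = {
  error: 1.0,
  warn: 1.0,
  info: 1.0,
  debug: 0.1,
};

// Returns true if a log entry at this level should be emitted.
function shouldEmit(level: Level, rng: () => number = Math.random): boolean {
  return rng() < SAMPLE_RATES[level];
}
```

In practice, hashing the trace ID instead of calling an RNG per entry keeps the decision consistent within a request, so a sampled request retains all of its debug lines rather than a random subset.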
Prometheus Restart Behavior
When Prometheus restarts, a gap appears in time series data. If an alert rule uses a 5-minute rate() window and Prometheus was unavailable for 3 minutes, the first evaluation after restart may produce inaccurate results or no data. This can generate false "resolved" signals during active incidents.
Mitigations: use for clauses on alert rules to require sustained threshold breaches, run redundant Prometheus instances with Thanos deduplication, and configure Alertmanager to require explicit resolution rather than auto-resolving on data gaps.
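The `for` clause mitigation looks like this in a Prometheus rule file (a sketch; the metric name `http_requests_total` and the 1% threshold are assumptions):

```yaml
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        # The breach must persist for 10 minutes before the alert fires,
        # so a brief post-restart data gap does not flap the alert state.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 10 minutes"
```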
Instrumentation Timing
Instrumentation should be deployed during normal operations, not during incidents. OpenTelemetry auto-instrumentation for HTTP, gRPC, and common database clients can be deployed in a single work session. The cost of deploying instrumentation proactively is a fraction of the cost of lacking it during an incident.
Validating Observability
Periodically inject known failures (increased latency, error responses, resource exhaustion) and verify that:
- Alerts fire within the expected timeframe
- Dashboards display the anomaly clearly
- Traces capture the failure path
- An on-call engineer can identify the root cause using only observability tooling, without SSH access or source code inspection
If any of these checks fail, the observability system has gaps that will surface during real incidents.
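A minimal fault-injection wrapper for these validation drills might look like the following. This is a sketch, not a chaos-engineering framework API; the handler shape and injection knobs are assumptions:

```typescript
interface FaultConfig {
  errorRate: number;      // probability of returning a synthetic 500
  addedLatencyMs: number; // fixed latency added to every call
}

type Handler = (req: unknown) => Promise<{ status: number; body: string }>;

// Wraps a request handler with configurable latency and error injection.
// The RNG is injectable so the fault decision is deterministic in tests.
function withFaults(handler: Handler, cfg: FaultConfig,
                    rng: () => number = Math.random): Handler {
  return async (req) => {
    await new Promise((resolve) => setTimeout(resolve, cfg.addedLatencyMs));
    if (rng() < cfg.errorRate) {
      return { status: 500, body: 'injected fault' };
    }
    return handler(req);
  };
}
```

Running a drill is then a matter of deploying the wrapped handler with a small `errorRate` or `addedLatencyMs` and confirming the alert, dashboard, and trace checks above all fire.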
OpenTelemetry Collector Deployment
The OpenTelemetry Collector acts as a local aggregation and routing layer between application SDKs and backend storage.
OTel Collector Deployment (Kubernetes):
+------------------+ +------------------+ +------------------+
| Pod: order-svc | | Pod: user-svc | | Pod: pay-svc |
| +------+ +----+ | | +------+ +----+ | | +------+ +----+ |
| | App |->|OTel| | | | App |->|OTel| | | | App |->|OTel| |
| | | |SDK | | | | | |SDK | | | | | |SDK | |
| +------+ +--+-+ | | +------+ +--+-+ | | +------+ +--+-+ |
+---------------+--+ +---------------+--+ +---------------+--+
| | |
| OTLP/gRPC | OTLP/gRPC |
v v v
+-----------------------------------------------------------------------+
| OTel Collector (DaemonSet) |
| |
| Receivers: OTLP (gRPC:4317, HTTP:4318) |
| Processors: memory_limiter, batch, tail_sampling |
| Exporters: otlp/tempo (traces), prometheusremotewrite (metrics), |
| loki (logs) |
+--+------------------+------------------+------------------------------+
   |                  |                  |
   v                  v                  v
+------------+   +-------------+   +-----------+
|   Tempo    |   | Prometheus  |   |   Loki    |
|  (traces)  |   |  / Mimir    |   |  (logs)   |
+------------+   |  (metrics)  |   +-----------+
                 +-------------+
# OTel Collector Kubernetes DaemonSet (abbreviated)
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: otel-collector
namespace: monitoring
spec:
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: collector
image: otel/opentelemetry-collector-contrib:0.92.0
args: ["--config=/etc/otel/config.yaml"]
ports:
- containerPort: 4317 # OTLP gRPC
- containerPort: 4318 # OTLP HTTP
- containerPort: 8888 # Collector metrics
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 1000m
memory: 2Gi
volumeMounts:
- name: config
mountPath: /etc/otel
volumes:
- name: config
configMap:
          name: otel-collector-config
Summary of Recommendations
| Area | Recommendation |
|---|---|
| Logging | Structured JSON, trace ID in every entry, consistent field names |
| Tracing | OpenTelemetry SDK, tail-based sampling at 5%, 100% error retention |
| Metrics | RED method, bounded cardinality (<10K series per metric), histograms over averages |
| Alerting | SLO-based with multi-window burn rates, runbooks for every alert |
| Dashboards | Four per service (SLO, RED, infrastructure, dependencies), deploy annotations |
| Correlation | Exemplars on metrics, trace ID in logs, Grafana data source linking |
| Storage | Tiered retention, log sampling, metric downsampling |
| Validation | Periodic failure injection to verify alerting and diagnostic workflows |