Microservices Observability Architecture: The Three Pillars and Beyond
Build comprehensive observability for microservices with distributed tracing, metrics, logs, and emerging eBPF-based techniques using OpenTelemetry, Prometheus, Jaeger, and Grafana.
Introduction
In the world of distributed systems, understanding what happens inside your applications has never been more challenging—or more critical. As organizations transition from monolithic architectures to microservices, traditional monitoring approaches fall short. You cannot simply look at a single server's CPU or memory to understand system health. Instead, you need observability—the ability to understand the internal state of your system by examining its external outputs.
This article explores the comprehensive architecture for implementing observability in microservices environments, covering:
- The Three Pillars - Metrics, Logs, and Traces
- OpenTelemetry - The unified standard for instrumentation
- Correlation Strategies - Connecting signals across pillars
- eBPF-Based Observability - The next frontier
- SLIs, SLOs, and Error Budgets - Reliability engineering
- Cost Optimization - Scaling observability economically
Key Insight: Observability is not just about collecting data—it's about asking arbitrary questions of your system without having to know those questions in advance.
The Three Pillars of Observability
The observability ecosystem is built on three complementary signals, each providing unique insights into system behavior:
| Pillar | What It Captures | Best For | Questions Answered |
|---|---|---|---|
| Metrics | Numeric measurements (counters, gauges, histograms) | Alerting, dashboards, capacity planning | What is happening? (Real-time health) |
| Logs | Discrete events with context | Debugging, audit, compliance | Why did it happen? (Root cause) |
| Traces | Request flow across services | Performance analysis, dependency mapping | Where did it happen? (Request flow) |
Understanding Each Pillar
Metrics provide aggregated, time-series data perfect for dashboards and alerting. They answer "what" questions cheaply, as long as label cardinality stays low.
Logs capture discrete events with rich context. They answer "why" questions when debugging issues.
Traces follow requests across service boundaries. They answer "where" questions about distributed flows.
Observability Platform Architecture
A production-grade observability platform requires careful orchestration of multiple components:
(Diagram: microservices observability architecture)
Key Components
| Layer | Components | Purpose |
|---|---|---|
| Collection | OTel Collector, Prometheus Scraper, Fluent Bit | Gather telemetry from applications |
| Processing | Kafka, Spark Streaming | Buffer, enrich, and transform data |
| Storage | Prometheus, Thanos, Loki, Tempo | Store metrics, logs, and traces |
| Visualization | Grafana, Jaeger UI | Query and visualize data |
| Alerting | Alertmanager, PagerDuty | Notify on-call teams |
OpenTelemetry: The Unified Standard
OpenTelemetry has emerged as the industry standard for instrumenting applications. It provides vendor-neutral APIs, SDKs, and tools for generating telemetry data.
OpenTelemetry Components
| Component | Function |
|---|---|
| API | Defines interfaces for instrumentation |
| SDK | Implements API with providers and exporters |
| Collector | Receives, processes, and exports telemetry |
| Auto-instrumentation | Zero-code instrumentation for frameworks |
Manual Instrumentation (Go)
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer trace.Tracer

func init() {
	tracer = otel.Tracer("order-service")
}

func ProcessOrder(ctx context.Context, order Order) error {
	ctx, span := tracer.Start(ctx, "ProcessOrder",
		trace.WithAttributes(
			attribute.String("order.id", order.ID),
			attribute.String("order.customer_id", order.CustomerID),
			attribute.Float64("order.total", order.Total),
		),
	)
	defer span.End()

	// Validate order
	if err := validateOrder(ctx, order); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "validation failed")
		return err
	}

	// Process payment (with child span)
	if err := processPayment(ctx, order); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "payment failed")
		return err
	}

	span.SetStatus(codes.Ok, "order processed successfully")
	return nil
}
Context Propagation Across Services
// HTTP client with context propagation.
// propagation comes from go.opentelemetry.io/otel/propagation.
func callInventoryService(ctx context.Context, items []Item) error {
	ctx, span := tracer.Start(ctx, "CheckInventory",
		trace.WithSpanKind(trace.SpanKindClient),
	)
	defer span.End()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"http://inventory-service/check",
		marshalItems(items))
	if err != nil {
		span.RecordError(err)
		return err
	}

	// Inject trace context into HTTP headers
	otel.GetTextMapPropagator().Inject(ctx,
		propagation.HeaderCarrier(req.Header))

	resp, err := httpClient.Do(req)
	if err != nil {
		span.RecordError(err)
		return err
	}
	defer resp.Body.Close()

	span.SetAttributes(attribute.Int("http.status_code", resp.StatusCode))
	return nil
}
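Under the hood, the W3C Trace Context propagator serializes the span context into a `traceparent` header of the form `version-traceid-spanid-flags`. A stdlib-only sketch of that wire format (the real `propagation.TraceContext` also validates versions and carries `tracestate`; the helper names here are illustrative):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// buildTraceparent formats a W3C traceparent value: version "00",
// a 16-byte trace ID and 8-byte span ID in hex, and sampling flags.
func buildTraceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

// parseTraceparent recovers the IDs on the receiving service, which is
// how a child span ends up in the same trace as its parent.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, ok bool) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, false
	}
	return parts[1], parts[2], parts[3] == "01", true
}

func main() {
	req, _ := http.NewRequest("POST", "http://inventory-service/check", nil) // static URL, error ignored
	// Inject: roughly what the propagator does for traceparent.
	req.Header.Set("traceparent",
		buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true))
	// Extract on the server side restores the same IDs.
	tid, sid, sampled, ok := parseTraceparent(req.Header.Get("traceparent"))
	fmt.Println(tid, sid, sampled, ok)
}
```

Because the header is plain text, any hop that forwards HTTP headers unchanged (proxies, gateways) preserves the trace.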
Collector Deployment Patterns
(Diagram: collector deployment topology)
| Pattern | Pros | Cons | Use Case |
|---|---|---|---|
| Sidecar | Isolation, per-app config | Resource overhead | Multi-tenant, strict isolation |
| DaemonSet | Efficient, shared resources | Shared config | Homogeneous workloads |
| Gateway | Centralized processing | Single point of failure | External traffic, edge processing |
| Combined | Best of both worlds | Complexity | Large-scale production |
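As a sketch of the combined pattern, a per-node agent batches locally and forwards to a central gateway, which runs the expensive processing (tail sampling, redaction) once. A minimal agent-side Collector config under that assumption (the gateway Service name is illustrative):

```yaml
# Agent (DaemonSet) config: receive locally, protect memory, batch, forward.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
  batch:
    timeout: 5s
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc:4317  # central gateway (illustrative)
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Keeping agents to receive-batch-forward makes them cheap per node, while gateway replicas can be scaled independently of the workloads.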
Metrics Collection Pipeline
A robust metrics pipeline ensures reliable data collection and storage:
Prometheus Recording Rules
groups:
  - name: order_service_rules
    interval: 30s
    rules:
      # Pre-aggregate request rate by service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, method, status_code)

      # Error ratio (5xx / total), not a plain rate
      - record: service:http_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # P99 latency by service
      - record: service:http_latency:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
          )
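The `histogram_quantile` call above estimates the quantile by locating the cumulative bucket that contains the target rank and interpolating linearly inside it. A stdlib-only sketch of that estimation (bucket boundaries are illustrative, and the real function additionally handles the mandatory `+Inf` bucket):

```go
package main

import "fmt"

// bucket mirrors one cumulative histogram bucket: count of observations <= le.
type bucket struct {
	le    float64 // upper bound in seconds
	count float64 // cumulative count
}

// quantile performs the linear interpolation that histogram_quantile
// applies to cumulative buckets.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	prevLE, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			// The target rank falls in this bucket: interpolate within it.
			return prevLE + (b.le-prevLE)*(rank-prevCount)/(b.count-prevCount)
		}
		prevLE, prevCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// 100 requests: 60 under 100ms, 30 more under 250ms, 10 more under 1s.
	buckets := []bucket{{0.1, 60}, {0.25, 90}, {1, 100}}
	fmt.Printf("p99 ≈ %.3fs\n", quantile(0.99, buckets))
}
```

This is also why bucket layout matters: the estimate can never be more precise than the width of the bucket the quantile lands in.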
Log Aggregation Architecture
(Diagram: Grafana Loki log aggregation architecture)
Structured Logging Best Practices
// kv() is StructuredArguments.kv from logstash-logback-encoder;
// Span is io.opentelemetry.api.trace.Span.
@Service
public class OrderService {

    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public Order processOrder(OrderRequest request) {
        MDC.put("orderId", request.getOrderId());
        MDC.put("customerId", request.getCustomerId());
        MDC.put("traceId", Span.current().getSpanContext().getTraceId());
        try {
            log.info("Processing order",
                kv("orderTotal", request.getTotal()),
                kv("itemCount", request.getItems().size()),
                kv("paymentMethod", request.getPaymentMethod())
            );

            Order order = createOrder(request);

            log.info("Order processed successfully",
                kv("orderId", order.getId()),
                kv("status", order.getStatus()),
                kv("processingTimeMs", order.getProcessingTime())
            );
            return order;
        } catch (PaymentException e) {
            log.error("Payment processing failed",
                kv("errorCode", e.getErrorCode()),
                kv("errorMessage", e.getMessage()),
                e
            );
            throw e;
        } finally {
            MDC.clear();
        }
    }
}
Correlating Traces, Metrics, and Logs
The true power of observability emerges when you can seamlessly navigate between pillars using correlation keys:
| Correlation Key | Purpose |
|---|---|
| Trace ID | Links all signals for a single request |
| Service Name | Groups signals by service |
| Timestamp | Aligns signals in time windows |
| Request ID | Application-level correlation |
Exemplar-Based Correlation
// Recording exemplars in Go
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request duration with exemplars",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"method", "path", "status"},
)

func recordRequestDuration(ctx context.Context, method, path, status string, duration float64) {
	span := trace.SpanFromContext(ctx)
	traceID := span.SpanContext().TraceID().String()

	// Record metric with exemplar containing the trace ID
	requestDuration.WithLabelValues(method, path, status).(prometheus.ExemplarObserver).
		ObserveWithExemplar(duration, prometheus.Labels{
			"traceID": traceID,
		})
}
eBPF-Based Observability
eBPF (extended Berkeley Packet Filter) enables deep observability without application changes:
| Tool | Use Case | Data Captured |
|---|---|---|
| Cilium | Network observability | L3/L4/L7 metrics, network policies |
| Pixie | Auto-instrumentation | HTTP traces, SQL queries, no code changes |
| Tetragon | Security observability | Syscall audit, file access, process exec |
| bpftrace | Ad-hoc debugging | Custom kernel/user probes |
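As a taste of the ad-hoc style, a classic bpftrace one-liner counts syscalls per process name directly in the kernel (requires root and a kernel with BPF tracepoint support; Ctrl-C prints the aggregated map):

```bash
# Count syscalls by process name, with no application changes
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```

Because the aggregation happens in kernel space, the overhead stays low even on busy hosts.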
SLIs, SLOs, and Error Budgets
Service Level Indicators (SLIs) and Service Level Objectives (SLOs) translate observability data into actionable reliability targets:
| Component | Definition | Example |
|---|---|---|
| SLI | Measurable indicator of service level | P99 latency, error rate |
| SLO | Target value for an SLI | 99.9% requests < 200ms |
| Error Budget | Allowed unreliability | 0.1% = 43.2 min/month downtime |
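The error-budget column follows directly from the objective. A minimal sketch of the arithmetic, assuming a 30-day month as in the table:

```go
package main

import "fmt"

// errorBudgetMinutes returns the allowed downtime per window for a given
// availability objective (e.g. 99.9 leaves 0.1% of the window as budget).
func errorBudgetMinutes(objectivePercent, windowDays float64) float64 {
	windowMinutes := windowDays * 24 * 60 // 30 days = 43,200 minutes
	return windowMinutes * (100 - objectivePercent) / 100
}

func main() {
	fmt.Printf("99.9%%  over 30d: %.1f min\n", errorBudgetMinutes(99.9, 30))
	fmt.Printf("99.95%% over 30d: %.1f min\n", errorBudgetMinutes(99.95, 30))
}
```

Each extra nine halves or better the budget: 99.9% allows 43.2 minutes per 30-day month, 99.95% only 21.6.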
SLO Configuration Example
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: order-service-slo
spec:
  service: order-service
  slos:
    - name: availability
      objective: 99.95
      description: "Order service must be available 99.95% of the time"
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{
              service="order-service",
              status_code=~"5.."
            }[{{.window}}]))
          totalQuery: |
            sum(rate(http_requests_total{
              service="order-service"
            }[{{.window}}]))
Cost Optimization Strategies
| Strategy | Implementation | Cost Reduction |
|---|---|---|
| Label Management | Drop high-cardinality labels | 30% |
| Tiered Storage | Hot/Warm/Cold with Thanos | 50% |
| Log Filtering | Drop debug logs in production | 70% |
| Trace Sampling | Tail-based sampling for errors | 90% |
Tail-Based Sampling with OpenTelemetry
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always sample errors
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always sample slow requests
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000

      # Sample 10% of remaining traces
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
Conclusion
Building a comprehensive observability platform for microservices requires thoughtful architecture across all three pillars—metrics, logs, and traces. The key takeaways:
- Standardize on OpenTelemetry - Vendor-neutral instrumentation provides flexibility and avoids lock-in
- Correlate Everything - Use trace IDs, exemplars, and timestamps to seamlessly navigate between signals
- Implement SLOs - Transform raw data into actionable reliability targets with error budgets
- Adopt eBPF - Gain deep observability without application changes, especially for network and security
- Optimize for Cost - Use sampling, aggregation, and tiered storage to manage costs at scale
- Automate Response - Connect alerts to runbooks and automation for faster incident resolution