Microservices Observability Architecture: The Three Pillars and Beyond
Build comprehensive observability for microservices with distributed tracing, metrics, logs, and emerging eBPF-based techniques using OpenTelemetry, Prometheus, Jaeger, and Grafana.
Introduction
In the world of distributed systems, understanding what happens inside your applications has never been more challenging—or more critical. As organizations transition from monolithic architectures to microservices, traditional monitoring approaches fall short. You cannot simply look at a single server's CPU or memory to understand system health. Instead, you need observability—the ability to understand the internal state of your system by examining its external outputs.
This article explores the comprehensive architecture for implementing observability in microservices environments, covering:
- The Three Pillars - Metrics, Logs, and Traces
- OpenTelemetry - The unified standard for instrumentation
- Correlation Strategies - Connecting signals across pillars
- eBPF-Based Observability - The next frontier
- SLIs, SLOs, and Error Budgets - Reliability engineering
- Cost Optimization - Scaling observability economically
Key Insight: Observability is not just about collecting data—it's about asking arbitrary questions of your system without having to know those questions in advance.
The Three Pillars of Observability
The observability ecosystem is built on three complementary signals, each providing unique insights into system behavior:
| Pillar | What It Captures | Best For | Questions Answered |
|---|---|---|---|
| Metrics | Numeric measurements (counters, gauges, histograms) | Alerting, dashboards, capacity planning | What is happening? (Real-time health) |
| Logs | Discrete events with context | Debugging, audit, compliance | Why did it happen? (Root cause) |
| Traces | Request flow across services | Performance analysis, dependency mapping | Where did it happen? (Request flow) |
Understanding Each Pillar
Metrics provide aggregated, time-series data perfect for dashboards and alerting. They answer "what" questions cheaply, as long as label cardinality stays low.
Logs capture discrete events with rich context. They answer "why" questions when debugging issues.
Traces follow requests across service boundaries. They answer "where" questions about distributed flows.
Observability Platform Architecture
A production-grade observability platform requires careful orchestration of multiple components:
(Diagram: microservices observability architecture)
Key Components
| Layer | Components | Purpose |
|---|---|---|
| Collection | OTel Collector, Prometheus Scraper, Fluent Bit | Gather telemetry from applications |
| Processing | Kafka, Spark Streaming | Buffer, enrich, and transform data |
| Storage | Prometheus, Thanos, Loki, Tempo | Store metrics, logs, and traces |
| Visualization | Grafana, Jaeger UI | Query and visualize data |
| Alerting | Alertmanager, PagerDuty | Notify on-call teams |
OpenTelemetry: The Unified Standard
OpenTelemetry has emerged as the industry standard for instrumenting applications. It provides vendor-neutral APIs, SDKs, and tools for generating telemetry data.
OpenTelemetry Components
| Component | Function |
|---|---|
| API | Defines interfaces for instrumentation |
| SDK | Implements API with providers and exporters |
| Collector | Receives, processes, and exports telemetry |
| Auto-instrumentation | Zero-code instrumentation for frameworks |
Manual Instrumentation (Go)
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer trace.Tracer

func init() {
	tracer = otel.Tracer("order-service")
}

func ProcessOrder(ctx context.Context, order Order) error {
	ctx, span := tracer.Start(ctx, "ProcessOrder",
		trace.WithAttributes(
			attribute.String("order.id", order.ID),
			attribute.String("order.customer_id", order.CustomerID),
			attribute.Float64("order.total", order.Total),
		),
	)
	defer span.End()

	// Validate order
	if err := validateOrder(ctx, order); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "validation failed")
		return err
	}

	// Process payment (with child span)
	if err := processPayment(ctx, order); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "payment failed")
		return err
	}

	span.SetStatus(codes.Ok, "order processed successfully")
	return nil
}
Context Propagation Across Services
// HTTP client with context propagation.
// propagation comes from go.opentelemetry.io/otel/propagation.
func callInventoryService(ctx context.Context, items []Item) error {
	ctx, span := tracer.Start(ctx, "CheckInventory",
		trace.WithSpanKind(trace.SpanKindClient),
	)
	defer span.End()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"http://inventory-service/check",
		marshalItems(items))
	if err != nil {
		span.RecordError(err)
		return err
	}

	// Inject trace context into HTTP headers
	otel.GetTextMapPropagator().Inject(ctx,
		propagation.HeaderCarrier(req.Header))

	resp, err := httpClient.Do(req)
	if err != nil {
		span.RecordError(err)
		return err
	}
	defer resp.Body.Close()

	span.SetAttributes(attribute.Int("http.status_code", resp.StatusCode))
	return nil
}
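Under the hood, the W3C Trace Context propagator serializes the span context into a `traceparent` header of the form `version-traceid-spanid-flags`. A stdlib-only sketch of that wire format (the real `propagation.TraceContext` also validates versions and carries `tracestate`; the helper names here are illustrative):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// buildTraceparent formats a W3C traceparent value: version "00",
// a 16-byte trace ID and 8-byte span ID in hex, and sampling flags.
func buildTraceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

// parseTraceparent recovers the IDs on the receiving service, which is
// how a child span ends up in the same trace as its parent.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, ok bool) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, false
	}
	return parts[1], parts[2], parts[3] == "01", true
}

func main() {
	req, _ := http.NewRequest("POST", "http://inventory-service/check", nil) // static URL, error ignored
	// Inject: roughly what the propagator does for traceparent.
	req.Header.Set("traceparent",
		buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true))
	// Extract on the server side restores the same IDs.
	tid, sid, sampled, ok := parseTraceparent(req.Header.Get("traceparent"))
	fmt.Println(tid, sid, sampled, ok)
}
```

Because the header is plain text, any hop that forwards HTTP headers unchanged (proxies, gateways) preserves the trace.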
Collector Deployment Patterns
(Diagram: collector deployment topology)
| Pattern | Pros | Cons | Use Case |
|---|---|---|---|
| Sidecar | Isolation, per-app config | Resource overhead | Multi-tenant, strict isolation |
| DaemonSet | Efficient, shared resources | Shared config | Homogeneous workloads |
| Gateway | Centralized processing | Single point of failure | External traffic, edge processing |
| Combined | Best of both worlds | Complexity | Large-scale production |
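As a sketch of the combined pattern, a per-node agent batches locally and forwards to a central gateway, which runs the expensive processing (tail sampling, redaction) once. A minimal agent-side Collector config under that assumption (the gateway Service name is illustrative):

```yaml
# Agent (DaemonSet) config: receive locally, protect memory, batch, forward.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
  batch:
    timeout: 5s
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc:4317  # central gateway (illustrative)
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Keeping agents to receive-batch-forward makes them cheap per node, while gateway replicas can be scaled independently of the workloads.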
Metrics Collection Pipeline
A robust metrics pipeline ensures reliable data collection and storage:
Prometheus Recording Rules
groups:
  - name: order_service_rules
    interval: 30s
    rules:
      # Pre-aggregate request rate by service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, method, status_code)

      # Error ratio (5xx / total), not a plain rate
      - record: service:http_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # P99 latency by service
      - record: service:http_latency:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
          )
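The `histogram_quantile` call above estimates the quantile by locating the cumulative bucket that contains the target rank and interpolating linearly inside it. A stdlib-only sketch of that estimation (bucket boundaries are illustrative, and the real function additionally handles the mandatory `+Inf` bucket):

```go
package main

import "fmt"

// bucket mirrors one cumulative histogram bucket: count of observations <= le.
type bucket struct {
	le    float64 // upper bound in seconds
	count float64 // cumulative count
}

// quantile performs the linear interpolation that histogram_quantile
// applies to cumulative buckets.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	prevLE, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			// The target rank falls in this bucket: interpolate within it.
			return prevLE + (b.le-prevLE)*(rank-prevCount)/(b.count-prevCount)
		}
		prevLE, prevCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// 100 requests: 60 under 100ms, 30 more under 250ms, 10 more under 1s.
	buckets := []bucket{{0.1, 60}, {0.25, 90}, {1, 100}}
	fmt.Printf("p99 ≈ %.3fs\n", quantile(0.99, buckets))
}
```

This is also why bucket layout matters: the estimate can never be more precise than the width of the bucket the quantile lands in.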
Log Aggregation Architecture
(Diagram: Grafana Loki log aggregation architecture)
Structured Logging Best Practices
// kv() is StructuredArguments.kv from logstash-logback-encoder;
// Span is io.opentelemetry.api.trace.Span.
@Service
public class OrderService {

    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public Order processOrder(OrderRequest request) {
        MDC.put("orderId", request.getOrderId());
        MDC.put("customerId", request.getCustomerId());
        MDC.put("traceId", Span.current().getSpanContext().getTraceId());
        try {
            log.info("Processing order",
                kv("orderTotal", request.getTotal()),
                kv("itemCount", request.getItems().size()),
                kv("paymentMethod", request.getPaymentMethod())
            );

            Order order = createOrder(request);

            log.info("Order processed successfully",
                kv("orderId", order.getId()),
                kv("status", order.getStatus()),
                kv("processingTimeMs", order.getProcessingTime())
            );
            return order;
        } catch (PaymentException e) {
            log.error("Payment processing failed",
                kv("errorCode", e.getErrorCode()),
                kv("errorMessage", e.getMessage()),
                e
            );
            throw e;
        } finally {
            MDC.clear();
        }
    }
}
Correlating Traces, Metrics, and Logs
The true power of observability emerges when you can seamlessly navigate between pillars using correlation keys:
| Correlation Key | Purpose |
|---|---|
| Trace ID | Links all signals for a single request |
| Service Name | Groups signals by service |
| Timestamp | Aligns signals in time windows |
| Request ID | Application-level correlation |
Exemplar-Based Correlation
// Recording exemplars in Go
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request duration with exemplars",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"method", "path", "status"},
)

func recordRequestDuration(ctx context.Context, method, path, status string, duration float64) {
	span := trace.SpanFromContext(ctx)
	traceID := span.SpanContext().TraceID().String()

	// Record metric with exemplar containing the trace ID
	requestDuration.WithLabelValues(method, path, status).(prometheus.ExemplarObserver).
		ObserveWithExemplar(duration, prometheus.Labels{
			"traceID": traceID,
		})
}
eBPF-Based Observability
eBPF (extended Berkeley Packet Filter) enables deep observability without application changes:
| Tool | Use Case | Data Captured |
|---|---|---|
| Cilium | Network observability | L3/L4/L7 metrics, network policies |
| Pixie | Auto-instrumentation | HTTP traces, SQL queries, no code changes |
| Tetragon | Security observability | Syscall audit, file access, process exec |
| bpftrace | Ad-hoc debugging | Custom kernel/user probes |
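As a taste of the ad-hoc style, a classic bpftrace one-liner counts syscalls per process name directly in the kernel (requires root and a kernel with BPF tracepoint support; Ctrl-C prints the aggregated map):

```bash
# Count syscalls by process name, with no application changes
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```

Because the aggregation happens in kernel space, the overhead stays low even on busy hosts.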
SLIs, SLOs, and Error Budgets
Service Level Indicators (SLIs) and Service Level Objectives (SLOs) translate observability data into actionable reliability targets:
| Component | Definition | Example |
|---|---|---|
| SLI | Measurable indicator of service level | P99 latency, error rate |
| SLO | Target value for an SLI | 99.9% requests < 200ms |
| Error Budget | Allowed unreliability | 0.1% = 43.2 min/month downtime |
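The error-budget column follows directly from the objective. A minimal sketch of the arithmetic, assuming a 30-day month as in the table:

```go
package main

import "fmt"

// errorBudgetMinutes returns the allowed downtime per window for a given
// availability objective (e.g. 99.9 leaves 0.1% of the window as budget).
func errorBudgetMinutes(objectivePercent, windowDays float64) float64 {
	windowMinutes := windowDays * 24 * 60 // 30 days = 43,200 minutes
	return windowMinutes * (100 - objectivePercent) / 100
}

func main() {
	fmt.Printf("99.9%%  over 30d: %.1f min\n", errorBudgetMinutes(99.9, 30))
	fmt.Printf("99.95%% over 30d: %.1f min\n", errorBudgetMinutes(99.95, 30))
}
```

Each extra nine halves or better the budget: 99.9% allows 43.2 minutes per 30-day month, 99.95% only 21.6.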
SLO Configuration Example
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: order-service-slo
spec:
  service: order-service
  slos:
    - name: availability
      objective: 99.95
      description: "Order service must be available 99.95% of the time"
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{
              service="order-service",
              status_code=~"5.."
            }[{{.window}}]))
          totalQuery: |
            sum(rate(http_requests_total{
              service="order-service"
            }[{{.window}}]))
Cost Optimization Strategies
| Strategy | Implementation | Cost Reduction |
|---|---|---|
| Label Management | Drop high-cardinality labels | 30% |
| Tiered Storage | Hot/Warm/Cold with Thanos | 50% |
| Log Filtering | Drop debug logs in production | 70% |
| Trace Sampling | Tail-based sampling for errors | 90% |
Tail-Based Sampling with OpenTelemetry
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always sample errors
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always sample slow requests
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000

      # Sample 10% of remaining traces
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
Conclusion
Building a comprehensive observability platform for microservices requires thoughtful architecture across all three pillars—metrics, logs, and traces. The key takeaways:
- Standardize on OpenTelemetry - Vendor-neutral instrumentation provides flexibility and avoids lock-in
- Correlate Everything - Use trace IDs, exemplars, and timestamps to seamlessly navigate between signals
- Implement SLOs - Transform raw data into actionable reliability targets with error budgets
- Adopt eBPF - Gain deep observability without application changes, especially for network and security
- Optimize for Cost - Use sampling, aggregation, and tiered storage to manage costs at scale
- Automate Response - Connect alerts to runbooks and automation for faster incident resolution