Microservices Observability Architecture: The Three Pillars and Beyond

Build comprehensive observability for microservices with distributed tracing, metrics, and logs, plus emerging eBPF-based observability, using OpenTelemetry, Prometheus, Jaeger, and Grafana.

Gonnect Team
January 11, 2025 · 17 min read

Tags: OpenTelemetry, Prometheus, Grafana, Jaeger, eBPF, Loki

Introduction

In the world of distributed systems, understanding what happens inside your applications has never been more challenging—or more critical. As organizations transition from monolithic architectures to microservices, traditional monitoring approaches fall short. You cannot simply look at a single server's CPU or memory to understand system health. Instead, you need observability—the ability to understand the internal state of your system by examining its external outputs.

This article explores the comprehensive architecture for implementing observability in microservices environments, covering:

  1. The Three Pillars - Metrics, Logs, and Traces
  2. OpenTelemetry - The unified standard for instrumentation
  3. Correlation Strategies - Connecting signals across pillars
  4. eBPF-Based Observability - The next frontier
  5. SLIs, SLOs, and Error Budgets - Reliability engineering
  6. Cost Optimization - Scaling observability economically

Key Insight: Observability is not just about collecting data—it's about asking arbitrary questions of your system without having to know those questions in advance.

The Three Pillars of Observability

The observability ecosystem is built on three complementary signals, each providing unique insights into system behavior:

| Pillar | What It Captures | Best For | Questions Answered |
|--------|------------------|----------|--------------------|
| Metrics | Numeric measurements (counters, gauges, histograms) | Alerting, dashboards, capacity planning | What is happening? (Real-time health) |
| Logs | Discrete events with context | Debugging, audit, compliance | Why did it happen? (Root cause) |
| Traces | Request flow across services | Performance analysis, dependency mapping | Where did it happen? (Request flow) |

Understanding Each Pillar

Metrics provide aggregated, time-series data perfect for dashboards and alerting. They answer "what" questions with low cardinality.

Logs capture discrete events with rich context. They answer "why" questions when debugging issues.

Traces follow requests across service boundaries. They answer "where" questions about distributed flows.

Observability Platform Architecture

A production-grade observability platform requires careful orchestration of multiple components:

(Diagram: Microservices Architecture)

Key Components

| Layer | Components | Purpose |
|-------|------------|---------|
| Collection | OTel Collector, Prometheus Scraper, Fluent Bit | Gather telemetry from applications |
| Processing | Kafka, Spark Streaming | Buffer, enrich, and transform data |
| Storage | Prometheus, Thanos, Loki, Tempo | Store metrics, logs, and traces |
| Visualization | Grafana, Jaeger UI | Query and visualize data |
| Alerting | Alertmanager, PagerDuty | Notify on-call teams |

OpenTelemetry: The Unified Standard

OpenTelemetry has emerged as the industry standard for instrumenting applications. It provides vendor-neutral APIs, SDKs, and tools for generating telemetry data.

(Diagram: OpenTelemetry for AI/ML)

OpenTelemetry Components

| Component | Function |
|-----------|----------|
| API | Defines interfaces for instrumentation |
| SDK | Implements the API with providers and exporters |
| Collector | Receives, processes, and exports telemetry |
| Auto-instrumentation | Zero-code instrumentation for frameworks |
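
To make the Collector's role concrete, here is a minimal pipeline configuration sketch. The backend endpoints (`tempo:4317`, the Prometheus remote-write URL) are placeholders for your own deployment, not prescribed values:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Protect the Collector from unbounded memory growth
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  # Batch telemetry before export to reduce backend load
  batch:
    timeout: 5s

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
```

The same receive/process/export shape applies to logs as well; you would add a `logs` pipeline with, for example, a Loki exporter.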

Manual Instrumentation (Go)

package main

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

var tracer trace.Tracer

func init() {
    tracer = otel.Tracer("order-service")
}

func ProcessOrder(ctx context.Context, order Order) error {
    ctx, span := tracer.Start(ctx, "ProcessOrder",
        trace.WithAttributes(
            attribute.String("order.id", order.ID),
            attribute.String("order.customer_id", order.CustomerID),
            attribute.Float64("order.total", order.Total),
        ),
    )
    defer span.End()

    // Validate order
    if err := validateOrder(ctx, order); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "validation failed")
        return err
    }

    // Process payment (with child span)
    if err := processPayment(ctx, order); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "payment failed")
        return err
    }

    span.SetStatus(codes.Ok, "order processed successfully")
    return nil
}

Context Propagation Across Services

// HTTP Client with context propagation
func callInventoryService(ctx context.Context, items []Item) error {
    ctx, span := tracer.Start(ctx, "CheckInventory",
        trace.WithSpanKind(trace.SpanKindClient),
    )
    defer span.End()

    req, err := http.NewRequestWithContext(ctx, "POST",
        "http://inventory-service/check",
        marshalItems(items))
    if err != nil {
        span.RecordError(err)
        return err
    }

    // Inject trace context into HTTP headers
    otel.GetTextMapPropagator().Inject(ctx,
        propagation.HeaderCarrier(req.Header))

    resp, err := httpClient.Do(req)
    if err != nil {
        span.RecordError(err)
        return err
    }
    defer resp.Body.Close()

    span.SetAttributes(attribute.Int("http.status_code", resp.StatusCode))
    return nil
}
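
The `Inject` call above writes the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`) onto the outgoing request, which the downstream service's propagator extracts to continue the trace. As a rough illustration of what travels on the wire, this stdlib-only sketch parses such a header; `parseTraceparent` is a hypothetical helper for demonstration, not part of the OTel SDK (in real code the propagator handles this for you):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header of the form
// "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>" into its fields.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, fmt.Errorf("malformed traceparent: %q", h)
	}
	// The 01 flag bit means the upstream service sampled this trace.
	return parts[1], parts[2], parts[3] == "01", nil
}

func main() {
	tid, sid, sampled, err := parseTraceparent(
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Println(tid, sid, sampled)
}
```

Because the trace ID survives the hop, spans emitted by the inventory service join the same trace in Jaeger or Tempo.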

Collector Deployment Patterns

(Diagram: Collector Deployment Topology)
| Pattern | Pros | Cons | Use Case |
|---------|------|------|----------|
| Sidecar | Isolation, per-app config | Resource overhead | Multi-tenant, strict isolation |
| DaemonSet | Efficient, shared resources | Shared config | Homogeneous workloads |
| Gateway | Centralized processing | Single point of failure | External traffic, edge processing |
| Combined | Best of both worlds | Complexity | Large-scale production |
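
For the DaemonSet pattern, a trimmed Kubernetes manifest looks roughly like the sketch below. The image tag, namespace, and ConfigMap name are illustrative placeholders; pin a specific Collector version in production:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      containers:
        - name: otel-agent
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel/agent.yaml"]
          resources:
            limits:
              memory: 512Mi  # keep in sync with the memory_limiter processor
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-agent-config
```

One agent per node keeps network hops local; in the combined pattern these agents forward to a gateway deployment for centralized processing.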

Metrics Collection Pipeline

A robust metrics pipeline ensures reliable data collection and storage:

Prometheus Recording Rules

groups:
  - name: order_service_rules
    interval: 30s
    rules:
      # Pre-aggregate request rate by service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, method, status_code)

      # Calculate error rate
      - record: service:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # P99 latency by service
      - record: service:http_latency:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
          )
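
Recording rules pay off when alerting: alerts can reference the cheap pre-aggregated series instead of re-evaluating the raw query. A sketch of an alerting rule built on the recorded error-rate series above (the 1% threshold and severity label are illustrative choices, not recommendations):

```yaml
groups:
  - name: order_service_alerts
    rules:
      - alert: HighErrorRate
        # Uses the pre-aggregated series from the recording rules above
        expr: service:http_errors:rate5m{service="order-service"} > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for {{ $labels.service }}"
```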

Log Aggregation Architecture

(Diagram: Grafana Loki Architecture)

Structured Logging Best Practices

// kv() is net.logstash.logback.argument.StructuredArguments.kv (statically imported)
@Service
public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public Order processOrder(OrderRequest request) {
        MDC.put("orderId", request.getOrderId());
        MDC.put("customerId", request.getCustomerId());
        MDC.put("traceId", Span.current().getSpanContext().getTraceId());

        try {
            log.info("Processing order",
                kv("orderTotal", request.getTotal()),
                kv("itemCount", request.getItems().size()),
                kv("paymentMethod", request.getPaymentMethod())
            );

            Order order = createOrder(request);

            log.info("Order processed successfully",
                kv("orderId", order.getId()),
                kv("status", order.getStatus()),
                kv("processingTimeMs", order.getProcessingTime())
            );

            return order;
        } catch (PaymentException e) {
            log.error("Payment processing failed",
                kv("errorCode", e.getErrorCode()),
                kv("errorMessage", e.getMessage()),
                e
            );
            throw e;
        } finally {
            MDC.clear();
        }
    }
}

Correlating Traces, Metrics, and Logs

The true power of observability emerges when you can seamlessly navigate between pillars using correlation keys:

| Correlation Key | Purpose |
|-----------------|---------|
| Trace ID | Links all signals for a single request |
| Service Name | Groups signals by service |
| Timestamp | Aligns signals in time windows |
| Request ID | Application-level correlation |
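
In Grafana, trace-ID correlation from logs can be wired up with a derived field on the Loki data source. A provisioning sketch, assuming JSON logs that carry a `traceId` field and a Tempo data source with UID `tempo` (both assumptions; adjust to your setup). Note the doubled `$$`, which escapes variable expansion in provisioning files:

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceId":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo
```

With this in place, a log line in Grafana's Explore view gains a button that jumps straight to the corresponding trace.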

Exemplar-Based Correlation

// Recording exemplars in Go
var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration with exemplars",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "path", "status"},
)

func recordRequestDuration(ctx context.Context, method, path, status string, duration float64) {
    span := trace.SpanFromContext(ctx)
    traceID := span.SpanContext().TraceID().String()

    // Record metric with exemplar containing trace ID
    requestDuration.WithLabelValues(method, path, status).(prometheus.ExemplarObserver).
        ObserveWithExemplar(duration, prometheus.Labels{
            "traceID": traceID,
        })
}

eBPF-Based Observability

eBPF (extended Berkeley Packet Filter) enables deep observability without application changes:

| Tool | Use Case | Data Captured |
|------|----------|---------------|
| Cilium | Network observability | L3/L4/L7 metrics, network policies |
| Pixie | Auto-instrumentation | HTTP traces, SQL queries, no code changes |
| Tetragon | Security observability | Syscall audit, file access, process exec |
| bpftrace | Ad-hoc debugging | Custom kernel/user probes |

SLIs, SLOs, and Error Budgets

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) translate observability data into actionable reliability targets:

| Component | Definition | Example |
|-----------|------------|---------|
| SLI | Measurable indicator of service level | P99 latency, error rate |
| SLO | Target value for an SLI | 99.9% of requests < 200ms |
| Error Budget | Allowed unreliability | 0.1% ≈ 43.2 min/month of downtime |

SLO Configuration Example

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: order-service-slo
spec:
  service: order-service
  slos:
    - name: availability
      objective: 99.95
      description: "Order service must be available 99.95% of time"
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{
              service="order-service",
              status_code=~"5.."
            }[{{.window}}]))
          totalQuery: |
            sum(rate(http_requests_total{
              service="order-service"
            }[{{.window}}]))
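
Tools like Sloth generate multiwindow burn-rate alerts from the spec above. For intuition, a hand-written sketch of a fast-burn alert for the 99.95% objective (error budget 0.0005). The 14.4 multiplier is the conventional fast-burn threshold from the Google SRE Workbook; the `rate1h` series is assumed to be defined analogously to the `rate5m` recording rule shown earlier:

```yaml
groups:
  - name: order-service-burn-rate
    rules:
      - alert: OrderServiceFastBurn
        # 14.4x burn rate would exhaust a 30-day error budget in ~2 days;
        # requiring both windows avoids paging on short spikes.
        expr: |
          service:http_errors:rate5m{service="order-service"} > (14.4 * 0.0005)
          and
          service:http_errors:rate1h{service="order-service"} > (14.4 * 0.0005)
        for: 2m
        labels:
          severity: page
```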

Cost Optimization Strategies

| Strategy | Implementation | Typical Cost Reduction |
|----------|----------------|------------------------|
| Label Management | Drop high-cardinality labels | up to 30% |
| Tiered Storage | Hot/Warm/Cold with Thanos | up to 50% |
| Log Filtering | Drop debug logs in production | up to 70% |
| Trace Sampling | Tail-based sampling for errors | up to 90% |

Tail-Based Sampling with OpenTelemetry

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always sample errors
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always sample slow requests
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000

      # Sample 10% of remaining traces
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

Conclusion

Building a comprehensive observability platform for microservices requires thoughtful architecture across all three pillars—metrics, logs, and traces. The key takeaways:

  1. Standardize on OpenTelemetry - Vendor-neutral instrumentation provides flexibility and avoids lock-in

  2. Correlate Everything - Use trace IDs, exemplars, and timestamps to seamlessly navigate between signals

  3. Implement SLOs - Transform raw data into actionable reliability targets with error budgets

  4. Adopt eBPF - Gain deep observability without application changes, especially for network and security

  5. Optimize for Cost - Use sampling, aggregation, and tiered storage to manage costs at scale

  6. Automate Response - Connect alerts to runbooks and automation for faster incident resolution

Further Reading