OTel Strangler Fig — CultureTech Playbook

Context

Most teams adopting OpenTelemetry come from one of two starting points:

  1. Custom structured logs on Loki/Elastic with a schema invented 5 years ago that nobody wants to touch.
  2. Closed APM (Datadog, New Relic) with deep vendor lock-in and a six-figure annual contract.

In both cases a big-bang migration to OTel is correct in theory but politically impossible. This playbook applies Martin Fowler’s Strangler Fig pattern specifically to OpenTelemetry adoption.

The Strangler Fig pattern

Fowler described the pattern in 2004 as a migration strategy: instead of rewriting the old system, grow a new system around it until the new one has taken over every function and the old one can be removed (as a strangler fig gradually overtakes its host tree).

Three phases:

  1. Parallel capture — new data is captured in both systems simultaneously. The old one remains the source of truth.
  2. Live comparison — dashboards from the new system are contrasted against the old. When matches are consistent, the team gains confidence.
  3. Progressive switchover — service by service, the new system becomes the source of truth. The old remains for historical queries until sunset.

Applied to OpenTelemetry

Phase 1 — Parallel capture

For each service in the catalog, add the OpenTelemetry SDK without removing the old logger. The SDK exports to a local OTel Collector that queues into Kafka (or equivalent).

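# otel-collector-config.yaml
# The kafka exporter ships in the Collector contrib distribution, not in core.
# No processors are configured here; see Common pitfalls for batching and memory limits.
receivers:
  otlp:                          # accept OTLP from the service SDKs
    protocols:
      grpc: {}                   # default port 4317
      http: {}                   # default port 4318
exporters:
  kafka:
    brokers: ["kafka-otel:9092"]
    topic: otel-traces-staging   # traces are queued here for the new backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [kafka]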

Cost: budget roughly 5-10 ms of added P99 latency per request and 1-2% CPU for a well-tuned SDK. A service that can’t tolerate that overhead isn’t a candidate for this phase yet and waits.
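
One way to check whether a service fits that budget is to compare P99 latency from the same load test with and without the SDK. The sketch below is a minimal illustration, not a benchmark harness: the function names, the nearest-rank percentile method and the 10 ms default budget (the upper end of the range quoted above) are assumptions.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct * n / 100 gives the 1-based nearest rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p99_overhead_ms(baseline_ms, instrumented_ms):
    """Added P99 latency after enabling the SDK (can be negative with noisy data)."""
    return percentile(instrumented_ms, 99) - percentile(baseline_ms, 99)


def fits_budget(baseline_ms, instrumented_ms, budget_ms=10.0):
    """True if the measured P99 overhead stays within the assumed budget."""
    return p99_overhead_ms(baseline_ms, instrumented_ms) <= budget_ms
```

Both sample lists should come from comparable traffic; a difference measured across different load profiles says little about the SDK.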

Validation: every request must appear in both systems. If a request is missing from OTel, the SDK isn’t instrumenting that path (or a sampler is dropping it). Investigate before moving on.
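
The validation rule can be sketched as a set comparison. This assumes both systems record a shared request identifier and that you can export the IDs seen in a given time window from each; `diff_capture` and `CaptureDiff` are hypothetical names for illustration, not part of any OTel library.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureDiff:
    missing_in_otel: frozenset[str]    # seen by the old system only
    missing_in_legacy: frozenset[str]  # seen by OTel only

    @property
    def complete(self) -> bool:
        """True when both systems saw exactly the same requests."""
        return not self.missing_in_otel and not self.missing_in_legacy


def diff_capture(legacy_ids, otel_ids) -> CaptureDiff:
    """Compare request IDs captured by each system over the same window."""
    legacy, otel = frozenset(legacy_ids), frozenset(otel_ids)
    return CaptureDiff(
        missing_in_otel=legacy - otel,
        missing_in_legacy=otel - legacy,
    )
```

IDs in `missing_in_otel` point at uninstrumented code paths. IDs in `missing_in_legacy` are worth a look too, since they suggest the two windows are not aligned or the old logger has gaps of its own.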

Phase 2 — Live comparison

Replicate the critical dashboards from the old system in the new (Grafana against ClickHouse or Tempo, depending on chosen destination). Each dashboard is the same business question answered by two sources.

Metrics to compare:

  • P50/P95/P99 latency per endpoint.
  • Error rate per status code.
  • Throughput per service.

If a metric doesn’t match within ±5%, there’s an instrumentation issue in OTel. Resolve before proceeding.
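
A minimal sketch of the ±5% check, assuming the gap is measured relative to the old system’s value (it is still the source of truth in this phase; the text above does not specify the baseline). The metric names and the dictionary shape are hypothetical.

```python
def relative_gap(legacy: float, otel: float) -> float:
    """Absolute difference as a fraction of the legacy value."""
    if legacy == 0:
        # No meaningful ratio: equal zeros match, anything else is a mismatch.
        return 0.0 if otel == 0 else float("inf")
    return abs(otel - legacy) / abs(legacy)


def mismatches(legacy: dict, otel: dict, tolerance: float = 0.05) -> dict:
    """Return {metric: gap} for every legacy metric outside the tolerance.

    A metric that is absent from the OTel side is reported with an infinite gap.
    """
    out = {}
    for name, legacy_value in legacy.items():
        if name not in otel:
            out[name] = float("inf")
            continue
        gap = relative_gap(legacy_value, otel[name])
        if gap > tolerance:
            out[name] = gap
    return out


legacy = {"p99_ms": 200.0, "error_rate": 0.02, "rps": 1000.0}
otel = {"p99_ms": 208.0, "error_rate": 0.03, "rps": 1000.0}
flagged = mismatches(legacy, otel)  # only error_rate exceeds the 5% tolerance
```

An empty result for every critical dashboard, sustained over time, is the signal that the service is ready for Phase 3.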

Phase 3 — Progressive switchover

Per service (not per dashboard), declare OTel as source of truth:

  1. Alerts migrate to OTel.
  2. Runbooks update to point at the new dashboards.
  3. The oncall team confirms they can operate with the new source.
  4. Only then disconnect the exporter to the old system.

The old system keeps receiving traffic only from services that haven’t migrated yet, and stays available for historical queries.
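
The four-step gate above can be tracked per service with something as small as the sketch below. `SwitchoverChecklist` and its field names are hypothetical; the point is only that step 4 stays blocked until the first three are done.

```python
from dataclasses import dataclass


@dataclass
class SwitchoverChecklist:
    service: str
    alerts_migrated: bool = False
    runbooks_updated: bool = False
    oncall_confirmed: bool = False

    def blockers(self):
        """Names of the steps still open, in the order the playbook lists them."""
        steps = [
            ("alerts_migrated", self.alerts_migrated),
            ("runbooks_updated", self.runbooks_updated),
            ("oncall_confirmed", self.oncall_confirmed),
        ]
        return [name for name, done in steps if not done]

    def can_disconnect_legacy_exporter(self):
        """Step 4 is allowed only once steps 1 to 3 are complete."""
        return not self.blockers()


# "checkout" is a made-up service name.
checkout = SwitchoverChecklist("checkout", alerts_migrated=True)
```

Here `checkout.blockers()` lists the runbook and oncall steps, so the exporter to the old system stays connected.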

Common pitfalls

  • Skipping Phase 2. It’s the longest and most boring phase — exactly why it gets skipped. Every time it’s been skipped, the first incident in the new system generates institutional doubt and the migration reverses.
  • Migrating dashboards before services. The dashboard is the output; the service is the source. Migrating a dashboard without a correctly instrumented service produces “data in the new system” that isn’t real.
  • Assuming otel-collector is plug-and-play. It isn’t. The default config doesn’t hold up at serious volume; you need batching (the batch processor), memory limits (the memory_limiter processor) and backpressure tuning (exporter queue and retry settings).

When NOT to use Strangler Fig

  • Greenfield. If the system is new, use OTel from day one. There is nothing to strangle.
  • Small systems (<10 services). The organizational overhead of three phases exceeds the benefit.

Are you migrating?

If your team is at this point and needs guidance: 30-minute assessment, free, no sales pitch.