Notes on our observability stack
The current observability stack is three tools deep: a metrics store,
a log aggregator, and a tracing pipeline. We tried four others over
the last two years and dropped each for reasons that surprised us at
the time but look obvious in hindsight.
What we kept
Metrics: Prometheus plus a long-term store. Logs: a self-hosted Loki
cluster. Traces: an OpenTelemetry pipeline routed to a hosted backend
with three months of full-fidelity retention.
What we cut
Two APM tools (they overlapped with traces, and billing got out of
hand at our cardinality), a synthetic monitoring SaaS (replaced by a
home-grown probe network running on the edge fleet), and a profiler
(useful, but nobody on the team checked the dashboards).
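
For readers wondering what a single probe in a home-grown network might look like, here is a minimal sketch. It is not our actual implementation: the names `ProbeResult`, `http_fetch`, and `run_probe` are hypothetical, and it assumes a probe is just a timed HTTP GET whose result gets shipped somewhere else. Scheduling, fan-out across the edge fleet, and export to the metrics store are all left out.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    """Outcome of one probe run (hypothetical shape, for illustration)."""
    target: str
    ok: bool
    status: Optional[int]
    latency_ms: float
    error: Optional[str] = None


def http_fetch(url: str, timeout_s: float) -> int:
    """Issue a GET and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; keep the code so the probe can record it.
        return exc.code


def run_probe(
    url: str,
    fetch: Callable[[str, float], int] = http_fetch,
    timeout_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> ProbeResult:
    """Time a single fetch and classify it as healthy or not."""
    start = clock()
    status: Optional[int] = None
    error: Optional[str] = None
    try:
        status = fetch(url, timeout_s)
        ok = 200 <= status < 400
    except Exception as exc:  # timeouts, DNS failures, connection resets
        ok = False
        error = repr(exc)
    latency_ms = (clock() - start) * 1000.0
    return ProbeResult(url, ok, status, latency_ms, error)
```

Passing `fetch` and `clock` as parameters is a design choice for the sketch: it lets the probe logic be tested without touching the network, and it would let the same function wrap a TCP or DNS check instead of HTTP.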