The Observability System I Almost Shipped

I finished a PRD on Sunday for the LLM observability and evaluation stack we’re adding to Revelara. Five layers, twenty-one user stories, async eval scoring, drift detection, the whole thing. I thought it looked great. I was happy with it. Before I started on the implementation, I ran it through our new STPA review tool, mostly because I didn’t trust how happy I was with it and dogfooding seemed like the best test.

It came back with seven findings and six loss scenarios. Every single one was about the observability system failing to actually observe.

The evaluator would have scored responses users never received. The drift detector would have been blind for 24 hours after every deploy. Eval scores would have raced the OTel batch exporter and lost. And the drift alerts (which were architecturally beautiful) would have fired into a database table that nothing was watching.

None of these were bugs in the way most engineers use the word. Every component, in isolation, worked. The failures lived in the spaces between components. Assumptions one piece made about another piece that nobody had ever written down. Feedback paths that didn’t exist. Timing dependencies that nobody had negotiated.

This is the failure mode that traditional reliability tools structurally cannot find, and it is most of what hurts us in production.

A short detour through STPA

Systems-Theoretic Process Analysis was developed by Nancy Leveson at MIT (her book Engineering a Safer World is the canonical reference, and the 2018 STPA Handbook is the practical one). It came out of aerospace, where you cannot wait to learn from failure because the failure is a smoking crater. Google SRE adapted it for software in 2021 and reported that two engineers working part-time over five months found design defects whose fixes would have prevented at least four major incidents.

The method is structurally simple. Model your system as controllers, controlled processes, control actions flowing down, and feedback flowing up. For every control action, ask four questions. What if it isn’t provided? What if it’s provided incorrectly? What if the timing is wrong? What if the duration is wrong? Trace each unsafe answer to a concrete scenario.

That’s it. The hard part isn’t the framework. The hard part is asking all four questions about every action and accepting “I don’t know” as a finding worth writing down.
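The enumeration step is mechanical enough to sketch in code. This is a minimal illustration, not the review tool itself: the ControlAction and Prompt types, the Enumerate function, and the two example actions are all hypothetical names chosen for this sketch.

```go
package main

import "fmt"

// ControlAction is one arrow in the control structure: a controller
// issuing a command to a controlled process. (Hypothetical type.)
type ControlAction struct {
	Controller string
	Action     string
	Process    string
}

// guideQuestions are the four questions from the text, asked of every action.
var guideQuestions = []string{
	"What if it isn't provided?",
	"What if it's provided incorrectly?",
	"What if the timing is wrong?",
	"What if the duration is wrong?",
}

// Prompt is one question the reviewer must answer, or record as "I don't know".
type Prompt struct {
	Action   ControlAction
	Question string
}

// Enumerate returns the full cross product of actions and guide questions,
// so that no action/question pair can be skipped silently.
func Enumerate(actions []ControlAction) []Prompt {
	prompts := make([]Prompt, 0, len(actions)*len(guideQuestions))
	for _, a := range actions {
		for _, q := range guideQuestions {
			prompts = append(prompts, Prompt{Action: a, Question: q})
		}
	}
	return prompts
}

func main() {
	// Illustrative control actions, loosely based on the PRD in this post.
	actions := []ControlAction{
		{Controller: "chat handler", Action: "run evaluators", Process: "eval scorer"},
		{Controller: "drift detector", Action: "fire alert", Process: "on-call engineer"},
	}
	for _, p := range Enumerate(actions) {
		fmt.Printf("%s -> %s (%s): %s\n",
			p.Action.Controller, p.Action.Process, p.Action.Action, p.Question)
	}
}
```

Two actions produce eight prompts. The point of generating the list rather than writing it by hand is that the uncomfortable pairs show up whether or not anyone wants to answer them.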

What it does that FMEA and FTA cannot is treat the system as a system. Failure modes and fault trees both assume accidents come from broken components. Modern software (and especially anything with an LLM in the loop) mostly fails in the other direction: every component works, the interaction kills you. Causal analysis, which is what most AI-for-SRE tooling is doing under the hood right now, is still looking for the broken link in a chain of events. When the failure isn’t in a chain but in the shape of the graph, there is no broken link to find. Leveson has been saying this for twenty years. The industry has mostly responded by buying more observability tools, and more recently, by pointing LLMs at those tools.

The PRD

The system I was reviewing does dual-export OTel tracing (Cloud Trace for ops, Langfuse for LLM quality), runs two evaluators asynchronously after each chat response (citation grounding, which checks that incident short names cited in a response actually exist in the database; tool efficiency, which scores iterations-to-answer and dedup behavior), wraps every tool call in a child span, and runs a drift detector hourly that compares a 24-hour window against a 7-day baseline.

The design was good. I would have built it. Here are four of the seven things STPA caught that I wouldn’t have, ordered roughly by how badly each one would have embarrassed me later.

Finding 1: The evaluator would have scored responses users never received

The chat response streams to the user over WebSocket. The user is on a flaky mobile connection. 70% of the tokens make it through, then the connection drops. The user sees a truncated answer that cuts off before the citations. From their seat, they got garbage.

Back on the server, generateWithTools() has the full response in memory. It hands the complete text to RunEvaluators(). The citation evaluator finds all four cited incidents in the full text, verifies them against the database, and scores it 1.0. That score lands in eval.scores. The numbers say this user got a perfect answer.

The bigger problem is what this does to the baseline. On mobile-heavy traffic we’d be persisting phantom-good scores for responses that were never actually delivered. The 7-day baseline drifts upward, anchored by answers nobody saw. When real quality regresses later, the drift detector compares against an inflated baseline and needs a bigger drop to trip. The system’s definition of normal silently ratchets in the wrong direction.

This is Goodhart’s Law dressed up in microservices. The eval score was supposed to measure response quality. It actually measures generation quality, which we treated as interchangeable because that’s the shape of the data flowing into the controller. The WebSocket layer has information the evaluator needs, and there is no path between them.

The fix is not subtle: RunEvaluators() needs a delivery status signal from the WebSocket layer, and partial deliveries either get filtered out of the baseline or get scored separately. That is a new field on the eval input struct, maybe an afternoon of work. The cost of catching this after a quarter of inflated baselines, by contrast, is a data cleanup, a baseline rebuild, and a much harder conversation with whoever was looking at the dashboards, on top of possibly months of poor user interactions before anyone discovers there’s a problem.
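A minimal sketch of what that field could look like, assuming the WebSocket layer can report whether the full response reached the client. DeliveryStatus, EvalInput, ScoredResponse, and BaselineMean are hypothetical names for illustration, not the actual types in the codebase.

```go
package main

import "fmt"

// DeliveryStatus is a hypothetical signal reported by the WebSocket layer.
type DeliveryStatus int

const (
	DeliveryUnknown DeliveryStatus = iota
	DeliveryComplete
	DeliveryPartial
)

// EvalInput is an assumed shape for what RunEvaluators() receives.
type EvalInput struct {
	ResponseText string
	Delivery     DeliveryStatus // the new field
}

// ScoredResponse pairs a score with the delivery status it was computed under.
type ScoredResponse struct {
	Score    float64
	Delivery DeliveryStatus
}

// BaselineMean averages only the scores of fully delivered responses.
// It returns ok=false when nothing is eligible to be averaged.
func BaselineMean(scores []ScoredResponse) (mean float64, ok bool) {
	var sum float64
	var n int
	for _, s := range scores {
		if s.Delivery != DeliveryComplete {
			continue // partial or unknown deliveries never anchor the baseline
		}
		sum += s.Score
		n++
	}
	if n == 0 {
		return 0, false
	}
	return sum / float64(n), true
}

func main() {
	in := EvalInput{ResponseText: "truncated before the citations", Delivery: DeliveryPartial}
	fmt.Println("eligible for baseline:", in.Delivery == DeliveryComplete)

	scores := []ScoredResponse{
		{Score: 1.0, Delivery: DeliveryComplete},
		{Score: 1.0, Delivery: DeliveryPartial}, // generated in full, delivered in part
		{Score: 0.5, Delivery: DeliveryComplete},
	}
	if mean, ok := BaselineMean(scores); ok {
		fmt.Printf("baseline mean over delivered responses: %.2f\n", mean)
	}
}
```

Treating DeliveryUnknown the same as DeliveryPartial is a deliberate choice in this sketch: if the feedback path is missing, the score should not be allowed to move the baseline.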

Finding 2: The drift detector would have been blind for 24 hours after every deploy

The detector runs hourly. It computes the mean over the last 24 hours and compares against the 7-day baseline. Reasonable.

Now picture a deploy. An engineer ships a prompt change at noon. The config_hash attribute on every new trace reflects the new prompt. But the trailing 24-hour window still contains roughly 23 hours of old-prompt scores and a few minutes of new-prompt scores. The old data dominates the mean. No drift detected. It takes a full day for the window to clear, which is a full day during which a regression introduced by the deploy is invisible.

That’s not the whole of it, and the second half is the part I’m a little embarrassed I didn’t catch on my own. The 7-day baseline blends scores from multiple prompt versions. A prompt that scored 0.9 for five days followed by a regression to 0.6 for two days produces a baseline of about 0.81. The bad version “only” looks 0.21 below baseline. The severity calculation systematically understates what actually happened, because the baseline is contaminated by the very thing we’re trying to detect.
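The arithmetic behind both halves is easy to check. This sketch assumes uniform traffic, so that time spent on each prompt version can stand in for the number of scores, and it picks one hour after the deploy as an illustrative moment for the 24-hour window. BlendedMean is a hypothetical helper.

```go
package main

import "fmt"

// BlendedMean returns the duration-weighted mean of two score levels.
// Under uniform traffic this equals the mean over all individual scores.
func BlendedMean(oldScore, oldDur, newScore, newDur float64) float64 {
	return (oldScore*oldDur + newScore*newDur) / (oldDur + newDur)
}

func main() {
	// 24-hour window one hour after a deploy: 23 hours of old prompt, 1 hour of new.
	window := BlendedMean(0.9, 23, 0.6, 1)

	// 7-day baseline from the example: five days at 0.9, two days at 0.6.
	baseline := BlendedMean(0.9, 5, 0.6, 2)

	fmt.Printf("24h window mean one hour after deploy: %.3f\n", window)
	fmt.Printf("7-day blended baseline:                %.3f\n", baseline)
	fmt.Printf("apparent drop vs blended baseline:     %.3f\n", baseline-0.6)
	fmt.Printf("true drop vs previous version:         %.3f\n", 0.9-0.6)
}
```

The window mean comes out near 0.89, barely below the old prompt’s 0.9, and the blended baseline comes out near 0.81, which is where the 0.21 figure comes from. The real drop between versions is 0.30.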

Every individual choice in the design was defensible. 24 hours is a fine window. Seven days is a fine baseline. Two sigma is a fine threshold. The interaction between those choices and a normal deployment cadence produced a detector that couldn’t see the most dangerous window.

The fix is to treat config_hash as a partition key, not a passive attribute. New version means new baseline, and we compare the new version’s first hundred responses against the previous version’s established baseline. A bad prompt gets caught within minutes against a clean baseline, instead of a day later against a baseline it has already polluted.
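A minimal sketch of the partitioned comparison, under two loudly stated simplifications: it uses a fixed maximum drop instead of the two-sigma threshold, and it works on an in-memory slice rather than the eval.scores table. Score, MeanByConfig, and Regressed are hypothetical names.

```go
package main

import "fmt"

// Score is one evaluator result tagged with the config version that produced it.
type Score struct {
	ConfigHash string
	Value      float64
}

// MeanByConfig groups scores by ConfigHash and returns per-version means and counts.
func MeanByConfig(scores []Score) (means map[string]float64, counts map[string]int) {
	sums := map[string]float64{}
	counts = map[string]int{}
	for _, s := range scores {
		sums[s.ConfigHash] += s.Value
		counts[s.ConfigHash]++
	}
	means = make(map[string]float64, len(sums))
	for h, sum := range sums {
		means[h] = sum / float64(counts[h])
	}
	return means, counts
}

// Regressed reports whether the new version's mean sits more than maxDrop
// below the previous version's mean, once at least minSamples new scores exist.
func Regressed(scores []Score, prevHash, newHash string, minSamples int, maxDrop float64) bool {
	means, counts := MeanByConfig(scores)
	if counts[newHash] < minSamples || counts[prevHash] == 0 {
		return false // not enough evidence yet
	}
	return means[prevHash]-means[newHash] > maxDrop
}

func main() {
	var scores []Score
	for i := 0; i < 500; i++ {
		scores = append(scores, Score{ConfigHash: "v1", Value: 0.9})
	}
	for i := 0; i < 100; i++ {
		scores = append(scores, Score{ConfigHash: "v2", Value: 0.6})
	}
	// Compare the new version's first hundred responses against the old baseline.
	fmt.Println("regressed:", Regressed(scores, "v1", "v2", 100, 0.1))
}
```

Because the two versions are never averaged together, the comparison sees the full gap between them, and the old version’s baseline stays untouched by the new version’s scores.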

Finding 3: Eval scores would have raced the span exporter and lost, intermittently, forever

This one bothered me more than the others.

Async eval means the goroutine attaches eval.score to the parent chat.generate_with_tools span some seconds after the handler returns. The span itself ends when the handler ends. The OTel SDK has its own batch export timing (default five seconds) that decides when spans actually ship to Langfuse.

Pick those three timings out of a hat and you can see the race forming. If the batch export fires before the eval goroutine finishes, the span ships to Langfuse with no eval data. When the goroutine eventually calls span.AddEvent(), the event lands on a span handle whose data has already been sent. Langfuse displays the trace, but the score is missing. The trace looks unevaluated even though it was scored. Just too late.

The failure is intermittent. Sometimes eval finishes in two seconds and beats the batch. Sometimes it takes eight seconds (the verification queries hit the DB, latency varies) and loses. The engineer opens Langfuse to investigate a reported bad response, sees more than half the traces in the window have no scores, and quietly stops trusting Langfuse. The natural next move is to fall back to raw SQL on eval.scores, which defeats the entire reason we wired up Langfuse in the first place.

The PRD assumed span events could be added anytime before export. The OTel contract does not actually promise that. So the design implicitly required a synchronization primitive between two async producers and one timed consumer, and we had not built one.

We picked the simplest fix: give the eval goroutine its own child span that it owns end-to-end, decoupled from the parent’s lifecycle. Two other options were on the table (the eval goroutine takes ownership of the parent span, or the parent defers export until eval signals completion with a timeout fallback), but I’d rather not put the chat handler in the business of managing OTel span lifecycle. That’s a separation-of-concerns argument I would lose in code review eventually.
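The difference between the two designs can be shown with a toy model. To be clear about the assumptions: Span and Tracer below are stand-ins written for this sketch, not the OpenTelemetry API; the model covers only the losing side of the race, where anything added after the parent span has ended is dropped; the eval step runs sequentially instead of in a real goroutine so the outcome is deterministic; and the child span name eval.run is made up.

```go
package main

import "fmt"

// Span is a toy stand-in for a tracing span, NOT the OpenTelemetry API.
// It models one rule: once a span has ended, later events are dropped.
type Span struct {
	Name   string
	Events []string
	ended  bool
}

// AddEvent records an event unless the span has already ended.
func (s *Span) AddEvent(name string) {
	if s.ended {
		return // too late: this span's data is considered final
	}
	s.Events = append(s.Events, name)
}

// End marks the span as finished.
func (s *Span) End() { s.ended = true }

// Tracer collects every span it starts so the example can inspect them.
type Tracer struct{ Spans []*Span }

// Start creates and registers a new span. Parent linkage is omitted here.
func (t *Tracer) Start(name string) *Span {
	s := &Span{Name: name}
	t.Spans = append(t.Spans, s)
	return s
}

// HandleChat ends the parent span when the handler returns, then runs the
// evaluator the way a background goroutine would: after the parent is closed.
func HandleChat(t *Tracer, ownSpan bool) {
	parent := t.Start("chat.generate_with_tools")
	parent.End() // handler returns

	// Later, in the eval goroutine:
	if ownSpan {
		child := t.Start("eval.run") // the evaluator owns this span end-to-end
		child.AddEvent("eval.score")
		child.End()
		return
	}
	parent.AddEvent("eval.score") // silently dropped in this model
}

func main() {
	for _, ownSpan := range []bool{false, true} {
		tr := &Tracer{}
		HandleChat(tr, ownSpan)
		for _, s := range tr.Spans {
			fmt.Printf("ownSpan=%v span=%s events=%v\n", ownSpan, s.Name, s.Events)
		}
	}
}
```

With ownSpan set to false the score never appears on any span. With it set to true the score lives on a span whose lifetime the evaluator controls, so the handler’s timing no longer matters.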

The bigger lesson here is that we are building a tool to investigate bad responses, and the tool itself was structurally set up to lie to us intermittently. STPA didn’t catch a bug in the eval scorer. It caught a bug in the system that reports on the eval scorer. That’s the kind of thing tests don’t find because the components are all behaving correctly.

Finding 4: The drift alerts would have fired into a table nobody was watching

I laughed when this one came up.

The drift detector design was clean. Statistical thresholds, minimum sample sizes, severity classification, a dedicated eval.drift_alerts table. The team had clearly thought about when to fire an alert.

Nowhere in the PRD was there a sentence about how the alert reaches a human.

When the detector fires, it inserts a row into PostgreSQL and attaches a span event to the current trace in Langfuse. That’s it. No Slack message. No PagerDuty page. No Google Cloud Monitoring policy. The engineer who deployed the offending prompt checks Cloud Trace, sees no errors, normal latency, and moves on. Three days later they find out about the regression because a customer files a ticket.

In STPA terms, the human controller furthest from the loss has no feedback from the detection layer. We designed the detection system with care and forgot to design the notification system at all. How often does this happen, where detection gets scoped as an “eval module” concern, notification gets scoped as an “infrastructure” concern, and nobody owns the edge between them? All the time. I should know better and I still missed it.

The fix is small. A Prometheus counter (eval_drift_alerts_total{evaluator, severity}) that increments when an alert fires. A Google Cloud Monitoring policy on the counter, routed to our existing channel. Zero new infrastructure. The whole thing is maybe an hour.
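A minimal sketch of the counter side of that fix. In practice this would presumably be a labeled counter from a Prometheus client library; the hand-rolled DriftAlertCounter below is a stdlib-only stand-in so the example runs on its own, and its type and method names are hypothetical. Only the metric name and its two labels come from the design above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// labelKey identifies one time series of the counter.
type labelKey struct{ Evaluator, Severity string }

// DriftAlertCounter is a stand-in for a labeled Prometheus counter.
type DriftAlertCounter struct {
	mu     sync.Mutex
	counts map[labelKey]uint64
}

// NewDriftAlertCounter returns an empty counter.
func NewDriftAlertCounter() *DriftAlertCounter {
	return &DriftAlertCounter{counts: map[labelKey]uint64{}}
}

// Inc records one fired alert. The intent is to call it in the same code
// path that inserts the eval.drift_alerts row.
func (c *DriftAlertCounter) Inc(evaluator, severity string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[labelKey{evaluator, severity}]++
}

// Render returns the series in Prometheus-style text form, sorted for
// stable output.
func (c *DriftAlertCounter) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	lines := make([]string, 0, len(c.counts))
	for k, v := range c.counts {
		lines = append(lines, fmt.Sprintf(
			"eval_drift_alerts_total{evaluator=%q,severity=%q} %d",
			k.Evaluator, k.Severity, v))
	}
	sort.Strings(lines)
	return strings.Join(lines, "\n")
}

func main() {
	alerts := NewDriftAlertCounter()
	alerts.Inc("citation_grounding", "high")
	alerts.Inc("citation_grounding", "high")
	alerts.Inc("tool_efficiency", "low")
	fmt.Println(alerts.Render())
}
```

The counter by itself still tells nobody anything. It only closes the loop once an alerting policy watches it and routes to a channel a human actually reads.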

What I keep coming back to is that every component would have worked. The detector would have detected. The database would have stored. The trace would have shown the event to anyone who happened to open it. And the system, taken end to end, still would have been useless. We would have shipped a quality alarm wired to nothing.

What’s actually going on here

Four findings, four different surfaces, the same shape underneath.

In every case, a controller was acting on a process model that was subtly wrong. The evaluator equated “generated” with “delivered.” The drift detector treated a continuous time window as a sound abstraction over a discontinuous deployment history. The eval goroutine assumed it could still attach events to a span after that span had ended. The notification layer assumed a database insert counted as telling somebody.

None of those assumptions are wrong in principle. They’re wrong in this specific context because of something happening in an adjacent component that the controller had no feedback from. That’s the entire premise of the method. Accidents arise from interactions among correctly-functioning components that violate system-level constraints. STPA makes the interactions visible by forcing you to enumerate every control action and every feedback path, and to ask whether the feedback actually answers the question the action requires.

Unit tests don’t catch this because the units are all fine. Integration tests don’t catch this because the bugs only surface under timing or failure conditions that integration tests don’t usually stage. Code review doesn’t catch this because the assumptions live across files (sometimes across services) and reviewers see one piece at a time. Monitoring doesn’t catch this because three of the four findings would have rendered monitoring itself unreliable.

What STPA did was make me sit with questions I’d been subconsciously avoiding. How does the evaluator know the response was actually delivered? It doesn’t. What happens to the 7-day baseline when we deploy a prompt change? It gets contaminated. If the drift detector fires at 3 AM, how does the on-call engineer know? They don’t. That last one (the answer being literally “they don’t”) is the kind of question I would happily have left unasked, because it is embarrassing for someone with 20+ years in SRE to have left a critical alerting chain unwired.

The value of the method isn’t automated detection. The value is that the question template is mechanical enough that you can’t quietly skip the questions you don’t want to answer, or the ones you don’t know the answer to.

Where this leaves me

Needless to say, I changed the PRD. All seven findings are either addressed in the design now or tracked as constraints I’ll enforce during implementation. The PRD is better because the tool ran. The tool ran because we’d already built it. That’s the loop we’re trying to make cheap for everyone else.

The bigger point is how much of the reliability work I’ve ever done has been organized around the assumption that bugs come from broken components. Most of the worst incidents I’ve been involved in came from interactions, not failures. Every component was doing its job. The system was producing an unacceptable outcome anyway.

That’s the gap Revelara is built to close. Systems-theoretic risk analysis catching problems you didn’t know to look for, applied to your design docs and your codebase, correlated against incident patterns from the broader public corpus, surfacing the interaction hazards before they ship.

The part I find more interesting is the internal loop. Every review produces structured artifacts. Control structures, unsafe control actions, loss scenarios, findings, accepted risks. Those aren’t one-time outputs. They feed a knowledge base of how your specific systems actually fail, where your feedback paths are thin, what risks you’re knowingly carrying, and (most usefully) which of those risks later got validated by a real incident. Each new review runs with that knowledge in context. The control structure from last quarter’s billing redesign gets cross-referenced against the one you’re about to ship. A UCA that someone flagged six months ago and closed as “accepted risk” resurfaces the next time a related design touches the same boundary. The system gets smarter the longer you use it, because your organization’s specific failure topology is being written down in a form that compounds instead of getting lost in a doc folder. The same loop runs across the broader corpus too, so patterns your peers have already discovered start showing up in your reviews without you having to go find them. Most review tools are stateless. This one isn’t, and over time that’s the part that matters.

If you’re building something with enough moving parts that the last few incidents came as a surprise, check us out at revelara.ai. Pick a design doc you’re nervous about. We’ll show you what the method finds.

The seven things in my own PRD would have all become incidents eventually. I’m a lot more interested in finding them before they happen than writing postmortems about them.