Observability

By Frederick Lowe, May 22, 2026

This essay is the second of six in a series discussing Transformers and Post-Transformers. It establishes what "observability" means for software, for human thinking, and for LLM thinking, and why the first meets the definition while the second and third do not.

Definitions

In this writing, I use "thinking" and "reasoning" in two contexts. To disambiguate, I prefix either term with "human" or "machine" (or "LLM"). Where the prefix is absent, I mean human activity. The industry uses these terms fluidly, not because they are mechanistically related, but because they are graspable and loosely analogous.

Chain of Thought (CoT) is a trained behavior in which a model emits intermediate reasoning tokens before producing its final answer.

Observability, Defined

For the purpose of this essay, "observability" means the ability to obtain empirical, repeatable measurements that reveal the actual content of reasoning, not correlated signals, not self-report, not post-hoc explanations.

This definition of observability excludes inferable or deducible truth (guessing from symptoms) and Cartesian certainty (unattainable). It also excludes measurements that are only loosely coupled to the reasoning process itself.

Observability therefore looks different in software, in humans, and in LLMs.

How Transformers Work

Before talking about observability, it is useful to reflect on how transformers work. Briefly, for non-technical readers:

Given a context (your question), an LLM produces output text one token at a time. Each output token is the result of processing your context through the model's layers (a forward pass). That processing outputs a probability distribution over the model's vocabulary. A token is sampled from that distribution, added to the output, and the process repeats. Each new token is conditioned on all the tokens that came before, including the ones the model itself just produced. This loop is called autoregression.
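For readers who prefer code, the loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any production model is implemented: the five-word vocabulary, the function toy_forward_pass, and its scoring rule are all hypothetical stand-ins for a real model's layers. Only the shape of the loop (forward pass, probability distribution, sample, append, repeat) reflects the description.

```python
import math
import random

# Hypothetical five-token vocabulary; real models use tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "down", "<eos>"]

def toy_forward_pass(tokens):
    """Stand-in for a forward pass: returns one score (logit) per vocabulary entry.

    Assumption for illustration only: the score favors whichever token
    follows the last context token in VOCAB order.
    """
    last = VOCAB.index(tokens[-1]) if tokens and tokens[-1] in VOCAB else 0
    favored = (last + 1) % len(VOCAB)
    return [3.0 if i == favored else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Convert raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(context, max_new_tokens=8, seed=0):
    """Autoregressive loop: each sampled token is appended and conditions the next step."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    tokens = list(context)
    for _ in range(max_new_tokens):
        probs = softmax(toy_forward_pass(tokens))
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens
```

Calling generate(["the"]) returns the original context followed by up to eight sampled tokens. Note that the only thing the loop exposes is the sampled text; the computation inside the forward pass never appears in the output.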

Observability in Software

Well-written software continuously writes a log of its internal state: decisions, transitions, errors. Normal logs are boring, but they serve a critical purpose: monitoring tools consume them to confirm normal behavior and alert operators when something fails.

In good code with faithful, detailed telemetry, the logs will state why a failure occurred, down to the library, function, line, reason, and relevant preconditions. Even when a failure corrupts logging, we can:

inspect related system behavior (databases, network traces)
wrap the software in memory-tracing tools
write new diagnostic code to surface the cause

Unless the software has been deliberately altered to misrepresent its state (malware, malicious author), it is observable: the log entries are empirical, repeatable measurements of the reasoning that led to each state.
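As a concrete illustration of faithful telemetry, here is a minimal sketch using Python's standard logging module. The authorize function, the "payments_demo" logger name, and the ListHandler class are hypothetical, invented for this example. The point is only that each log line is written by the same branch of code that makes the decision, so the record and the reasoning cannot drift apart.

```python
import logging

class ListHandler(logging.Handler):
    """Hypothetical handler that keeps formatted log lines in memory for inspection."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        # Record the level, the function that logged, and the rendered message.
        self.lines.append(f"{record.levelname} {record.funcName}: {record.getMessage()}")

logger = logging.getLogger("payments_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

def authorize(amount, balance):
    """Toy decision function: every branch logs the preconditions that selected it."""
    logger.info("called amount=%s balance=%s", amount, balance)
    if amount <= 0:
        logger.error("rejected: non-positive amount=%s", amount)
        return False
    if amount > balance:
        logger.warning("declined: amount=%s exceeds balance=%s", amount, balance)
        return False
    logger.info("approved amount=%s remaining=%s", amount, balance - amount)
    return True
```

After authorize(50, 20), the last entry in handler.lines names the function, the outcome, and the exact values that caused it. Running the same call again produces the same entries, which is what "empirical, repeatable" means here.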

Observability in Human Thinking

A neuroscientist can see regional brain activation using fMRI. More invasive technologies increase resolution. A complete connectome would be invaluable, but it would be person-specific and invalidated by synaptic plasticity.

But none of this gives us observability of what a person is actually reasoning about. fMRI measures blood flow, not thought content. Self-report is unreliable: a person can say "I'm thinking about dinner" while thinking about something entirely unrelated. Reaction times and error rates are coarse proxies. There is no known or foreseeable measurement that lets us read out the propositional content of a human's internal monologue with any fidelity.

Thus human reasoning fails the definition of observability. We cannot empirically, repeatably measure the actual reasoning steps a person takes.

Observability in LLM Thinking

In September 2024, OpenAI introduced OpenAI o1, the first frontier LLM trained to produce extended reasoning traces before answering. o1 also brought Chain of Thought (CoT) traces to a commercial product: the model emits tokens about its own processing. Although the UI was hobbled (it showed summaries instead of raw tokens), the feature was commercially successful.

That success is unsurprising. People want to see what others think about them and the things they say. CoT traces read like an observer's assessment: "The user is asking a thoughtful question about X … I should respond with Y." And they feel like magic, because they operate on one of the deepest human needs: to see and be seen by another consciousness. But CoT tokens are not observable reasoning.

They are generated text about reasoning. Research shows systematic misalignment between CoT output and the model's latent reasoning. LLMs can reach conclusions the CoT does not reflect, use strategies the CoT omits, and contradict the CoT in their final answer.

Worse: commercial pressure yields models trained to produce sycophantic CoT that flatters the user and avoids conflict and insults (even when warranted). After all, LLMs are products, and no one pays $200/mo to be called stupid in a "thinking" block.

Thus LLM thinking also fails the definition of observability. The tokens we see are not empirical, repeatable measurements of internal reasoning; they are plausible narratives, shaped by training that rewards inoffensive interaction, unmoored from the actual computation.

Conclusion

Uncompromised software is observable: its logs give us direct, faithful access to its decision-making. Human thinking is not observable: we have no instrument that reads out the content of a person's reasoning, and probably never will. LLM thinking is also not observable: CoT traces are at best informative and at worst sycophantic. Whether we should trust CoT despite its unobservability is the subject of next week's essay.