The prediction machine

LLMs are prediction machines. Hallucinations aren't bugs — they're the expected failure mode. Here's what that means for how you work with them.

Every conversation about LLMs eventually lands on hallucinations. Someone pastes a confidently wrong answer into Slack, someone else replies “this is why I don’t trust AI,” and the thread devolves into familiar territory. But if you’ve spent real time building with these models, you know the framing is wrong. Hallucinations aren’t some mysterious failure mode. They’re the most predictable thing about the system.

An LLM is a prediction machine. It takes a sequence of tokens and predicts the next one. Then it takes that prediction, appends it to the sequence, and predicts again. That’s the entire mechanism. The architecture that made this possible — the transformer, introduced in Attention Is All You Need — uses self-attention to weigh the relevance of every token in the input when predicting the next one. Every answer you’ve ever gotten from a language model — the brilliant ones and the catastrophically wrong ones — came from this same process: a probability distribution over what token comes next, conditioned on everything before it.
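That loop is small enough to sketch. Below is a toy bigram “model” built from an invented ten-word corpus; a real LLM replaces the count table with a transformer, but the generation loop has the same shape:

```python
from collections import Counter, defaultdict

# Toy bigram "model": count which token follows which in a tiny,
# invented corpus. A real LLM replaces this table with a transformer,
# but the loop below (predict, append, predict again) is the same shape.
corpus = "the model predicts the next token and the loop repeats".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Greedily pick the most frequent continuation of the last token."""
    counts = follows[token]
    return max(counts, key=counts.get) if counts else None

sequence = ["the"]
for _ in range(4):
    nxt = predict_next(sequence[-1])
    if nxt is None:
        break
    sequence.append(nxt)  # the prediction becomes part of the context

print(" ".join(sequence))
```

Even at this scale the failure mode is visible: the greedy decoder drifts into “the model predicts the model”, fluent-looking output with no mechanism checking whether it’s true.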

When the model tells you that a Python function needs a self parameter, it’s because that pattern is overwhelmingly represented in its training data. When it invents a library that doesn’t exist, it’s because the surrounding context made that prediction statistically plausible. The mechanism didn’t change. The distribution was wrong.

Hallucinations aren’t bugs. They’re the expected failure mode of a system that has no way to verify its own output.

Chain-of-thought doesn’t change the game

You might think chain-of-thought prompting changes this. Models that “think step by step” seem qualitatively different — they break problems down, show their work, arrive at answers through what looks like reasoning. And the outputs do improve. But the mechanism is identical.

When a model writes out intermediate steps, it’s generating tokens that shift the probability distribution for subsequent tokens. “Let me think about this step by step” is a prompt that makes the next tokens more likely to be structured, sequential, and internally consistent. The intermediate tokens act as a kind of scratchpad — they change what the model predicts next by changing the context window.
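A toy illustration of that claim, with every probability invented: treat the “model” as a lookup table from context string to next-token distribution. Appending scratchpad tokens changes the key, and with it the distribution the final answer is drawn from.

```python
# Toy illustration only: the "model" is a lookup table from context
# string to next-token distribution, and every probability is invented.
# Appending scratchpad tokens changes the context, and with it the
# distribution the final answer is drawn from.
bare = "Q: 17 * 4 = "
scratch = bare + "10*4 + 7*4 = 40 + 28, so "

distributions = {
    bare:    {"68": 0.40, "72": 0.32, "64": 0.28},  # nearly flat
    scratch: {"68": 0.93, "72": 0.04, "64": 0.03},  # sharply peaked
}

for context, dist in distributions.items():
    answer = max(dist, key=dist.get)
    print(f"{context!r} -> {answer} (p={dist[answer]:.2f})")
```

No reasoning module appeared between the two rows. The only thing that changed was the content of the context window.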

This is a real and useful technique. It meaningfully improves accuracy on tasks that benefit from decomposition. But calling it “reasoning” obscures what’s happening. The model isn’t reasoning its way to an answer. It’s generating tokens that make the final answer tokens more likely to be correct. Apple’s research team demonstrated this clearly in The Illusion of Thinking — they tested reasoning models on puzzles with controllable complexity and found three regimes: at low complexity, standard models outperform “thinking” models; at medium complexity, the extra tokens help; and at high complexity, both collapse completely. The thinking models didn’t develop generalizable reasoning. They hit a wall, and — counterintuitively — started reducing their reasoning effort as problems got harder, even with token budget to spare.

The distinction matters because it tells you exactly when chain-of-thought will help (structured, decomposable problems) and when it won’t (problems requiring information the model doesn’t have, or problems beyond a complexity threshold that the model can’t brute-force with token generation).

The confidence problem

Here’s where the training pipeline makes things worse. RLHF — reinforcement learning from human feedback — optimizes models for responses that humans rate as helpful. Humans rate confident, complete answers higher than hedged or uncertain ones. “I don’t know” rarely wins in the training signal.

The result is a system that’s been specifically optimized to sound confident, even when the underlying probability distribution is nearly flat. The model might be almost equally split between three possible answers, but it’ll deliver one of them with the authoritative tone of a senior engineer who’s done this a hundred times.
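You can see the gap between distribution and delivery in a few lines. The logits here are invented; the point is that a near-flat distribution still collapses into a single, flatly asserted answer:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three candidate completions with nearly identical (invented) logits:
# the model is close to indifferent between them.
candidates = ["TCP port 179", "TCP port 180", "UDP port 179"]
probs = softmax([1.02, 1.00, 0.98])

# Decoding still emits exactly one candidate, and the surface text
# carries no trace of how flat the distribution underneath it was.
prob, answer = max(zip(probs, candidates))
print(f"p(chosen answer) = {prob:.2f}")
print(f"BGP peers on {answer}.")
```

The chosen answer happens to be right (BGP does use TCP 179), but with barely a third of the probability mass behind it, the prose would have sounded exactly as certain if it weren’t.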

This isn’t an inherent property of the architecture. It’s a design choice in the training pipeline. Kalai et al. lay this out precisely in Why Language Models Hallucinate — hallucinations originate as errors in binary classification, and they persist because evaluations are graded like exams where guessing beats leaving the answer blank. You could train a model that says “I’m uncertain” more often — but it would score worse on benchmarks, and benchmarks drive funding. The incentive structure rewards guessing over epistemic honesty.

When a model sounds confident, that tells you nothing about whether it’s correct. Confidence is a property of the training objective, not a signal of accuracy.

The lights are on but nobody’s home

Prediction machines lack things that you take for granted in any human collaborator. Not as a philosophical claim — as a practical engineering constraint.

No metacognition. The model doesn’t know what it doesn’t know. It can’t distinguish between a domain where its training data was rich and one where it was sparse. It can’t flag “I’ve never seen a problem like this before.” You get the same fluent, confident output whether the model is operating in its sweet spot or way outside its depth. This is the core problem: the lights are on but nobody’s home. There’s no internal process evaluating the quality of its own output.

No persistent memory. Each session starts from zero. The model doesn’t remember that it gave you a wrong answer yesterday, doesn’t learn from corrections within a conversation (beyond the context window), doesn’t accumulate experience. Every interaction is a fresh roll of the probability dice conditioned on whatever context you’ve provided.
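That statelessness is visible in the shape of chat APIs: there is no session state on the model side, only the message list you re-send every turn. A minimal sketch, with the model call stubbed out:

```python
# Minimal sketch of a stateless chat loop. "The conversation" exists
# only client-side, as the message list re-sent on every turn; the
# fake model below stands in for a real API call.
def fake_model(messages):
    """Echo how much context the model was actually given."""
    return f"I can see {len(messages)} message(s) of context."

history = []

def ask(user_message):
    history.append({"role": "user", "content": user_message})
    reply = fake_model(history)  # the model sees only this list, nothing else
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("hello"))         # sees 1 message of context
print(ask("still there?"))  # sees 3: history grows only because we keep it
```

Delete `history` and the “relationship” is gone. Everything the model appears to remember is context you chose to carry forward.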

No intrinsic motivation. There’s no curiosity, no drive, no agenda. The model doesn’t want to solve your problem. It doesn’t care if the answer is right. It’s completing a sequence. This sounds obvious when stated plainly, but it’s easy to forget when you’re three hours into a pair-programming session and the model is being genuinely helpful. The helpfulness is a pattern in the training data, not an intention.

None of this is a moral or philosophical argument. It’s a spec sheet. When you understand the constraints, you stop being surprised by the failures and start designing around them.

Why it works anyway

Given all of this, it’s remarkable that LLMs work as well as they do. The explanation isn’t mysterious either: natural language encodes far more structure than anyone assumed.

When you write a sentence about configuring a BGP peering session, that sentence carries implicit knowledge about network topology, protocol state machines, vendor CLI syntax, and operational best practices. Language isn’t a thin veneer over human knowledge — it’s a dense encoding of it. The training process compressed an enormous corpus of human communication into statistical patterns, and those patterns capture genuine structure about how things work.

This is why LLMs can generate working code they’ve never “seen” — the patterns of correct code, correct reasoning, and correct domain knowledge are embedded in the statistical relationships between tokens. The model doesn’t understand BGP, but it’s absorbed enough structured text about BGP to produce outputs that are frequently correct.

But the compression is lossy: it preserves common patterns well and loses fidelity in the long tail. Ask it to explain BGP path selection and you’ll get a solid answer. Ask for the exact attributes of a specific route-map command on the latest IOS XE release and you’re testing the edges of what the compression retained. That’s not a practical way to assess what these models can do — it’s like judging a JPEG by zooming in on individual pixels.

The input matters more than people realize. Ask a model to “generate a bedtime story for my daughter” and you’ll get one of maybe three versions it converges on — despite having seen thousands of children’s books during training. But extend your query with character descriptions, a setting, and a conflict, and it’ll produce something genuinely good. The richer the input context, the more of that compressed knowledge you unlock. This is why treating an LLM as a question-answering oracle is the wrong mental model. It’s a prediction machine — and the quality of its predictions is directly proportional to the quality of what you feed it.

Agentic workflows as the practical bridge

The raw model can’t verify its output, can’t learn from mistakes, can’t access current information. But you can build systems around it that compensate for every one of these limitations.

Give the model a code execution environment and it can test its own suggestions. Give it retrieval tools and it can ground its answers in real data. Put a human in the loop and you get the metacognition the model lacks. The agentic pattern — observe, plan, act, evaluate — creates a feedback mechanism that the raw model doesn’t have.

Two concrete examples of this in practice:

  • Grounded retrieval via MCP. Instead of relying on what the model remembers about a library’s API, you give it a tool that fetches current documentation on demand. With something like Context7 MCP, an instruction like “use context7 to look up the docs for this library” makes the model fetch up-to-date reference material and use that as context for its answer — not its compressed, potentially stale training data. RAG and tool-based retrieval solve the lossy compression problem by bringing curated knowledge into the context window at inference time.
  • The agentic correction loop. Claude Code writing a Python script doesn’t stop at generating the code. It runs the linter — if there are errors, it corrects them. It executes the script — if the output is wrong, it iterates. This write → lint → run → validate → correct cycle is a form of external metacognition. The model still can’t evaluate its own output internally, but the environment provides the feedback it needs to converge on a correct result.
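The cycle is easy to sketch. This isn’t Claude Code’s actual implementation, just the shape of the loop, with the model call stubbed out and Python’s byte-compiler standing in for a linter:

```python
import os
import subprocess
import sys
import tempfile

def generate_fix(source: str, feedback: str) -> str:
    """Stub for the model call: a real loop would send the source plus
    the error output back to the LLM and return revised code."""
    raise NotImplementedError

def correction_loop(source: str, max_iterations: int = 3) -> str:
    for _ in range(max_iterations):
        fd, path = tempfile.mkstemp(suffix=".py")
        os.close(fd)
        with open(path, "w") as f:
            f.write(source)
        # "Lint": byte-compile without executing to catch syntax errors.
        lint = subprocess.run([sys.executable, "-m", "py_compile", path],
                              capture_output=True, text=True)
        if lint.returncode != 0:
            source = generate_fix(source, lint.stderr)   # lint failed -> correct
            continue
        # Run and validate the actual output.
        run = subprocess.run([sys.executable, path],
                             capture_output=True, text=True)
        if run.returncode == 0:
            return run.stdout                            # converged
        source = generate_fix(source, run.stderr)        # run failed -> correct
    raise RuntimeError("did not converge within budget")

print(correction_loop("print(2 + 2)"))
```

The stub raises deliberately: the interesting part isn’t the model call, it’s that every branch of the loop is driven by feedback from the environment rather than by the model’s own self-assessment.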

This is why the trajectory of AI tooling has moved so aggressively toward agents. Not because models got dramatically smarter (though they have improved), but because wrapping a prediction machine in an execution loop with external feedback transforms what it can do. The model provides the language understanding and pattern matching. The environment provides the ground truth.

The context engineering angle

There’s a related skill here: structuring the context you feed the model so its predictions are more likely to be correct. Project constitutions, explicit coding standards, curated reference docs — these aren’t nice-to-have documentation practices. They’re engineering the probability distribution. The better the context, the better the predictions. I’ve written about the practical side of this in building a system for AI-assisted engineering.
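In code, “engineering the probability distribution” can be as mundane as assembling the context window from curated files. The file names below are hypothetical; the pattern is the point:

```python
from pathlib import Path

# Hypothetical file names: the point is that the context window is
# assembled deliberately from curated sources, not left to whatever
# the model happens to remember from training.
CONTEXT_SOURCES = ["CONSTITUTION.md", "CODING_STANDARDS.md", "docs/reference.md"]

def build_context(task: str, root: Path = Path(".")) -> str:
    """Concatenate whichever curated sources exist, then the task itself."""
    sections = []
    for name in CONTEXT_SOURCES:
        path = root / name
        if path.exists():
            sections.append(f"## {name}\n{path.read_text()}")
    sections.append(f"## Task\n{task}")
    return "\n\n".join(sections)
```

Every curated source that exists gets prepended ahead of the task, so the model’s next-token predictions are conditioned on your standards rather than on its training-data average.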

Where your value actually is

If an LLM can produce working code in seconds, what’s the point of being an engineer?

The answer is everything the prediction machine can’t do. You carry a mental model of your entire system — not the syntax of a single function, but how data flows through it, where the failure modes are, what the operational characteristics look like at 3 AM on a Saturday. The model sees the current context window. You see the whole picture.

You have metacognition. When the model produces a confident answer about a database migration, you know whether that answer accounts for your production traffic patterns, your SLA constraints, your team’s operational capacity. You know what you don’t know, and you know when to dig deeper before committing. The model doesn’t have that ability — it’ll suggest a migration strategy with the same confidence whether your database is 10 GB or 10 TB.

You have judgment about what to build. The model can help you build anything you describe — but it can’t tell you what’s worth building. It can’t see the gap in your product, the operational pain point that’s costing your team hours a week, the architectural decision that’ll save you six months of rework. Vision and taste are human inputs. The model is a construction crew, not an architect.

The bar for engineering value shifts upward, and that’s a good thing. The mechanical parts of the job — remembering syntax, writing boilerplate, translating requirements into initial code — those are increasingly handled by the prediction machine. What’s left is the work that was always the hard part: system design, failure analysis, judgment calls under uncertainty, and the ability to look at a confident answer and say “no, that’s wrong, and here’s why.”

The engineers who thrive aren’t the ones who can write code faster than a model. They’re the ones who know when the model is wrong — and have the depth to explain why.
