The prediction machine
The models are good now. The engineering discipline is managing certainty vs uncertainty in their output — knowing when to trust, when to verify, and how to push the certainty floor higher.
Late 2025 was the inflection point. Agentic coding crossed from “mostly works” to “actually works.” I write most of my code through agents now. Not as an experiment — as default workflow. That trust was earned through engineering, not faith.
But “good” doesn’t mean “certain.” And that’s the whole game.
An LLM is a prediction machine. It takes a sequence of tokens and predicts the next one. Some of those predictions are near-certain — common patterns, well-represented training data. Others are coin flips — long-tail knowledge, novel combinations, information that postdates the training cutoff. The model works the same way in both cases. The mechanism doesn’t change. The probability distribution does.
The certainty spectrum
Next-token prediction creates a spectrum. On one end: the model predicting that a Python method needs self, or that a for loop in Go uses range. These patterns are so deeply represented in the training data that the prediction is almost deterministic. On the other end: the exact flags for a CLI tool released last month, or the correct import path for a library that reorganized its package structure between versions.
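The two ends of the spectrum can be made concrete with Shannon entropy over a next-token distribution. This is a sketch with made-up probabilities, not real model output, but the shape is right: the near-deterministic case sits near zero bits, the long-tail case near a uniform coin flip.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Illustrative numbers, not real logprobs.
# After "def method(", a Python-trained model is near-certain about "self".
high_certainty = {"self": 0.97, "cls": 0.02, "s": 0.01}

# Guessing flags for a CLI released after the training cutoff: close to uniform.
low_certainty = {"--verbose": 0.22, "--force": 0.20, "--dry-run": 0.20,
                 "--output": 0.19, "--config": 0.19}

print(entropy(high_certainty))  # near 0 bits: almost deterministic
print(entropy(low_certainty))   # near log2(5), about 2.3 bits: a guess
```

The mechanism is identical in both calls; only the distribution differs, which is the point the text makes.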
The problem isn’t that models are uncertain. Uncertainty is expected from a compression of human knowledge. The problem is that RLHF trained models to sound equally confident across the entire spectrum. You hear the same authoritative tone whether the model is recalling something it’s seen ten million times or inventing a CLI flag that doesn’t exist.
The model can’t tell you where it is on the spectrum. That’s your job.
When the model sounds confident, that tells you nothing about whether it’s correct. Confidence is a property of the training objective, not a signal of accuracy.
Engineering certainty
If you can’t eliminate uncertainty, you can engineer around it. Three patterns that work.
Red/green TDD. Tests are a binary oracle that converts model uncertainty into verified output. The model writes code it’s 70% sure about. The test gives you pass or fail. You iterate. You converge. The model’s internal confidence doesn’t matter — the test does. This is the simplest and most reliable verification loop: let the model guess, let the test judge.
Templates and constraints. The narrower the prediction space, the higher the baseline certainty. A CLAUDE.md with explicit coding standards, a frontmatter schema with validated fields, a skill file with formatting rules — these aren’t documentation. They’re constraints that shrink the range of plausible predictions. The model stops guessing your conventions and starts following them.
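A constraint like a frontmatter schema can be enforced mechanically. This is a minimal sketch with illustrative field names; a real setup might use a JSON Schema validator, but the idea is the same: unknown fields and missing fields are rejected, so the model can't invent conventions.

```python
# Illustrative schema: required fields and their types.
SCHEMA = {
    "title": str,
    "date": str,
    "tags": list,
}

def validate_frontmatter(fm: dict) -> list[str]:
    """Return a list of violations; an empty list means the output fits."""
    errors = []
    for field, typ in SCHEMA.items():
        if field not in fm:
            errors.append(f"missing field: {field}")
        elif not isinstance(fm[field], typ):
            errors.append(f"wrong type for {field}: expected {typ.__name__}")
    for field in fm:
        if field not in SCHEMA:
            # The model guessed a convention that doesn't exist here.
            errors.append(f"unknown field: {field}")
    return errors

print(validate_frontmatter({"title": "Post", "date": "2025-11-01",
                            "tags": ["llm"]}))                    # []
print(validate_frontmatter({"title": "Draft", "category": "ai"}))
```

Fed back into the agent loop, these error strings play the same role as a failing test: they shrink the prediction space on the next attempt.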
Grounded retrieval via MCP. Instead of relying on what the model remembers about a library’s API — lossy, potentially stale — you bring current documentation into the context window at inference time. With something like Context7, the model fetches up-to-date reference material and uses that as context, not its compressed training data. You move knowledge from “maybe remembered” to “definitely present.”
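The shape of grounding is simple even without the MCP plumbing. In this sketch, `fetch_docs` is a hypothetical stub standing in for a real tool call such as Context7; what matters is that the fetched text lands in the prompt, so the model reads it rather than recalling it.

```python
def fetch_docs(library: str, topic: str) -> str:
    """Stub for an MCP tool call that returns current documentation."""
    return f"[current {library} reference for {topic} goes here]"

def grounded_prompt(question: str, library: str, topic: str) -> str:
    """Move knowledge from 'maybe remembered' to 'definitely present'."""
    docs = fetch_docs(library, topic)
    return (
        "Answer using ONLY the reference below. "
        "If it is not covered there, say so.\n\n"
        f"--- reference ---\n{docs}\n--- end reference ---\n\n"
        f"Question: {question}"
    )

prompt = grounded_prompt("How do I configure retries?", "httpx", "transport")
print(prompt)
```

The instruction to answer only from the reference is a soft constraint, not a guarantee, but combined with fresh docs it sharply reduces the stale-API failure mode.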
All three patterns do the same thing: push the certainty floor higher. Context engineering IS certainty engineering. The better the context, the better the predictions.
Where uncertainty stays hard
Not all uncertainty yields to better tooling.
The dangerous combination: a system that has access to private data, exposure to untrusted content, and the ability to communicate externally. This is Simon Willison’s security trifecta, and prompt injection makes it fundamentally unsolved. You can’t train away the vulnerability because the model can’t reliably distinguish between instructions and data. That’s not a bug to be patched. It’s a structural property of systems that process natural language.
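You can't patch the vulnerability, but you can detect the combination. This is a sketch of a coarse capability audit, with illustrative names; it doesn't solve prompt injection, it only tells you when a system has all three legs of the trifecta and therefore needs a human in the loop.

```python
from dataclasses import dataclass

@dataclass
class AgentCapabilities:
    reads_private_data: bool
    ingests_untrusted_content: bool
    can_communicate_externally: bool

def has_trifecta(caps: AgentCapabilities) -> bool:
    """True when all three legs are present: the unsolved combination."""
    return (caps.reads_private_data
            and caps.ingests_untrusted_content
            and caps.can_communicate_externally)

# A coding agent on a local repo: untrusted input, but no exfiltration channel.
coding_agent = AgentCapabilities(True, True, False)
# An email assistant: reads your inbox, processes strangers' messages, sends mail.
email_agent = AgentCapabilities(True, True, True)

print(has_trifecta(coding_agent))  # False
print(has_trifecta(email_agent))   # True
```

The useful move is structural: drop one leg (often external communication) rather than trying to train the model to ignore injected instructions.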
Fully autonomous pipelines with no human review are coming. Some are already here. And the failure mode shifts from hallucination — the model getting something wrong — to security: the model doing exactly what a malicious input told it to do, confidently and correctly.
This is where human judgment still matters most. Not because humans are better at parsing text, but because humans can evaluate whether an action should happen in context. The model can tell you what the next token is. It can’t tell you whether executing that action is a good idea given everything outside its context window.
Simon Willison’s appearance on Lenny’s Podcast covers overlapping ground on agentic patterns, the dark factory concept, and the security trifecta. Worth the listen.
Calibrating trust
The question isn’t “can I trust the model?” It’s “what’s my verification strategy for this specific output?”
High-certainty zone. Common patterns, well-known APIs, standard implementations. Let the model run. Review the diff. The agent writes a React component using hooks you’ve used a hundred times — you skim it, approve it, move on. The cost of verification is low because your intuition is reliable here too.
Low-certainty zone. Unfamiliar libraries, complex business logic, anything touching security or data integrity. Tests, retrieval, human checkpoint. The agent generates a database migration — you read every line, check the rollback path, verify it against production schema. The cost of verification is higher, but so is the cost of getting it wrong.
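The zone classification itself is the human judgment the next paragraph describes; what can be encoded is the policy that follows from it. A sketch, with illustrative signals and strategy names:

```python
VERIFICATION = {
    "high_certainty": ["skim diff", "approve"],
    "low_certainty": ["run tests", "ground with current docs",
                      "line-by-line human review"],
}

def strategy(touches_security_or_data: bool, familiar_pattern: bool) -> list[str]:
    """Map a judgment call about the output to a verification checklist."""
    zone = ("low_certainty"
            if touches_security_or_data or not familiar_pattern
            else "high_certainty")
    return VERIFICATION[zone]

# A React component built from hooks you've used a hundred times.
print(strategy(touches_security_or_data=False, familiar_pattern=True))
# A database migration: familiar shape, but it touches data integrity.
print(strategy(touches_security_or_data=True, familiar_pattern=True))
```

The table is trivial on purpose: the hard part is supplying the two booleans honestly, which no lookup can do for you.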
The engineering skill is knowing which zone you’re in. That’s not something the model can tell you. It comes from understanding your system, your domain, and the specific shape of the problem. The model provides the pattern matching. You provide the judgment about when to trust the patterns.
The engineers who thrive with these tools aren’t the ones who trust everything or distrust everything. They’re the ones who’ve calibrated — who’ve developed an intuition for where the model is strong and where it’s guessing, and who’ve built verification strategies that match.