
The Ghost Validator Problem

A month ago I put a customer-request agent into production for a small industrial company. Twelve people, electronic components, the kind of catalog where getting a spec wrong can cost real money very quickly. The system did retrieval over product docs, routed inquiries, and drafted replies. I added a “validate before send” button because caution is cheap compared with cleaning up a bad quote.

In the first two weeks, the team intercepted 23 replies that would have created problems. Some were technically correct and still wrong in practice. One answer sounded like it came from a legal department, sent to a client who had been on first-name terms with them for five years. Seven replies contained subtle product-reference mix-ups. The model confused MX-2847 with MX-2874, which looks trivial until you remember those two transposed digits hide a €400 price difference. Two commercial proposals, in the sales director’s words, would have sent the client running.

That experience clarified the real question for me. Human review obviously helps. The hard part is deciding how far to push it before “oversight” becomes another bottleneck, or worse, a ritual that makes everyone feel responsible while nobody is really paying attention.

The button that made everyone feel safe

Most conversations about human-in-the-loop systems start from the same comforting idea: let the model do the work, let the person approve the result, and you keep safety without sacrificing speed. In clean diagrams, that sounds sensible. In a real company, the approval step changes shape almost immediately.

During the first week, everyone read every draft line by line. They checked part numbers, dates, pricing logic, and tone. After a hundred clean validations, their behavior changed. They still opened the reply, but they scanned instead of reading. After a few hundred, they were mostly checking whether the answer looked coherent in the broad sense. If it seemed plausible, they clicked approve.

That is the trap nobody likes to say out loud: the better the agent gets, the less carefully the human looks. Accuracy creates complacency. Success trains inattention.

I call it ghost validation. The approval still happens. The button still gets clicked. People can honestly tell themselves they reviewed the output. But the actual cognitive work of review has faded away. The system keeps the appearance of human supervision after the substance has started to disappear.

We saw it clearly at the start of week four. The agent drafted a reply describing a new feature that did not exist. It was a very elegant hallucination, the dangerous kind. It fit the rest of the product story, used the right vocabulary, and sounded consistent with previous updates. Someone approved it because it “made sense.” The client called back asking for a quote on the feature. That was an awkward phone call, and a useful one.

This is why human review can become security theater. It does not merely fail quietly. It creates the feeling that the system is protected, which makes the eventual miss harder to anticipate and harder to explain.

Training attention instead of teaching policy

My first attempt at training the team was exactly what you would expect from a well-meaning consultant. It was a 40-slide deck about best practices for validating AI outputs. There was a 12-point checklist. There were generic examples. It was tidy, complete, and almost entirely useless.

Nobody applied it consistently because it asked them to remember an abstract policy while doing real work under time pressure. Attention does not improve because you hand someone a framework. It improves when the work itself teaches their eyes what to look for.

What finally worked was much simpler. For one week, I deliberately planted errors in the agent’s outputs. Small errors, not cartoonish ones. A date shifted by one day. A product reference with two digits inverted. A sentence that was formally polite in a company that speaks casually to long-term clients. Then I timed how long it took reviewers to catch them, and who caught what.
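
If you want to run a similar drill, a minimal sketch is below. The mutation rules mirror the errors I planted, but the function names, the reply format, and the way catches get recorded are illustrative assumptions, not the tooling from this project.

    import random
    import re
    import time
    from dataclasses import dataclass

    # Each mutation mirrors one of the planted error types: transposed part-number
    # digits, a date shifted by one day, a greeting too formal for the client.
    def transpose_part_digits(text: str) -> str:
        """Swap the last two digits of the first part number found, e.g. MX-2847 -> MX-2874."""
        m = re.search(r"\b([A-Z]{2}-\d{2})(\d)(\d)\b", text)
        if not m:
            return text
        return text[:m.start()] + m.group(1) + m.group(3) + m.group(2) + text[m.end():]

    def shift_date_one_day(text: str) -> str:
        """Bump the day of the first ISO date found (naive: ignores month rollover)."""
        m = re.search(r"\b(\d{4}-\d{2}-)(\d{2})\b", text)
        if not m:
            return text
        return text[:m.start()] + m.group(1) + f"{min(int(m.group(2)) + 1, 28):02d}" + text[m.end():]

    def formalize_tone(text: str) -> str:
        """Crude stand-in for tone drift: replace a casual greeting with a stiff one."""
        return text.replace("Hi ", "Dear Sir or Madam, ", 1)

    MUTATIONS = {
        "part_number_transposition": transpose_part_digits,
        "date_shift": shift_date_one_day,
        "tone_drift": formalize_tone,
    }

    @dataclass
    class SeededError:
        draft_id: str
        kind: str
        planted_at: float
        caught_by: str | None = None
        caught_after_s: float | None = None

    def plant_error(draft_id: str, draft: str, log: list[SeededError]) -> str:
        """Apply one random mutation and remember it so catches can be scored later."""
        kind = random.choice(list(MUTATIONS))
        log.append(SeededError(draft_id, kind, time.time()))
        return MUTATIONS[kind](draft)

    def record_catch(draft_id: str, reviewer: str, log: list[SeededError]) -> None:
        """Call this when a reviewer flags the draft; stores who caught it and how fast."""
        for err in log:
            if err.draft_id == draft_id and err.caught_by is None:
                err.caught_by = reviewer
                err.caught_after_s = time.time() - err.planted_at
                return

The point is not the code. It is that every seeded error carries a timestamp, a type, and eventually a name attached to the catch, which is what made the week measurable.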

That drill turned review into a search task instead of an administrative obligation. People started comparing notes. One person was good at spotting tone drift. Another had an instinct for suspicious part numbers. A third noticed when delivery promises felt too generous for their supply chain reality. The team became sharper because they had concrete misses to hunt, not principles to memorize.

That shift mattered more than the training material itself. Review is usually framed as a compliance task, which is why it becomes dull so quickly. Once people see it as pattern recognition inside their own domain, they engage differently. They are no longer checking a box. They are trying to catch a system that can sound convincing while being slightly wrong.

After that week, I reduced the guidance to a single A4 sheet. One page, six failure modes, all specific to that agent. Not “watch for hallucinations” in the abstract. More like: part-number transpositions, outdated lead times, false certainty when the documentation is thin, tone that reads too formal for legacy customers, cross-family substitutions that look plausible but break compatibility, and invented feature bundling in commercial replies. If you saw one of those patterns, you slowed down.
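
The sheet also translates almost directly into data the review screen can surface next to a draft. The six failure modes below are the ones from that page; the field names and prompt wording are illustrative, not what we actually shipped.

    FAILURE_MODES = [
        {"id": "part_number_transposition",
         "prompt": "Re-read every part number digit by digit."},
        {"id": "outdated_lead_time",
         "prompt": "Check delivery promises against the current supplier sheet."},
        {"id": "false_certainty",
         "prompt": "Is the documentation actually thin here? Look for hedges the agent skipped."},
        {"id": "over_formal_tone",
         "prompt": "Would this phrasing feel cold to a long-term customer?"},
        {"id": "cross_family_substitution",
         "prompt": "A substituted product must share the family, not just look plausible."},
        {"id": "invented_feature_bundle",
         "prompt": "Every feature mentioned must exist in the catalog today."},
    ]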

We also added a 15-minute review session every Friday. We looked at two or three edge cases from the week and argued about them. Would you have approved this? Why did this sound okay on first read? What detail changed the decision? Those discussions were more useful than any policy document because judgment is social. Teams need shared calibration. If one person approves anything that seems roughly correct and another rejects every sentence that is not perfect, the process becomes chaos with a nice interface.

The last piece was a dashboard that tracked interceptions. How many errors did each reviewer catch this week, and what kind? I was careful with the framing. Catching an error was not evidence that the agent failed. It was evidence that the whole system worked. If you want people to stay alert, you have to reward detection, not just speed.
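
The dashboard itself was nothing sophisticated. A weekly tally over records like the ones from the drill sketch is enough; the summary shape below is an assumption about how you might report it, not the actual implementation.

    from collections import Counter

    def weekly_catch_summary(seeded_errors: list) -> dict:
        """Count catches per reviewer and per failure type, plus anything nobody caught.

        Takes SeededError-style records (an assumption about where the data lives;
        it could just as well come from the review tool's own logs).
        """
        per_reviewer: Counter = Counter()
        per_kind: Counter = Counter()
        missed = 0
        for err in seeded_errors:
            if err.caught_by is None:
                missed += 1
            else:
                per_reviewer[err.caught_by] += 1
                per_kind[err.kind] += 1
        return {
            "catches_per_reviewer": dict(per_reviewer),
            "catches_per_failure_mode": dict(per_kind),
            "missed": missed,
        }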

What surprised me was how quickly this changed behavior. Once reviewers had examples, feedback, and a little status attached to good catches, the “approve” button stopped feeling ceremonial. It became a moment of real scrutiny again. That vigilance still decays over time, and it has to be refreshed, but now we were managing a human skill instead of assuming it would magically persist.

Review has to follow risk, not doctrine

If your agent makes a thousand decisions a day and a human must inspect every one, you have not automated much. You have created a new job category with all the glamour of airport security and less legroom.

So review has to be stratified. The common mistake is doing this conceptually rather than operationally. Teams classify tasks as “high value” or “low value,” which sounds thoughtful and usually means nothing. The useful question is more concrete: what happens if this specific action is wrong, and can we reverse it cheaply?

In practice, the categories look something like this:

  • Irreversible or high-impact actions deserve mandatory review every time. Payments, contractual commitments, data deletion, price changes, or anything that alters an external relationship in a lasting way should not go out uninspected.
  • Repetitive, low-risk actions usually work better with sampling. Let the agent act, then audit a percentage of outputs at random. That preserves vigilance without turning the whole operation into manual labor.
  • Reversible actions can often run with delayed oversight. An agent can send a low-stakes email or update an internal field if a human has a window to cancel, amend, or roll back the decision.
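
In code, that triage is short. The sketch below assumes each agent action carries a few flags set by whoever configures the workflow; the attribute names and the 10 percent audit rate are placeholders, not recommendations.

    import random
    from enum import Enum

    class ReviewMode(Enum):
        MANDATORY = "human approves before anything goes out"
        SAMPLED = "act now, audit a random slice afterwards"
        DELAYED = "act after a cancellation window unless a human intervenes"

    def review_mode(action) -> ReviewMode:
        """Route by consequence and reversibility, not by model confidence.

        `action` is assumed to expose three booleans set at configuration time:
        irreversible_or_high_impact, low_risk_and_repetitive, and reversible.
        """
        if action.irreversible_or_high_impact:
            return ReviewMode.MANDATORY
        if action.low_risk_and_repetitive:
            return ReviewMode.SAMPLED
        if action.reversible:
            return ReviewMode.DELAYED
        # Anything that fits none of the cheap tests defaults to the cautious path.
        return ReviewMode.MANDATORY

    def pulled_into_audit(audit_rate: float = 0.10) -> bool:
        """Decide whether a SAMPLED action joins this week's audit queue."""
        return random.random() < audit_rate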

This sounds straightforward until someone reaches for confidence thresholds. Then you get the familiar sentence: anything below 85 percent goes to a human, anything above goes through automatically. The number looks scientific, which is part of the problem. Confidence is often a weak proxy for correctness, especially in the cases that matter most. A model can be highly confident while confusing two adjacent product lines. Retrieval confidence can tell you the documents were close, not that the answer was faithful. Tone errors and relationship mistakes barely show up in these scores at all.

I have seen teams spend weeks debating whether the right threshold is 85 or 90. Meanwhile, the biggest risks sat somewhere else entirely. A low-confidence answer to a common internal question might be harmless. A high-confidence commercial promise can damage a customer relationship in one click.

The right design usually has less to do with a universal rule than with reversibility, cost of error, and the kinds of mistakes the agent actually makes in your environment. That last phrase matters. Not the mistakes large models make in theory. The mistakes this agent makes with your documents, your workflows, your customers, and your constraints.

The bottlenecks are human before they are technical

On paper, human review sounds reassuring because it preserves control. In an organization, it also creates two very different reactions.

One group loves it immediately. The agent handles the repetitive drafting, and they keep the parts requiring taste, context, and judgment. For them, the system feels like power steering. Same destination, less strain.

The other group sees the shape of the future and does not like what it suggests. They hear, “You still approve everything,” and mentally append, “for now.” If your job starts to collapse into reviewing machine output, it is reasonable to wonder whether the company is training your replacement in public.

I do not think honesty helps by making this fear disappear. It helps by refusing to pretend the fear is irrational. In many workflows, the long-term direction is obvious. As agents improve, item-by-item approval will cover a shrinking share of cases. Telling people otherwise is insulting. What I can say truthfully is that today the system performs better with experienced humans inside it, and the best use of that experience is not blind approval. It is exception handling, context injection, and system correction.

There is another human problem that gets less attention: reviewing output all day is cognitively exhausting. Fifty documents that are 95 percent correct can drain someone faster than writing five from scratch. The brain has to stay alert without the natural momentum that comes from composing something yourself. Maximum concentration, minimal intrinsic interest. It is a rotten combination.

Then you get the rewrite trap. A draft is good enough to send, but a conscientious reviewer can see two sentences that could be better. So they improve them. If that happens every time, the person is no longer validating. They are co-authoring. On the SME project, one reviewer averaged eight minutes per approval. Eight minutes to validate an email the agent produced in three seconds. At that point, the automation was technically functioning and economically questionable.

Responsibility gets muddy too. If a person approves a machine-generated reply and it causes a problem, who owns the mistake? The reviewer, because they clicked yes? The manager who introduced the system? The vendor? The person who configured the prompts and routing? Companies dislike this ambiguity because responsibility tends to become very clear only after something goes wrong. Before that, everyone enjoys a pleasant fog.

Complexity is eroding the meaning of validation

There is a scene in Her where the AI begins evolving faster than the humans around it can really follow. The system remains useful, even intimate, but becomes less legible at the same time. I think about that whenever people talk about human oversight as if it guarantees understanding.

The agents I build now are already complex enough that “why did it produce this exact answer?” is often a layered question. Retrieval over internal documentation, reranking, multiple search passes, tool calls, prompt templates that change by context, temperature adjustments, policy guards, fallback logic, memory behavior. None of this is science fiction. It is standard implementation detail. And then we ask a salesperson or support rep to validate the output as if they are reviewing a junior colleague’s draft.

But what are they validating exactly? They can judge whether the final answer sounds right. They can compare it with local knowledge, relationship history, and domain specifics that never made it into the documentation. Those are real forms of judgment. What they usually cannot do is inspect the underlying path the system took. They are not verifying reasoning in any deep sense. They are evaluating a polished surface.

That distinction matters because it changes what “human in the loop” can honestly mean. For simpler systems, direct review of outputs is often enough. For more complex ones, output review becomes shallower while the machinery underneath grows denser. The human’s role shifts from understanding the process to spotting symptoms.

Sometimes that is still fine. We do not need airline passengers to understand fly-by-wire control systems before they board a plane. But the analogy breaks down quickly because those systems are engineered, tested, and regulated in ways most AI agents are not. In business settings, a lot of oversight still depends on ordinary employees making contextual judgments with partial information.

That is why I increasingly think the valuable human contribution is not “approve each answer.” It is maintaining the environment in which good answers are more likely. Humans decide which documents are authoritative. They define escalation rules. They spot recurring failure patterns and turn them into guardrails. They know when a long-time customer deserves a direct call rather than a beautifully written email. They recognize the difference between a technically acceptable statement and one that would quietly damage trust.
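
To make “turn failure patterns into guardrails” concrete: after the part-number mix-ups, a pre-send check can refuse to release a draft that mentions a reference missing from the catalog. The sketch below assumes the catalog is available as a simple set of strings, with placeholder contents. It will not catch a valid-but-wrong reference, but it blocks invented ones and buys the reviewer a slower read.

    import re

    # Assumption: the current catalog exposes its valid references somewhere;
    # the contents here are placeholders.
    VALID_PART_NUMBERS = {"MX-2847", "MX-2874", "MX-2901"}

    PART_NUMBER = re.compile(r"\b[A-Z]{2}-\d{4}\b")

    def unknown_part_numbers(draft: str) -> list[str]:
        """Return every reference in the draft that does not exist in the catalog."""
        return [ref for ref in PART_NUMBER.findall(draft) if ref not in VALID_PART_NUMBERS]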

When people are reduced to rubber-stamping, the loop has already broken. The role looks safe because a person remains present. In practice, the person has lost leverage.

Oversight has to move up a level

I still use human review. I will keep using it where the cost of error justifies the friction. I have seen it prevent real damage. I have also watched it decay into ceremony faster than most teams expect.

The durable version is not “a human checks everything forever.” It is a layered system where human attention is spent where it changes outcomes: high-impact approvals, targeted audits, failure analysis, policy design, and domain corrections that the model cannot infer from documents alone. That makes review rarer at the item level and more serious when it happens.

The deeper question is no longer how far to push human-in-the-loop before letting go. It is where human judgment is genuinely additive, and where we are only preserving the optics of control. The distinction is easy to blur because both versions include a button and a person. Only one of them still contains real thought.

End of entry.

Published April 2026