The reliable agents all do the same boring thing: they check their own work

Two of the most useful agent releases share one unglamorous idea.

Verification is a step you build, not a property you hope for.

LangChain shipped Rubrics for Deep Agents, structured criteria an agent uses to evaluate its own output and correct it before returning. https://www.langchain.com/blog/introducing-rubrics-for-deepagents

Harvey, with LangChain Labs, detailed how to make verifiers for legal agents cheap enough to run at scale. https://www.langchain.com/blog/designing-efficient-verifiers-for-legal-agents

Different angles. Same pattern.

Most people still chase reliability by reaching for a bigger model.

They assume the next frontier release will finally stop the agent from confidently producing wrong answers.

It will not.

A single forward pass has no idea whether it just succeeded or failed.

The fix is structural, not magical. You add a step whose only job is to judge the work:
• define what a correct output looks like
• check the output against that definition
• send failures back for another pass
• only then trust the result
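The four steps above can be sketched as a small loop. This is a minimal illustration, not the Rubrics API from Deep Agents or Harvey's verifier: `Criterion`, `run_with_verification`, and `toy_generate` are hypothetical names, and the toy generator stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One line of the definition of 'correct' (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]  # returns True when the output passes


def run_with_verification(
    generate: Callable[[list[str]], str],
    criteria: list[Criterion],
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Generate, check against every criterion, retry with the failures as feedback."""
    feedback: list[str] = []
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        # Check the output against the definition of correct.
        failures = [c.name for c in criteria if not c.check(output)]
        if not failures:
            # Only now is the result trusted.
            return output, attempt
        # Send the failures back for another pass.
        feedback = failures
    raise RuntimeError(f"still failing after {max_attempts} attempts: {feedback}")


def toy_generate(feedback: list[str]) -> str:
    """Stand-in for a model call: it adds a citation only when told one is missing."""
    if "cites_source" in feedback:
        return "The clause is enforceable. Source: contract section 4.2"
    return "The clause is enforceable."


criteria = [
    Criterion("non_empty", lambda out: bool(out.strip())),
    Criterion("cites_source", lambda out: "Source:" in out),
]

result, attempts = run_with_verification(toy_generate, criteria)
print(attempts, result)
```

The first draft fails the `cites_source` check, the failure goes back as feedback, and the second draft passes. In a real system the checks would likely be model-graded rather than string matches, but the shape of the loop stays the same.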

This is the difference between an agent that demos and an agent you can put in front of a user.

And the cost objection is fading. Harvey drove verification cost down by an order of magnitude by batching checks and using open models.

Cheap verification is what makes the pattern practical.
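To see why batching changes the cost, here is a rough sketch. This post does not describe how Harvey batches checks, so this is one plausible reading and not their implementation: `CountingJudge`, `verify_one_by_one`, and `verify_batched` are hypothetical names, and the judge is a trivial stand-in for a verifier model.

```python
class CountingJudge:
    """Stand-in verifier that counts how many times it is called.

    Assumption: a real judge would be a model call that accepts several
    outputs in a single request and returns one verdict per output.
    """

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, batch: list[str]) -> list[bool]:
        self.calls += 1
        # Toy check: an output passes if it is not blank.
        return [bool(text.strip()) for text in batch]


def verify_one_by_one(outputs: list[str], judge: CountingJudge) -> list[bool]:
    """One verifier call per output."""
    return [judge([output])[0] for output in outputs]


def verify_batched(
    outputs: list[str], judge: CountingJudge, batch_size: int = 8
) -> list[bool]:
    """One verifier call per group of up to batch_size outputs."""
    verdicts: list[bool] = []
    for start in range(0, len(outputs), batch_size):
        verdicts.extend(judge(outputs[start:start + batch_size]))
    return verdicts


outputs = [f"answer {i}" for i in range(20)]
outputs[5] = "   "  # one blank output that should fail the check

single_judge = CountingJudge()
batched_judge = CountingJudge()

single_verdicts = verify_one_by_one(outputs, single_judge)
batched_verdicts = verify_batched(outputs, batched_judge, batch_size=8)

print(single_judge.calls, batched_judge.calls)
```

Both paths return the same verdicts, but the batched path makes 3 calls where the one-by-one path makes 20. Call count is only a proxy for cost; real savings depend on the model and on how well it judges several items at once.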

Reliability is not a smarter model.

It is a system built to check its own work.
