Everyone is still trying to prompt their agents into reliability. The tooling quietly moved the other way last week

LangChain shipped interpreter skills. Agents now run real code modules, not just prompt instructions.

https://www.langchain.com/blog/interpreter-skills

Deepagents added a middleware that grades the agent's own work against a rubric.

The pattern under both is the same.

Stop asking the model to behave. Start giving it behavior it cannot get wrong.

A prompt is a request. Every response is probabilistic.

Code is a guarantee. A deterministic function returns the same output for the same input, every time.

I moved a piece of agent behavior out of the prompt and into a script, and the accuracy went up, because the output became deterministic.
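To make that concrete, here is a minimal sketch of the kind of step that could move out of a prompt. The invoice-total task, the `invoice_total` name, and its parameters are all hypothetical, chosen only for illustration; the post does not say which behavior was moved.

```python
from decimal import Decimal, ROUND_HALF_UP


def invoice_total(line_items, tax_rate):
    """Hypothetical step: total an invoice with exact decimal arithmetic.

    line_items is a list of (unit_price, quantity) pairs; prices and the
    tax rate are passed as strings so no float rounding sneaks in.
    """
    subtotal = sum(
        (Decimal(str(price)) * quantity for price, quantity in line_items),
        Decimal("0"),
    )
    total = subtotal * (Decimal("1") + Decimal(str(tax_rate)))
    # Round to cents the same way on every call.
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# The same input gives the same answer on every run.
print(invoice_total([("19.99", 2), ("5.00", 1)], "0.08"))  # 48.58
```

Asking a model to do this arithmetic in a prompt may work most of the time. The function works the same way every time, which is the property being paid for.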

That is the whole trade, and it is not free.

Writing your own logic is expensive. Asking the model is cheaper, and less accurate.

So the real question on every step is which one you can afford to get wrong:
• if a step must be correct, encode it as code
• if a step needs judgment, leave it to the prompt
• if a step is right most of the time, wrap the prompt in a check
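The third rule, wrapping the prompt in a check, can be sketched roughly like this. Everything here is an assumption for illustration: `ask_model` stands in for whatever function calls your model, and `checked_call` and `has_required_keys` are hypothetical helpers, not part of LangChain or Deepagents.

```python
import json
from typing import Callable, Iterable


def has_required_keys(text: str, required: Iterable[str]) -> bool:
    """Deterministic check: text parses as a JSON object with the required keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required).issubset(data)


def checked_call(
    ask_model: Callable[[str], str],  # hypothetical: any function that calls a model
    prompt: str,
    check: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Ask the model, keep only an answer that passes the code-level check."""
    for _ in range(max_attempts):
        answer = ask_model(prompt)
        if check(answer):
            return answer
    raise ValueError(f"no answer passed the check after {max_attempts} attempts")


# Stand-in model that fails once, then returns valid output.
replies = iter(["not json", '{"city": "Paris", "country": "France"}'])
result = checked_call(
    lambda prompt: next(replies),
    "Return the city and country as JSON.",
    lambda text: has_required_keys(text, {"city", "country"}),
)
```

The model still does the judgment work, but the code decides whether the result is accepted, and a step that never passes fails loudly instead of flowing downstream.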

Skills that execute code are a bigger update than they looked. They move the reliable part of an agent out of language and into logic.

Prompts make an agent sound right.

Code makes it be right.

#ai #agentic-ai #llm
