Shipping LLM agents that don't fall over

April 18, 2026 · 2 minute read · By vihanga

Hard-won notes on what separates a demo agent from one you can put in front of users.

Published: April 18, 2026
Reading: 2 min
Subject: AI · Agents · Engineering
Author: vihanga

The first agent I shipped to real users looked flawless in our staging logs and then promptly tried to refund a customer for the wrong order. The model hadn't hallucinated; each step looked reasonable in isolation. What broke was the seam: one tool returned an order ID as a string, another expected it as a number, and the agent dutifully converted by picking the first numeric run it saw.
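A minimal sketch of that kind of seam failure, next to a stricter alternative. The order ID format `ORD-2024-00873` and both function names are hypothetical, chosen only to illustrate the coercion; the original incident's actual values are not known.

```typescript
// Hypothetical order ID format, for illustration only.
const rawOrderId = "ORD-2024-00873";

// The lossy coercion: grab the first run of digits and hope it is the ID.
function firstNumericRun(s: string): number | undefined {
  const match = s.match(/\d+/);
  return match ? Number(match[0]) : undefined;
}

// A strict alternative: accept only strings that consist entirely of digits,
// and return a structured failure for everything else.
type Parsed = { ok: true; value: number } | { ok: false; error: string };

function parseOrderId(s: string): Parsed {
  if (!/^\d+$/.test(s)) {
    return { ok: false, error: `not a numeric order ID: ${JSON.stringify(s)}` };
  }
  return { ok: true, value: Number(s) };
}

console.log(firstNumericRun(rawOrderId)); // 2024: plausible-looking, and wrong
console.log(parseOrderId(rawOrderId));    // a structured failure, not a guess
console.log(parseOrderId("873"));         // { ok: true, value: 873 }
```

The strict parser does not make the agent smarter; it turns a silent wrong answer into an explicit error that something upstream has to handle.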

That experience reshaped how I think about agent reliability. Almost every failure I've seen since has been a variant of the same problem.

The taxonomy of agent failure

There are really only four ways an agent fails in production, and only the first one is what most people talk about:

1. Reasoning errors: the model misunderstands the task.
2. Tool contract drift: the schema the model was given (in its prompt or training data) and the schema the tool actually exposes have diverged.
3. State leakage: context from a previous turn poisons the next decision.
4. Operator surprise: humans intervene in ways the agent's policy didn't anticipate.

Reasoning errors are the rarest of the four. The other three are where the debugging time goes.

Treating tools like APIs, not prompts

The single biggest lift in reliability came from a boring rename: I stopped calling them "tools" internally and started calling them functions with SLAs. That sounds cosmetic, but the discipline followed: latency budgets, error envelopes, structured failures, idempotency keys.

lib/agent/tool.ts

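import { z } from "zod";

// ToolContext, Result, and ToolError are not shown here.
export interface Tool<TInput, TOutput> {
  name: string;
  description: string;
  input: z.ZodType<TInput>;   // schema for what the tool accepts
  output: z.ZodType<TOutput>; // schema for what the tool returns
  call: (input: TInput, ctx: ToolContext) => Promise<Result<TOutput, ToolError>>;
}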

Wrapping every tool in a Result type, validating both ends with Zod, and threading a context object through the call site eliminated an entire class of bug. The agent can still misuse a tool — but it can no longer silently misuse one.
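One way the wrapping described above could look, as a dependency-free sketch. Everything here is an assumption for illustration: the shapes of `Result`, `ToolError`, and `ToolContext`, the `idempotencyKey` and `budgetMs` fields, and the `invoke` helper are not from the original code, and a hand-rolled `Validator` type stands in for the Zod schemas so the example runs on its own.

```typescript
// Minimal stand-ins (assumptions, not the post's real definitions).
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

interface ToolError {
  kind: "invalid_input" | "invalid_output" | "timeout" | "failed";
  message: string;
}

interface ToolContext {
  idempotencyKey: string; // lets a retried call be deduplicated downstream
  budgetMs: number;       // latency budget for this call
}

// Stand-in for a Zod schema: returns the typed value or a reason it failed.
type Validator<T> = (raw: unknown) => Result<T, string>;

interface Tool<TInput, TOutput> {
  name: string;
  input: Validator<TInput>;
  output: Validator<TOutput>;
  call: (input: TInput, ctx: ToolContext) => Promise<unknown>;
}

const fail = (kind: ToolError["kind"], message: string): Result<never, ToolError> =>
  ({ ok: false, error: { kind, message } });

const TIMEOUT = Symbol("timeout");

// Validate the input, enforce the latency budget, then validate the output.
// Every failure path comes back as a structured ToolError.
async function invoke<I, O>(
  tool: Tool<I, O>,
  raw: unknown,
  ctx: ToolContext,
): Promise<Result<O, ToolError>> {
  const input = tool.input(raw);
  if (!input.ok) return fail("invalid_input", input.error);

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMEOUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMEOUT), ctx.budgetMs);
  });
  try {
    const raced = await Promise.race([tool.call(input.value, ctx), timeout]);
    if (raced === TIMEOUT) return fail("timeout", `${tool.name} exceeded ${ctx.budgetMs}ms`);
    const output = tool.output(raced);
    return output.ok ? output : fail("invalid_output", output.error);
  } catch (e) {
    return fail("failed", e instanceof Error ? e.message : String(e));
  } finally {
    clearTimeout(timer);
  }
}

// Usage with a toy tool: a string where a number is expected is rejected
// up front instead of being coerced.
const isNumber: Validator<number> = (raw) =>
  typeof raw === "number" ? { ok: true, value: raw } : { ok: false, error: "expected a number" };

const double: Tool<number, number> = {
  name: "double",
  input: isNumber,
  output: isNumber,
  call: async (n) => n * 2,
};

invoke(double, "21", { idempotencyKey: "demo-1", budgetMs: 100 }).then((r) => console.log(r));
```

Note that the timeout here only stops waiting; it does not cancel the underlying call. That gap is one reason an idempotency key is worth threading through the context.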

The 80/20 rule of agent observability

You don't need traces, evals, and replay infra on day one. You need one thing: a log line per tool call that records the input, output, latency, and which model turn produced it. That single line catches most of the bugs you'll see in the first month.
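A sketch of what that single log line might look like. The field names, the `loggedCall` helper, and the `lookupOrder` tool in the usage line are hypothetical; the only parts taken from the text are the four things recorded: input, output, latency, and model turn.

```typescript
interface ToolCallLog {
  tool: string;
  turn: number;      // which model turn requested the call
  input: unknown;
  output: unknown;   // the returned value, or the error message if the call threw
  ok: boolean;
  latencyMs: number;
}

type Sink = (line: string) => void;

// Wrap a tool call so it always emits exactly one JSON line, on success or failure.
async function loggedCall<I, O>(
  tool: string,
  turn: number,
  input: I,
  fn: (input: I) => Promise<O>,
  sink: Sink = console.log,
): Promise<O> {
  const start = Date.now();
  const emit = (ok: boolean, output: unknown): void => {
    const entry: ToolCallLog = { tool, turn, input, output, ok, latencyMs: Date.now() - start };
    sink(JSON.stringify(entry));
  };
  try {
    const output = await fn(input);
    emit(true, output);
    return output;
  } catch (e) {
    emit(false, e instanceof Error ? e.message : String(e));
    throw e; // logging must not swallow the failure
  }
}

// Usage with a hypothetical lookupOrder tool called on model turn 3.
loggedCall("lookupOrder", 3, { orderId: 873 }, async ({ orderId }) => ({
  orderId,
  status: "shipped",
}));
```

Keeping the sink as a parameter makes it easy to point the same line at stdout in development and at a file or log shipper later.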

What I'd build differently

If I were starting over today, the first commit wouldn't be the agent loop. It would be the contract layer — typed tools, structured errors, and a deterministic replayer that can re-run any production trace from disk. The model is the part of the system you have the least control over. Everything around it should be the part you have the most.
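To make the replayer idea concrete, here is a minimal sketch under several assumptions: traces are stored as JSON Lines with one recorded tool call per line, tool calls are simplified to synchronous functions, and the names `RecordedCall`, `parseTrace`, and `replayer` are hypothetical. It covers only the tool side; replaying a full run deterministically would also need the model's responses recorded in a similar way.

```typescript
interface RecordedCall {
  tool: string;
  input: unknown;
  output: unknown;
}

type ToolRunner = (tool: string, input: unknown) => unknown;

// Parse a trace stored as JSON Lines. The text would typically be read
// from a file on disk; it is passed in as a string here to stay self-contained.
function parseTrace(jsonl: string): RecordedCall[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RecordedCall);
}

// Build a runner that serves recorded outputs in order and fails loudly
// as soon as the agent's calls diverge from the trace.
function replayer(trace: RecordedCall[]): ToolRunner {
  let cursor = 0;
  return (tool, input) => {
    const expected = trace[cursor];
    if (expected === undefined) {
      throw new Error(`trace exhausted at call ${cursor}: ${tool}`);
    }
    // Comparing serialized JSON is key-order sensitive; good enough for a sketch.
    if (expected.tool !== tool || JSON.stringify(expected.input) !== JSON.stringify(input)) {
      throw new Error(`divergence at call ${cursor}: expected ${expected.tool}, got ${tool}`);
    }
    cursor += 1;
    return expected.output;
  };
}

// Usage: the agent loop receives runTool in place of the live tools.
const trace = parseTrace(
  '{"tool":"lookupOrder","input":{"orderId":873},"output":{"status":"shipped"}}\n',
);
const runTool = replayer(trace);
console.log(runTool("lookupOrder", { orderId: 873 })); // the recorded output, no live call made
```

Failing on the first divergence is a deliberate choice: a replay that quietly drifts from the recorded run is no longer evidence about what happened in production.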

The reward for getting this right isn't a smarter agent. It's a boring one — and in production, boring is the highest compliment.

vihanga

Software & AI engineer. Writes about the craft of shipping things that don't embarrass you in production.