2026-04-18 · 5 min read
Agent Evals · Verifiable Systems · LLM Production · Banking · MCP

What Bank-Grade Key Management Teaches You About Agent Eval Harnesses

Five disciplines from banking security — durable state, deterministic failure, dual control, audit trails, and recovery playbooks — applied to LLM agent evaluation.

Why this matters

Most agent systems shipped today are evaluated the way demos are evaluated: a human looks at a transcript, shrugs, ships. In regulated domains — banking, fintech, compliance, insurance — that pattern is unshippable. The question is not "does the agent sound right?" but "can we prove, after the fact, why it did what it did — and rerun the same inputs to reproduce it?"

I spent most of my pre-2024 years building wallet architectures, Lightning UI prototypes, and custodial solutions for banks. The mental model I ended up with was not "cryptography for its own sake" but systems designed so that any claim can be verified by someone who doesn't trust you. That mindset translates almost 1-to-1 to production LLM agents, and it's the gap I see most often in teams coming from web-app engineering.

Below are five disciplines I carry over, with the agent-system mapping.

1. Durable state, not transient context

Banking: a key never lives only in memory. It exists in an HSM, a backup HSM, an encrypted shard set, and an audit log that records every time it was touched. Nothing ephemeral is trusted.

Agent parallel: agent state must be persisted before any tool call that has side effects, not after. LangGraph's checkpointer is the minimum bar. Without durable state you can't answer "what did the agent know at step N?" — which is the first question every post-incident review asks.

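# Not this:
result = agent.invoke(input)  # state only in RAM

# This (assumes the graph was compiled with a checkpointer,
# e.g. graph = builder.compile(checkpointer=saver)):
result = graph.invoke(
    input,
    config={"configurable": {"thread_id": session_id}},  # state persisted per thread
)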

Cost: ~100ms per step. Benefit: you can replay, audit, and debug every trajectory.
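To make the "before, not after" ordering concrete, here is a minimal sketch that uses only Python's standard library rather than LangGraph. The `checkpoints` table, the helpers `open_store`, `save_checkpoint`, `load_checkpoint`, and `run_step`, and the in-memory SQLite path are all hypothetical choices for illustration; a real deployment would point at a durable store.

```python
import json
import sqlite3
from typing import Callable, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a checkpoint store. Pass a file path for real durability."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        " thread_id TEXT NOT NULL,"
        " step INTEGER NOT NULL,"
        " state TEXT NOT NULL,"
        " PRIMARY KEY (thread_id, step))"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, thread_id: str, step: int, state: dict) -> None:
    """Commit the agent state for this step."""
    with conn:  # commits on success, rolls back if the INSERT raises
        conn.execute(
            "INSERT INTO checkpoints (thread_id, step, state) VALUES (?, ?, ?)",
            (thread_id, step, json.dumps(state, sort_keys=True)),
        )


def load_checkpoint(conn: sqlite3.Connection, thread_id: str, step: int) -> Optional[dict]:
    """Answer 'what did the agent know at step N?'."""
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ? AND step = ?",
        (thread_id, step),
    ).fetchone()
    return json.loads(row[0]) if row else None


def run_step(conn: sqlite3.Connection, thread_id: str, step: int,
             state: dict, tool: Callable[[dict], object]) -> object:
    """Persist first, then run the side-effecting tool."""
    save_checkpoint(conn, thread_id, step, state)
    return tool(state)
```

If the tool raises halfway through, the state it was called with is already committed, so a post-incident review can load step N and see exactly what the agent knew.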

2. Deterministic failure modes

Banking: key operations either succeed or fail cleanly. There is no "kind of succeeded" path. A partial signature is a security incident.

Agent parallel: every tool contract must have typed, exhaustive failure modes. No bare "except Exception: pass". No "just let the LLM retry" without an explicit termination condition. I use Pydantic schemas for every tool I/O, with an explicit status field that enumerates the failure cases the LLM is expected to handle:

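from typing import Literal
from pydantic import BaseModel

class SearchResult(BaseModel):
    status: Literal["ok", "rate_limited", "empty", "auth_error"]
    hits: list[Hit] = []  # Hit is the per-result model, defined elsewhere
    retry_after_s: int | None = None  # only meaningful when status == "rate_limited"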

The LLM can reason about status: rate_limited → wait. It cannot reason about a half-cooked exception trace.
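The "explicit termination" point can be sketched as a retry wrapper that branches only on the typed status and stops after a fixed budget. This is an illustration under assumptions: it uses a standard-library dataclass as a stand-in for the Pydantic model, and `ToolResult`, `call_with_budget`, and the injected `sleep` parameter are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Literal, Optional

Status = Literal["ok", "rate_limited", "empty", "auth_error"]


@dataclass
class ToolResult:
    """Stdlib stand-in for a typed tool result."""
    status: Status
    hits: list = field(default_factory=list)
    retry_after_s: Optional[int] = None


def call_with_budget(
    tool: Callable[[], ToolResult],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = lambda seconds: None,  # pass time.sleep in production
) -> ToolResult:
    """Retry only on rate_limited, and never call the tool more than max_attempts times."""
    result = tool()
    for _ in range(max_attempts - 1):
        if result.status != "rate_limited":
            break  # ok, empty, and auth_error are final: retrying will not help
        sleep(result.retry_after_s or 1)
        result = tool()
    return result  # may still be rate_limited; the caller sees a typed status, not a trace
```

Injecting `sleep` keeps the wrapper testable without real waiting, and returning the last typed result (instead of raising) leaves the decision about escalation to the caller.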

3. Dual control for high-stakes actions

Banking: moving money above a threshold requires two authorizations. No single key, no single human, can do it alone.

Agent parallel: any action with financial, legal, or customer-facing impact must go through a human-in-the-loop checkpoint OR a deterministic validator agent — not the same agent that proposed it. This is the single biggest gap in enterprise agent deployments I audit. Teams ship agents that auto-send email, auto-commit code, auto-file tickets without a second signature.

LangGraph's interrupt primitive is the mechanism. Use it on: external communications, writes to systems of record, anything reversible only by apology.
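The rule "not the same agent that proposed it" can be enforced in plain code, independent of any framework. The sketch below is not LangGraph's API; `ProposedAction`, `approve`, `execute`, `DualControlError`, and the `HIGH_STAKES` set are hypothetical names used only to show the shape of the check.

```python
from dataclasses import dataclass, field
from typing import Callable


class DualControlError(Exception):
    """Raised when an action lacks a valid second signature."""


# Hypothetical action kinds that need a second authorization.
HIGH_STAKES = {"send_email", "write_record"}


@dataclass
class ProposedAction:
    kind: str
    payload: dict
    proposed_by: str
    approvals: set = field(default_factory=set)


def approve(action: ProposedAction, approver: str) -> None:
    """Record an approval, rejecting self-approval by the proposer."""
    if approver == action.proposed_by:
        raise DualControlError("proposer cannot approve its own action")
    action.approvals.add(approver)


def execute(action: ProposedAction, effect: Callable[[dict], object]) -> object:
    """Run the side effect only if the dual-control requirement is met."""
    if action.kind in HIGH_STAKES and not action.approvals:
        raise DualControlError(f"{action.kind} requires a second signature")
    return effect(action.payload)
```

The approver can be a human reviewer or a deterministic validator; the only thing the gate insists on is that it is a different identity from the proposer.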

4. Audit trails as evidence, not logs

Banking: every action produces evidence that would hold up in a regulator's office. That means signed, timestamped, tamper-evident — not just "we have logs".

Agent parallel: your eval harness and your production trace store are the same artifact, viewed at different points in time. Every agent run should write: input, prompt template hash, model name + version, tool invocations with arguments and returns, final output, and a chain-of-custody timestamp. OpenTelemetry + LangSmith is adequate. A Postgres table with JSONB columns and an append-only constraint is adequate. Grep over transcripts in Slack is not.

Why it matters: regressions in LLM systems are almost always invisible in aggregate metrics but obvious in specific traces. Without replayable evidence you diagnose by vibe.
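One way to get the tamper-evident property is a hash chain, where each entry commits to the one before it. The sketch below is a minimal illustration using only the standard library; `append_entry`, `verify`, and the record field names are hypothetical, and it covers tamper evidence only. Signing and trusted timestamps would need to be layered on top.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON form of the record."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list, record: dict) -> dict:
    """Append a record, chaining it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"record": record, "prev_hash": prev_hash, "hash": _digest(prev_hash, record)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


def template_hash(template: str) -> str:
    """Hash of the prompt template, stored in place of the full text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()
```

A record for one run might carry `input`, `prompt_template_hash`, `model`, `tool_calls`, `output`, and `ts`, matching the fields listed above. Anyone holding the log can run `verify` without trusting whoever wrote it.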

5. Recovery playbooks for failure, not just success

Banking: every custodial service has a runbook for key compromise, key loss, operator compromise. The playbook is written before the incident, not after.

Agent parallel: before shipping, write down the failure modes you'll actually see in production and the recovery action for each:

  • Model deprecated / rate-limited → fallback model routing
  • Tool returning malformed output → circuit breaker + escalation
  • Prompt injection detected → trajectory abort + audit entry
  • State store unavailable → read-only mode, not silent degradation

If you can't name three plausible failure modes with written recovery paths, you are not ready for production.
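A playbook like the list above can also live in code, so that a failure mode without a recovery path is caught at startup and not during the incident. This is a sketch under assumptions: the `Failure` enum, the `PLAYBOOK` mapping, and the action names are hypothetical labels for the four cases listed, not a prescribed taxonomy.

```python
from enum import Enum


class Failure(Enum):
    MODEL_UNAVAILABLE = "model_unavailable"          # deprecated or rate-limited
    MALFORMED_TOOL_OUTPUT = "malformed_tool_output"
    PROMPT_INJECTION = "prompt_injection"
    STATE_STORE_DOWN = "state_store_down"


# Hypothetical recovery action names; each would map to a real handler.
PLAYBOOK = {
    Failure.MODEL_UNAVAILABLE: "route_to_fallback_model",
    Failure.MALFORMED_TOOL_OUTPUT: "open_circuit_breaker_and_escalate",
    Failure.PROMPT_INJECTION: "abort_trajectory_and_write_audit_entry",
    Failure.STATE_STORE_DOWN: "enter_read_only_mode",
}


def check_playbook_complete(playbook: dict) -> None:
    """Raise if any declared failure mode has no written recovery path."""
    missing = [f.name for f in Failure if f not in playbook]
    if missing:
        raise RuntimeError(f"no recovery path for: {missing}")


def recovery_for(failure: Failure) -> str:
    """Look up the recovery action for a failure mode."""
    return PLAYBOOK[failure]


# Run the completeness check when the module loads.
check_playbook_complete(PLAYBOOK)
```

Adding a new member to `Failure` without a matching `PLAYBOOK` entry makes the module fail to load, which forces the recovery path to be written first.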

The category frame

I call this verifiable agent systems for regulated domains. It's not "evals" (Hamel Husain owns that framing well, and I recommend his work). It's narrower and more specific: agent systems engineered from day one around the same constraints that banking custody imposes — durability, determinism, dual control, evidentiary audit, recovery.

It's where I think the next wave of enterprise LLM adoption will sit, because the obvious low-hanging fruit (customer support bots, internal knowledge Q&A) is largely won by horizontal platforms. The remaining budget moves into workflows where the blast radius of a bad action is real — which means the evaluation bar moves from "does it sound right" to "can we prove why it did what it did".

If you're evaluating an agent system against those five disciplines and finding gaps, write me. Happy to talk.
