# What Bank-Grade Key Management Teaches You About Agent Eval Harnesses

> Five disciplines from banking security (durable state, deterministic failure modes, dual control, audit trails, and recovery playbooks) applied to LLM agent evaluation.

Published: 2026-04-18
Canonical: https://gianlucamazza.it/en/blog/bank-grade-agent-evals
Tags: Agent Evals, Verifiable Systems, LLM Production, Banking, MCP

## Why this matters

Many agent systems I see are evaluated like demos: a human reads a transcript, accepts the result, and ships. In regulated domains — banking, fintech, compliance, insurance — I would not ship that pattern. The useful question is not "does the agent sound right?" It is: "can I prove, after the fact, why it did what it did — and rerun the same inputs to reproduce it?"

I spent most of my pre-2024 years building wallet architectures, Lightning UI prototypes, and custodial solutions for banks. The mental model I kept was not "cryptography for its own sake". It was **systems designed so that any claim can be verified by someone who doesn't trust you**. That model maps closely to production LLM agents. It is also a gap I often see in teams coming from web-app engineering.

Below are five disciplines I carry over, with the agent-system mapping.

## 1. Durable state, not transient context

**Banking:** a key never lives only in memory. It exists in an HSM, a backup HSM, an encrypted shard set, and an audit log that records every time it was touched. I do not treat ephemeral state as trusted state.

**Agent parallel:** I persist agent state before any tool call that has side effects, not after. LangGraph's checkpointer is a practical baseline. Without durable state, I cannot answer "what did the agent know at step N?" — a question I often see in post-incident reviews.

```python
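# Not this: state exists only in process memory
result = agent.invoke(inputs)

# This: compile the graph with a checkpointer so each step is persisted.
# `builder` (a StateGraph) and `checkpointer` are assumed to exist already.
graph = builder.compile(checkpointer=checkpointer)
result = graph.invoke(
    inputs,
    config={"configurable": {"thread_id": session_id}},  # keys the saved state
)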
```

Cost: added latency and storage per step. Benefit: I can replay, audit, and debug every trajectory.
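The ordering rule (persist first, then act) can be sketched without LangGraph. This is a minimal illustration that assumes a SQLite table as the store; the schema and the helper names `open_store`, `save_checkpoint`, `load_checkpoint`, and `run_tool_with_checkpoint` are hypothetical, not part of any library.

```python
import json
import sqlite3
import time


def open_store(path=":memory:"):
    """Open a SQLite-backed checkpoint store (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT, step INTEGER, saved_at REAL, state TEXT, "
        "PRIMARY KEY (thread_id, step))"
    )
    return conn


def save_checkpoint(conn, thread_id, step, state):
    """Persist the state and commit before the caller does anything else."""
    conn.execute(
        "INSERT INTO checkpoints VALUES (?, ?, ?, ?)",
        (thread_id, step, time.time(), json.dumps(state, sort_keys=True)),
    )
    conn.commit()


def load_checkpoint(conn, thread_id, step):
    """Answer 'what did the agent know at step N?'."""
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ? AND step = ?",
        (thread_id, step),
    ).fetchone()
    return json.loads(row[0]) if row else None


def run_tool_with_checkpoint(conn, thread_id, step, state, tool):
    """Write the checkpoint first, then run the side-effecting tool."""
    save_checkpoint(conn, thread_id, step, state)
    return tool(state)
```

Because `(thread_id, step)` is the primary key, a second write for the same step raises an integrity error instead of silently overwriting what the agent knew at that point.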

## 2. Deterministic failure modes

**Banking:** key operations either succeed or fail cleanly. I do not want a partial success path. A partial signature is a security incident.

**Agent parallel:** I make every tool contract expose typed, exhaustive failure modes. No `except Exception: pass`. No "just let the LLM retry" without explicit termination. I use Pydantic schemas for every tool I/O, with a closed set of status values for the failure cases the LLM is expected to handle:

```python
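from typing import Literal
from pydantic import BaseModel


class SearchResult(BaseModel):  # Hit is the tool's own result model, defined elsewhere
    status: Literal["ok", "rate_limited", "empty", "auth_error"]
    hits: list[Hit] = []
    retry_after_s: int | None = None  # set when status == "rate_limited"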
```

The LLM can reason about `status: rate_limited` → wait. It cannot reason reliably about an unstructured exception trace.
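"Explicit termination" can be made concrete with a bounded retry loop around the typed result. The sketch below uses a standard-library dataclass as a stand-in for the Pydantic model so it runs without dependencies; `call_with_termination`, its `max_attempts` default, and the injectable `sleep` are assumptions for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Literal, Optional

Status = Literal["ok", "rate_limited", "empty", "auth_error"]


@dataclass
class SearchResult:
    """Dataclass stand-in for the Pydantic schema above."""
    status: Status
    hits: list = field(default_factory=list)
    retry_after_s: Optional[int] = None


def call_with_termination(tool, max_attempts=3, sleep=time.sleep):
    """Call tool() until it returns a terminal status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        result = tool()
        if result.status != "rate_limited":
            return result  # ok, empty, and auth_error are all terminal
        if attempt < max_attempts - 1:
            sleep(result.retry_after_s or 1)  # wait only if we will retry
    return result  # still rate_limited: the caller must escalate
```

Only `rate_limited` is retried. `auth_error` and `empty` return immediately, because waiting does not fix either of them.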

## 3. Dual control for high-stakes actions

**Banking:** moving money above a threshold requires two authorizations. No single key, and no single human, can do it alone.

**Agent parallel:** I treat any action with financial, legal, or customer-facing impact as requiring a **human-in-the-loop checkpoint** or a separate validation agent with fixed rules — not the same agent that proposed it. I often see this gap in enterprise agent deployments. Teams ship agents that auto-send email, auto-commit code, or auto-file tickets without a second signature.

LangGraph's `interrupt` primitive is the mechanism I use to pause a run until someone approves it. I apply it to external communications, writes to systems of record, and anything that affects external systems or customer-visible state.
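The dual-control rule itself is small enough to sketch in plain Python. This is not LangGraph's `interrupt` API; it only illustrates the invariant that the proposer cannot be the approver. `ProposedAction`, `HIGH_STAKES`, `approve`, and `execute` are hypothetical names.

```python
from dataclasses import dataclass, field

# Hypothetical list of action kinds that need a second signature.
HIGH_STAKES = {"send_email", "write_ledger", "commit_code"}


@dataclass
class ProposedAction:
    kind: str          # e.g. "send_email"
    payload: dict
    proposed_by: str   # identity of the proposing agent
    approvals: set = field(default_factory=set)


def approve(action, approver):
    """Record an approval; the proposer can never approve its own action."""
    if approver == action.proposed_by:
        raise PermissionError("proposer cannot approve its own action")
    action.approvals.add(approver)


def execute(action, effect):
    """Run the side effect only if a high-stakes action has an approval."""
    if action.kind in HIGH_STAKES and not action.approvals:
        raise PermissionError(f"{action.kind} requires a second approval")
    return effect(action.payload)
```

The approver can be a human or a rules-based validator. What matters is that its identity differs from `proposed_by` and that `execute` refuses to run without it.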

## 4. Audit trails as evidence, not logs

**Banking:** every action produces evidence that would be useful in a regulator review. That means signed, timestamped, and tamper-evident — not just "there are logs".

**Agent parallel:** I want the eval harness and the production trace store to share the same schema and evidence model over time. I want every agent run to write: input, prompt template hash, model name + version, tool invocations with arguments and returns, final output, and a chain-of-custody timestamp. I have used OpenTelemetry + LangSmith in some teams as a workable baseline. I have also found a Postgres table with `jsonb` columns and an append-only constraint workable in narrower setups. I do not treat Slack transcripts as a reliable evidence store.

Why it matters: regressions in LLM systems can be hard to see in aggregate metrics but visible in specific traces. Without replayable evidence, I am left with guesswork during diagnosis.
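One way to get from "logs" to "tamper-evident" is a hash chain, where each record commits to the one before it. The sketch below is an assumption about how that could look, not a description of any particular trace store; `append_record`, `verify_chain`, and the record layout are hypothetical. A hash chain alone does not replace signing or an external anchor, since anyone who can rewrite the whole chain can also recompute it.

```python
import hashlib
import json
import time

GENESIS = "0" * 64  # placeholder hash for the first record's predecessor


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the record body."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def append_record(chain, run):
    """Append a run record linked to the previous record's hash."""
    body = {
        "prev_hash": chain[-1]["hash"] if chain else GENESIS,
        "recorded_at": time.time(),
        "run": run,
    }
    record = dict(body, hash=_digest(body))
    chain.append(record)
    return record


def verify_chain(chain):
    """Return True only if every hash and every link is intact."""
    prev = GENESIS
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != record["hash"]:
            return False
        prev = record["hash"]
    return True
```

The `run` payload would carry the fields listed above: input, prompt template hash, model name and version, tool invocations, and final output. Editing any earlier record changes its digest and breaks verification from that point on.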

## 5. Recovery playbooks for failure, not just success

**Banking:** every custodial service has a runbook for key compromise, key loss, and operator compromise. I want the playbook written before the incident, not after.

**Agent parallel:** before I ship, I write down the failure modes I expect to see in production and the recovery action for each:

- Model deprecated / rate-limited → fallback model routing
- Tool returning malformed output → circuit breaker + escalation
- Prompt injection detected → trajectory abort + audit entry
- State store unavailable → read-only mode, not silent degradation

If I cannot name three plausible failure modes with written recovery paths, I would not ship the system.
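A playbook can also be checked mechanically before release. The sketch below encodes the four failure modes listed above as an enum and flags any that lack a written recovery action; the names `Failure`, `PLAYBOOK`, `missing_recoveries`, and `recovery_for` are hypothetical.

```python
from enum import Enum


class Failure(Enum):
    MODEL_UNAVAILABLE = "model deprecated or rate-limited"
    MALFORMED_TOOL_OUTPUT = "tool returned malformed output"
    PROMPT_INJECTION = "prompt injection detected"
    STATE_STORE_DOWN = "state store unavailable"


# Recovery actions mirror the list above.
PLAYBOOK = {
    Failure.MODEL_UNAVAILABLE: "route to fallback model",
    Failure.MALFORMED_TOOL_OUTPUT: "open circuit breaker and escalate",
    Failure.PROMPT_INJECTION: "abort trajectory and write audit entry",
    Failure.STATE_STORE_DOWN: "switch to read-only mode",
}


def missing_recoveries(playbook):
    """List failure modes that have no written recovery action."""
    return [f for f in Failure if not playbook.get(f)]


def recovery_for(failure, playbook=PLAYBOOK):
    """Look up the recovery action; a missing entry is itself an error."""
    try:
        return playbook[failure]
    except KeyError:
        raise LookupError(f"no recovery path written for {failure.name}") from None
```

Running `missing_recoveries(PLAYBOOK)` in CI turns "I would not ship the system" into a failing check rather than a judgment call.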

## The term I use

I call this **verifiable agent systems for regulated domains**. It is not "evals" (Hamel Husain owns that framing well, and I recommend his work). It is narrower: agent systems engineered from day one around constraints that banking custody also imposes — durability, determinism, dual control, evidentiary audit, recovery.

My view is that some enterprise LLM work will move into this category as lower-risk use cases, such as customer support bots and internal knowledge Q&A, become well covered by horizontal platforms. For workflows where errors have material impact, I change the evaluation question from "does it sound right" to "can I prove why it did what it did".

If you're evaluating an agent system against those five disciplines and finding gaps, [write to me](#contact).

## FAQ

### Why persist agent state before side-effecting tool calls?

I persist agent state before any tool call that has side effects because otherwise I cannot answer what the agent knew at step N. Durable state adds latency and storage per step, but it makes replay, audit, and trajectory debugging possible after an incident.

### How should tool failures be represented for an LLM agent?

I make every tool contract expose typed, exhaustive failure modes instead of unstructured exceptions or silent catches. With a schema such as a `status` field limited to `ok`, `rate_limited`, `empty`, or `auth_error`, I give the LLM expected cases it can handle, such as waiting after rate limiting.

### When should an agent require human-in-the-loop approval?

I require a human-in-the-loop checkpoint or a separate validation agent with fixed rules for actions with financial, legal, or customer-facing impact. For me, that includes external communications, writes to systems of record, and anything that affects external systems or customer-visible state.

### What evidence should an agent run record for auditability?

I want every run to record the input, prompt template hash, model name and version, tool invocations with arguments and returns, final output, and a chain-of-custody timestamp. I do not rely on logs alone; I want the trace to support replayable evidence for diagnosis.

### What recovery paths should exist before production release?

Before I ship, I write down plausible production failure modes and the recovery action for each. Examples include fallback model routing for deprecation or rate limits, a circuit breaker and escalation for malformed tool output, trajectory abort on prompt injection, and read-only mode when the state store is unavailable.