Bank-Grade Key Management Lessons for Agent Evals

Why this matters

Many agent systems I see are evaluated like demos: a human reads a transcript, accepts the result, and ships. In regulated domains — banking, fintech, compliance, insurance — I would not ship that pattern. The useful question is not "does the agent sound right?" It is: "can I prove, after the fact, why it did what it did — and rerun the same inputs to reproduce it?"

Before 2024, I spent most of my working years building wallet architectures, Lightning UI prototypes, and custodial solutions for banks. The mental model I kept was not "cryptography for its own sake". It was that systems should be designed so that any claim can be verified by someone who doesn't trust you. That model maps closely to production LLM agents. It is also a gap I often see in teams coming from web-app engineering.

Below are five disciplines I carry over, with the agent-system mapping.

1. Durable state, not transient context

Banking: a key never lives only in memory. It exists in an HSM, a backup HSM, an encrypted shard set, and an audit log that records every time it was touched. I do not treat ephemeral state as trusted state.

Agent parallel: I persist agent state before any tool call that has side effects, not after. LangGraph's checkpointer is a practical baseline. Without durable state, I cannot answer "what did the agent know at step N?" — a question I often see in post-incident reviews.

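# Not this:
result = agent.invoke(input)  # state only in RAM

# This (assumes `graph` was compiled with a checkpointer,
# e.g. builder.compile(checkpointer=...)):
result = graph.invoke(
    input,
    config={"configurable": {"thread_id": session_id}},  # state persisted per thread
)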

Cost: added latency and storage per step. Benefit: I can replay, audit, and debug every trajectory.
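To make the "before, not after" ordering concrete without depending on any framework, here is a minimal sketch that uses only sqlite3 from the standard library. The table layout and the names open_store, save_checkpoint, load_checkpoint, and run_side_effecting_tool are assumptions for illustration, not LangGraph APIs.

```python
import json
import sqlite3
from typing import Any, Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) checkpoint table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        " thread_id TEXT NOT NULL,"
        " step INTEGER NOT NULL,"
        " state TEXT NOT NULL,"
        " PRIMARY KEY (thread_id, step))"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, thread_id: str, step: int, state: dict) -> None:
    """Write the state for one step; the primary key rejects silent overwrites."""
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO checkpoints (thread_id, step, state) VALUES (?, ?, ?)",
            (thread_id, step, json.dumps(state, sort_keys=True)),
        )


def load_checkpoint(conn: sqlite3.Connection, thread_id: str, step: int) -> dict:
    """Answer 'what did the agent know at step N?' from durable storage."""
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ? AND step = ?",
        (thread_id, step),
    ).fetchone()
    if row is None:
        raise KeyError((thread_id, step))
    return json.loads(row[0])


def run_side_effecting_tool(
    conn: sqlite3.Connection,
    thread_id: str,
    step: int,
    state: dict,
    tool: Callable[[dict], Any],
) -> Any:
    """Persist first, then call the tool, so a crash still leaves evidence."""
    save_checkpoint(conn, thread_id, step, state)
    return tool(state)
```

If the tool raises or the process dies mid-call, the checkpoint for that step is already committed, so the state the agent acted on can still be loaded during a review.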

2. Deterministic failure modes

Banking: key operations either succeed or fail cleanly. I do not want a partial success path. A partial signature is a security incident.

Agent parallel: I make every tool contract expose typed, exhaustive failure modes. No "except Exception: pass". No "just let the LLM retry" without explicit termination. I use Pydantic schemas for every tool I/O, with an explicit status field (or union types) covering the failure cases the LLM is expected to handle:

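from typing import Literal
from pydantic import BaseModel

class SearchResult(BaseModel):  # Hit is the tool's own result model, defined elsewhere
    status: Literal["ok", "rate_limited", "empty", "auth_error"]
    hits: list[Hit] = []
    retry_after_s: int | None = None  # set only when status is "rate_limited"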

The LLM can reason about status: rate_limited → wait. It cannot reason reliably about an unstructured exception trace.
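The "explicit termination" point can be sketched as a bounded retry loop keyed on the status field. This version uses a standard-library dataclass in place of the Pydantic model so it runs without dependencies; ToolResult, call_with_bounded_retries, and the max_attempts default are hypothetical names and values chosen for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Literal, Optional

Status = Literal["ok", "rate_limited", "empty", "auth_error"]


@dataclass
class ToolResult:
    """Stand-in for the Pydantic schema above, with plain-string hits."""
    status: Status
    hits: List[str] = field(default_factory=list)
    retry_after_s: Optional[int] = None


def call_with_bounded_retries(
    tool: Callable[[], ToolResult],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> ToolResult:
    """Retry only on rate_limited, and never more than max_attempts calls."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result = tool()
    attempts = 1
    while result.status == "rate_limited" and attempts < max_attempts:
        # Honor the tool's hint; fall back to one second if it gave none.
        sleep(result.retry_after_s or 1)
        result = tool()
        attempts += 1
    # Any other status (ok, empty, auth_error) is returned as-is for the caller.
    return result
```

The loop retries on exactly one status and returns everything else untouched, so an auth_error is never retried and a persistent rate limit surfaces as a typed result instead of an endless loop.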

3. Dual control for high-stakes actions

Banking: moving money above a threshold requires two authorizations. No single key, and no single human, can do it alone.

Agent parallel: I treat any action with financial, legal, or customer-facing impact as requiring a human-in-the-loop checkpoint or a separate validation agent with fixed rules — not the same agent that proposed it. I often see this gap in enterprise agent deployments. Teams ship agents that auto-send email, auto-commit code, or auto-file tickets without a second signature.

LangGraph's interrupt primitive is the mechanism I use. I use it for external communications, writes to systems of record, and anything that affects external systems or customer-visible state.
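Independent of any framework, the dual-control rule itself is small enough to sketch in plain Python. Everything here is an assumption for illustration: the action kinds in HIGH_STAKES_KINDS, the ProposedAction and Approval shapes, and the authorize and execute helpers. A real deployment would wire the approval step to an interrupt or review queue.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Hypothetical set of action kinds that need a second authorization.
HIGH_STAKES_KINDS = frozenset({"send_email", "write_record", "move_funds"})


@dataclass(frozen=True)
class ProposedAction:
    kind: str
    proposed_by: str
    payload: str


@dataclass(frozen=True)
class Approval:
    approved_by: str


class DualControlError(Exception):
    """Raised when a high-stakes action lacks an independent approval."""


def authorize(action: ProposedAction, approval: Optional[Approval]) -> None:
    """Fixed rules only: no model judgment is involved in this check."""
    if action.kind not in HIGH_STAKES_KINDS:
        return  # low-stakes actions proceed without a second signature
    if approval is None:
        raise DualControlError(f"{action.kind} requires a second authorization")
    if approval.approved_by == action.proposed_by:
        raise DualControlError("the proposer cannot approve its own action")


def execute(
    action: ProposedAction,
    approval: Optional[Approval],
    effect: Callable[[ProposedAction], Any],
) -> Any:
    """Run the side effect only after the authorization check passes."""
    authorize(action, approval)
    return effect(action)
```

The key property is that the identity that proposed the action can never be the one that approves it, which mirrors the "no single key, no single human" rule.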

4. Audit trails as evidence, not logs

Banking: every action produces evidence that would be useful in a regulator review. That means signed, timestamped, and tamper-evident — not just "there are logs".

Agent parallel: I want the eval harness and the production trace store to share the same schema and evidence model over time. I want every agent run to write: input, prompt template hash, model name + version, tool invocations with arguments and returns, final output, and a chain-of-custody timestamp. I have used OpenTelemetry + LangSmith in some teams as a workable baseline. I have also found a Postgres table with jsonb columns and an append-only constraint workable in narrower setups. I do not treat Slack transcripts as a reliable evidence store.

Why it matters: regressions in LLM systems can be hard to see in aggregate metrics but visible in specific traces. Without replayable evidence, I am left with guesswork during diagnosis.
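One way to get the tamper-evident property is a hash chain, where each entry commits to the one before it. The sketch below is a minimal in-memory version under stated assumptions: the entry layout, the GENESIS value, and the function names are hypothetical, and it covers timestamping and tamper evidence only, not signing.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List

GENESIS = "0" * 64  # hypothetical placeholder hash for the first entry


def _digest(prev_hash: str, record: Dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON body."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict], record: Dict) -> Dict:
    """Append a timestamped record that commits to the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    stamped = {**record, "recorded_at": datetime.now(timezone.utc).isoformat()}
    entry = {
        "record": stamped,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, stamped),
    }
    log.append(entry)
    return entry


def verify_chain(log: List[Dict]) -> bool:
    """Recompute every hash; an edited, reordered, or dropped entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Changing any earlier record invalidates its own hash and the link from the next entry, so a reviewer can detect edits without trusting whoever operates the store. Signing the latest hash with a key held elsewhere would be the next step toward the "signed" requirement.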

5. Recovery playbooks for failure, not just success

Banking: every custodial service has a runbook for key compromise, key loss, and operator compromise. I want the playbook written before the incident, not after.

Agent parallel: before I ship, I write down the failure modes I expect to see in production and the recovery action for each:

Model deprecated / rate-limited → fallback model routing
Tool returning malformed output → circuit breaker + escalation
Prompt injection detected → trajectory abort + audit entry
State store unavailable → read-only mode, not silent degradation

If I cannot name three plausible failure modes with written recovery paths, I would not ship the system.
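The ship gate above can be turned into a check that runs in CI. This is a sketch only: the FailureMode enum, the PLAYBOOK mapping, and the ready_to_ship helper are hypothetical names, and the recovery strings simply restate the list above.

```python
from enum import Enum
from typing import Dict, List


class FailureMode(Enum):
    MODEL_UNAVAILABLE = "model deprecated or rate-limited"
    MALFORMED_TOOL_OUTPUT = "tool returning malformed output"
    PROMPT_INJECTION = "prompt injection detected"
    STATE_STORE_UNAVAILABLE = "state store unavailable"


# Written recovery path for each named failure mode.
PLAYBOOK: Dict[FailureMode, str] = {
    FailureMode.MODEL_UNAVAILABLE: "fallback model routing",
    FailureMode.MALFORMED_TOOL_OUTPUT: "circuit breaker + escalation",
    FailureMode.PROMPT_INJECTION: "trajectory abort + audit entry",
    FailureMode.STATE_STORE_UNAVAILABLE: "read-only mode, not silent degradation",
}


def missing_playbooks(playbook: Dict[FailureMode, str]) -> List[FailureMode]:
    """Return every named failure mode that has no non-empty recovery path."""
    return [mode for mode in FailureMode if not playbook.get(mode)]


def ready_to_ship(playbook: Dict[FailureMode, str], minimum: int = 3) -> bool:
    """Require at least `minimum` written paths and no uncovered failure mode."""
    return len(playbook) >= minimum and not missing_playbooks(playbook)
```

Adding a new member to FailureMode without a matching playbook entry makes ready_to_ship return False, which keeps the runbook from drifting behind the code.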

The term I use

I call this verifiable agent systems for regulated domains. It is not "evals" (Hamel Husain owns that framing well, and I recommend his work). It is narrower: agent systems engineered from day one around constraints that banking custody also imposes — durability, determinism, dual control, evidentiary audit, recovery.

My view is that some enterprise LLM work will move into this category as lower-risk use cases, such as customer support bots and internal knowledge Q&A, become well covered by horizontal platforms. For workflows where errors have material impact, I change the evaluation question from "does it sound right" to "can I prove why it did what it did".

If you're evaluating an agent system against those five disciplines and finding gaps, write to me.

FAQ

Why persist agent state before side-effecting tool calls?

I persist agent state before any tool call that has side effects because otherwise I cannot answer what the agent knew at step N. Durable state adds latency and storage per step, but it makes replay, audit, and trajectory debugging possible after an incident.

How should tool failures be represented for an LLM agent?

I make every tool contract expose typed, exhaustive failure modes instead of unstructured exceptions or silent catches. With a schema such as a status field for ok, rate_limited, empty, or auth_error, I give the LLM expected cases it can handle, such as waiting after rate limiting.

When should an agent require human-in-the-loop approval?

I require a human-in-the-loop checkpoint or a separate validation agent with fixed rules for actions with financial, legal, or customer-facing impact. For me, that includes external communications, writes to systems of record, and anything that affects external systems or customer-visible state.

What evidence should an agent run record for auditability?

I want every run to record the input, prompt template hash, model name and version, tool invocations with arguments and returns, final output, and a chain-of-custody timestamp. I do not rely on logs alone; I want the trace to support replayable evidence for diagnosis.

What recovery paths should exist before production release?

Before I ship, I write down plausible production failure modes and the recovery action for each. Examples include fallback model routing for deprecation or rate limits, a circuit breaker and escalation for malformed tool output, trajectory abort on prompt injection, and read-only mode when the state store is unavailable.
