2025-01-08 · 12 min read
LangGraph · LLM · Multi-Agent · Orchestration

State as the API: LangGraph After Three Rewrites

The state schema is the most consequential design decision in LangGraph. Three iterations on how to model it — and why channels with reducers is the right primitive.

Why this matters

LangGraph tutorials show you the graph. What they don't show is what happens six weeks into a production multi-agent system when you've added fifteen fields to your state dict, three agents are writing to the same key with different assumptions, and a checkpoint replays a value that was valid two hours ago but is nonsense now.

I rewrote the state model for the same system three times in four months. Not because LangGraph changed its API — it didn't — but because my first two designs were wrong in ways I couldn't see until the system was running under real load. Each rewrite cost us a week of migration. The third design has survived eight months and four additional agents without touching the schema contract.

The state schema is the API between your agents. If you design it the way most tutorials suggest — as a mutable bag of values — you will pay for it. Here's what I learned.

1. The first iteration: flat dict and why it collapses

The first design looked perfectly reasonable. A plain TypedDict with a dozen keys, each agent reads what it needs, writes what it produces. Clean, simple, tutorial-approved.

from typing import TypedDict

class AgentState(TypedDict):
    messages: list[str]
    research_results: list[dict]
    draft: str
    critique: str
    iteration_count: int
    should_continue: bool

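What broke: two agents wrote to messages, one appending tool results and one appending user messages. Each returned its own version of the list, and a key without a reducer keeps only the most recent value, so one agent's messages replaced the other's depending on which write landed last. (When both writes land in the same step, LangGraph rejects the update with an InvalidUpdateError rather than picking a winner.) should_continue was set to True by the planner and overwritten to False by the critic running one step later, for the same reason: the last write wins.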

The deeper issue: a flat dict with implicit write semantics turns every node into a potential race condition. It works in the happy path. It fails the moment two nodes touch the same field in the same step, or when a node assumes a field wasn't modified since it last read it.
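A minimal sketch of that failure in plain Python, not LangGraph itself. It models a state in which every key simply takes the most recent write. The `apply_overwrite` helper and both node functions are hypothetical names used only for illustration.

```python
from typing import TypedDict


class AgentState(TypedDict):
    messages: list[str]
    should_continue: bool


def apply_overwrite(state: dict, update: dict) -> dict:
    """Model of implicit write semantics: every key takes the latest value."""
    return {**state, **update}


def tool_node(state: dict) -> dict:
    # Builds its list from the snapshot it was handed.
    return {"messages": state["messages"] + ["tool: 3 results"]}


def critic_node(state: dict) -> dict:
    # Also builds from the same snapshot, and flips the routing flag.
    return {
        "messages": state["messages"] + ["critic: too long"],
        "should_continue": False,
    }


# The planner has already set should_continue to True.
snapshot = {"messages": ["user: draft a summary"], "should_continue": True}

# Both nodes read the same snapshot; their outputs are applied one after the other.
state = apply_overwrite(snapshot, tool_node(snapshot))
state = apply_overwrite(state, critic_node(snapshot))

print(state["messages"])         # the tool result is no longer in the list
print(state["should_continue"])  # the planner's True has been replaced
```

Neither node did anything wrong in isolation. The loss comes entirely from the merge rule, which is why patching individual nodes does not help.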

The fix isn't defensive reading or node coordination. The fix is changing how state updates are applied — moving from overwrite semantics to explicit merge semantics per field. You can't bolt that onto a flat dict after the fact without migrating every node that touches state.

We tried patching it. We added a last_writer field to track who modified what last, we added node-level locks, we added a "claim" convention where a node would clear a field before writing to it. None of it worked reliably. The problem was the design, not the implementation.

2. The second iteration: nested Pydantic and the rigidity problem

My second attempt went the other direction. I wrapped everything in Pydantic models, nested them, and added validation everywhere.

from typing import TypedDict
from pydantic import BaseModel

# Source and Critique are Pydantic models defined elsewhere in the project.
class ResearchState(BaseModel):
    query: str
    sources: list[Source]
    confidence: float

class DraftState(BaseModel):
    content: str
    word_count: int
    critique_history: list[Critique]

class AgentState(TypedDict):
    research: ResearchState | None
    draft: DraftState | None
    final_output: str | None

This solved the write-conflict problem — each agent owned its section of the schema. But it created a different failure mode: conditional edge functions had to inspect research is not None and draft is not None and draft.critique_history[-1].approved before routing. Three weeks in, those edge predicates had become load-bearing business logic, tested nowhere, readable by nobody.

The other problem: None as a sentinel is semantically ambiguous in LangGraph. Does research: None mean "not started", "failed", or "intentionally skipped"? We were embedding workflow state into data state, which meant changing the workflow required changing the schema — a full migration each time.

When we added a fourth agent that needed to partially override the research results without discarding the sources, we had to add research_override: ResearchState | None because we couldn't safely mutate the existing field. The schema was accumulating fields not because the domain was growing, but because the design had no mechanism for safe partial updates.

Nested Pydantic is the right tool for tool I/O contracts. It is the wrong tool for LangGraph state when your graph structure is still evolving.

3. The third iteration: channels with reducers

The design that worked is the one LangGraph actually documents but most tutorials skip over: Annotated fields with reducer functions.

import operator
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph

# ToolCall is the application's own tool-call record type, defined elsewhere.
class AgentState(TypedDict):
    messages: Annotated[list[str], operator.add]      # append-only
    tool_calls: Annotated[list[ToolCall], operator.add]  # append-only
    current_draft: str                                 # last-write-wins (intentional)
    iteration: Annotated[int, lambda a, b: b]          # always take latest
    approved: bool | None                              # routing signal, write-once per step

What changed: messages and tool_calls use operator.add as their reducer — meaning each node's output is appended to the existing list, not replacing it. If two nodes both add a message in the same step, both messages are preserved, in graph-traversal order. There is no silent overwrite.

current_draft is plain — last-write-wins — because only the writer node ever touches it, and we want the latest value. The channel model makes the write semantics explicit and per-field, not implicit and global.
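To make the per-field rule concrete, here is a toy model of the merge in plain Python. It is a sketch of the idea, not LangGraph's implementation: `apply_update` is a hypothetical helper that looks up each field's reducer from its `Annotated` metadata and falls back to overwrite when there is none.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class AgentState(TypedDict):
    messages: Annotated[list[str], operator.add]  # append-only channel
    current_draft: str                            # no reducer: latest value wins


def apply_update(state: dict, update: dict, schema: type) -> dict:
    """Merge one node's output into state, using each field's reducer if it has one."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated[...] hints carry their extra arguments in __metadata__.
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            reducer = metadata[0]
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["user: draft a summary"], "current_draft": ""}

# Nodes return only what they add; the reducer does the combining.
state = apply_update(state, {"messages": ["tool: 3 results"]}, AgentState)
state = apply_update(
    state, {"messages": ["critic: too long"], "current_draft": "v2"}, AgentState
)

print(state["messages"])
print(state["current_draft"])
```

Note the change in what a node returns: with an append reducer it emits only its new items, never a rebuilt copy of the whole list.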

The shift in mental model is this: you're not designing a data structure, you're designing a message-passing contract. Each field is a channel with a rule for how to combine concurrent writes. Overwrite is a valid rule — it just has to be a deliberate choice, not the default.

This design has three properties the earlier ones lacked. First, it's safe under parallel node execution — the reducer handles merge without coordination logic in the nodes. Second, the schema communicates intent: Annotated[list, operator.add] tells every reader that this field accumulates — nobody will "fix" it by adding an overwrite. Third, the schema is stable under growth — adding a new agent means adding new message types to an append channel, not new top-level fields. The contract between existing agents doesn't change.
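A reducer does not have to be operator.add. It is an ordinary two-argument function that takes the current value and the incoming update and returns the new value. As a hedged sketch, a dict-overlay reducer lets an agent revise part of a field without resending the rest. The `merge_research` function and the `research` field are hypothetical and not part of the schema shown earlier.

```python
from typing import Annotated, TypedDict


def merge_research(current: dict, update: dict) -> dict:
    """Reducer: overlay the update on the current value, key by key."""
    return {**current, **update}


class AgentState(TypedDict):
    # Hypothetical field: partial updates are merged instead of replacing the dict.
    research: Annotated[dict, merge_research]


current = {"query": "llm evals", "sources": ["a.pdf", "b.pdf"], "confidence": 0.4}

# A later agent revises the confidence without touching the sources.
update = {"confidence": 0.8}

merged = merge_research(current, update)
print(merged)
```

The sources survive because the merge rule says so, not because the writing node remembered to copy them forward.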

The migration from the Pydantic design took three days instead of a week. The clarity gain was immediate: every code review comment about state went from "is this safe?" to "is this the right reducer?".

4. Conditional edges are not business logic

After fixing the state model, the next mistake I see consistently — in code reviews and in systems I'm brought in to audit — is treating conditional edge functions as the place to put business logic.

A conditional edge in LangGraph takes state and returns the name of the next node. That's it. One function, one responsibility: "given this state snapshot, which node runs next?"

# Wrong: business logic inside the router
def should_continue(state: AgentState) -> str:
    if state["iteration"] > 5:  # hard-coded limit
        return "end"
    last_message = state["messages"][-1] if state["messages"] else ""
    # parsing free text to infer approval
    if "approved" in last_message.lower() or "looks good" in last_message.lower():
        return "end"
    return "generate"

# Right: routing reads an explicit signal written by a node
# (assumes max_iterations is also a field on AgentState)
def should_continue(state: AgentState) -> str:
    if state.get("approved") or state["iteration"] >= state["max_iterations"]:
        return "end"
    return "generate"

The failure mode: edge functions that do string parsing, database lookups, or multi-step logic are untestable in isolation and undebuggable in traces. When routing goes wrong (and it does, usually in a state combination your unit tests never covered), you can't replay the decision, because the logic lived inside the edge and its result was never written into state.

The rule I follow: if the routing condition involves more than reading a field and comparing a value, the condition belongs in a node that writes a routing signal to state. The edge reads the signal. State is always the source of truth, not the edge function.

A corollary: never route based on the content of a message. Route based on a field that a node explicitly wrote after interpreting the message. The interpreter node is where the ambiguity lives and where you can unit-test it. The edge function should be a lookup table.
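A sketch of that split, under assumptions: `ReviewState`, `interpret_review_node`, `route_after_review`, and `APPROVAL_PHRASES` are hypothetical names, and the phrase matching is deliberately naive. The point is where the ambiguity lives, not how to resolve it.

```python
from typing import Optional, TypedDict


class ReviewState(TypedDict):
    messages: list[str]
    iteration: int
    max_iterations: int
    approved: Optional[bool]


APPROVAL_PHRASES = ("approved", "looks good")


def interpret_review_node(state: ReviewState) -> dict:
    """Node: interpret the latest message once and record the verdict in state."""
    last = state["messages"][-1].lower() if state["messages"] else ""
    return {"approved": any(phrase in last for phrase in APPROVAL_PHRASES)}


def route_after_review(state: ReviewState) -> str:
    """Edge: compares fields only, no interpretation."""
    if state.get("approved") or state["iteration"] >= state["max_iterations"]:
        return "end"
    return "generate"


state: ReviewState = {
    "messages": ["Looks good to me"],
    "iteration": 1,
    "max_iterations": 5,
    "approved": None,
}
state = {**state, **interpret_review_node(state)}
print(route_after_review(state))
```

Because the verdict is now a field, a trace shows exactly what the router saw, and the interpreter can be unit-tested against awkward inputs (a message containing "not approved", for instance, which this naive matcher would get wrong).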

5. Interrupts and the dual-write problem

Human-in-the-loop is one of LangGraph's strongest features. It's also where I've seen the most production incidents — not because the API is wrong, but because the design pattern most teams reach for is subtly broken.

The naive pattern: compile the graph with interrupt_before on the "send email" node. The graph pauses, your UI shows the pending action, the human clicks approve, you call graph.update_state() with the approval, and execution resumes.

The dual-write problem: a checkpoint is written only after a node completes. If the node performs an external side effect and is then executed again (because it failed before the checkpoint landed, was retried, or was replayed from an earlier checkpoint), the side effect happens again. Send an email, crash before the checkpoint is saved, retry: the email has gone out twice before the human's rejection is ever processed.

# Wrong: side effect before interrupt boundary
# (email_client stands in for whatever mail client you use)
def send_draft_node(state: AgentState) -> dict:
    email_client.send(state["draft"], to=state["recipient"])  # side effect happens here
    return {"email_status": "sent"}  # saved only after the send; re-running this node re-sends

# Right: separate intent from execution
def prepare_send_node(state: AgentState) -> dict:
    # Write the intent; no external call yet.
    # The graph interrupts after this node; the human sees and approves pending_action.
    return {"pending_action": {"type": "email", "to": state["recipient"], "body": state["draft"]}}

def execute_send_node(state: AgentState) -> dict:
    action = state.get("pending_action")
    if action and state.get("human_approved"):
        email_client.send(action["body"], to=action["to"])
    # Clear the intent and the approval so a later pass cannot reuse them.
    return {"pending_action": None, "human_approved": None}

The pattern: write the intent to state, interrupt, let the human approve or modify the intent, then execute. Replaying from a checkpoint taken before execution re-runs at most the intent node, which has no side effects. The executor only sends if human_approved is set in state.

This is a specific case of the general rule: checkpoints capture state, not side effects. Design your nodes so that replaying from any checkpoint is safe. If a node has external side effects, those effects must be idempotent, or they must live after the final approval gate, or both. If neither is true, the side effect isn't safe to run inside the graph.
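One way to get the idempotency half of that rule is to derive a key from the approved action and refuse to send the same key twice. The sketch below is an illustration under assumptions: `FakeEmailClient`, `action_key`, and the in-memory `delivered_keys` set are hypothetical stand-ins, and a real system would keep the keys in a durable store or pass them to a provider that supports idempotency keys.

```python
import hashlib
import json


class FakeEmailClient:
    """Stand-in for a real client; records sends instead of delivering them."""

    def __init__(self) -> None:
        self.sent: list[dict] = []

    def send(self, body: str, to: str) -> None:
        self.sent.append({"to": to, "body": body})


email_client = FakeEmailClient()
delivered_keys: set[str] = set()  # assumption: durable storage in a real system


def action_key(action: dict) -> str:
    """Stable key derived from the action's content."""
    payload = json.dumps(action, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def execute_send_node(state: dict) -> dict:
    action = state.get("pending_action")
    if not action or not state.get("human_approved"):
        return {}  # nothing approved, nothing to do
    key = action_key(action)
    if key not in delivered_keys:  # skip if an earlier run already sent it
        email_client.send(action["body"], to=action["to"])
        delivered_keys.add(key)
    return {"email_status": "sent"}


state = {
    "pending_action": {"type": "email", "to": "a@example.com", "body": "Draft v3"},
    "human_approved": True,
}

execute_send_node(state)
execute_send_node(state)  # simulated replay of the same node with the same state

print(len(email_client.sent))
```

The check-then-send here is not atomic, so it narrows the window rather than closing it. A key based only on content would also suppress a legitimate identical email later, so in practice the key usually includes a thread or run identifier as well.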

6. What I checkpoint and what I don't

The LangGraph checkpointer persists state after every super-step by default; in a sequential graph, that means at every node boundary. That's the safe default and the right starting point. It's also a cost model worth understanding before you scale.

In a graph that calls an LLM at every node, each step writes a checkpoint. For a 10-node graph with 100ms per write to a remote store, that's a second of overhead per trajectory. Under high concurrency that becomes visible latency. More importantly, it's unnecessary: you don't need to checkpoint compute-only nodes — you need to checkpoint before and after side effects.

The classification I use for every node before shipping:

  • Has external side effects (API call, DB write, email): always checkpoint before and after. Replay must be safe — design for idempotency or put an approval gate before the first execution.
  • Calls an LLM: checkpoint the inputs before the call. LLM calls are expensive and non-deterministic — if the node fails mid-call, you want to replay with the same inputs, not re-derive them.
  • Pure transform (parsing, formatting, filtering): skip the checkpoint. Replaying a JSON-to-TypedDict conversion is free and correct.
  • Routing node: no checkpoint needed. The routing decision is reproducible from state, which is already checkpointed.

Checkpointing a formatting node and replaying it on failure is harmless but wastes storage and adds latency. Not checkpointing a DB write node and replaying it on failure creates a duplicate write. Knowing which nodes have side effects is the same knowledge you need to design the graph correctly in the first place — so the checkpoint audit is also a correctness audit.

When I audit a LangGraph system: list all nodes, classify each (side effects / LLM call / pure transform / routing). Every node in the first two classes must have a checkpoint. Every node in the last two is a candidate to skip. If you can't classify a node, that's the thing to fix first.
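The audit is mechanical enough to script. The following is a sketch with made-up node names: `NodeKind`, `audit`, and the example classification are hypothetical, and the script only reports on a hand-written classification. It does not inspect a graph or change any checkpointing behavior.

```python
from enum import Enum


class NodeKind(Enum):
    SIDE_EFFECT = "side_effect"
    LLM_CALL = "llm_call"
    PURE_TRANSFORM = "pure_transform"
    ROUTING = "routing"


MUST_CHECKPOINT = {NodeKind.SIDE_EFFECT, NodeKind.LLM_CALL}


def audit(nodes: list[str], classification: dict[str, NodeKind]) -> dict[str, list[str]]:
    """Split nodes into must-checkpoint, may-skip, and unclassified."""
    report: dict[str, list[str]] = {"must_checkpoint": [], "may_skip": [], "unclassified": []}
    for name in nodes:
        kind = classification.get(name)
        if kind is None:
            report["unclassified"].append(name)  # fix these first
        elif kind in MUST_CHECKPOINT:
            report["must_checkpoint"].append(name)
        else:
            report["may_skip"].append(name)
    return report


# Hypothetical graph: five nodes, one of which nobody has classified yet.
nodes = ["research", "parse_sources", "route_review", "execute_send", "summarize"]
classification = {
    "research": NodeKind.LLM_CALL,
    "parse_sources": NodeKind.PURE_TRANSFORM,
    "route_review": NodeKind.ROUTING,
    "execute_send": NodeKind.SIDE_EFFECT,
}

report = audit(nodes, classification)
for bucket, names in report.items():
    print(bucket, names)
```

An empty "unclassified" bucket is the precondition for trusting the other two.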

The category frame

Three rewrites to the same system isn't a story about LangGraph's complexity. LangGraph is a well-designed tool. It's a story about the cost of treating state as an implementation detail instead of the primary interface.

In multi-agent systems, the state schema is the contract between agents. It determines how they can evolve independently, how you add a new agent without breaking existing ones, and how you debug a trajectory three weeks later when a customer files a support ticket. Get it wrong and every new agent becomes a migration risk. Get it right and it's the part of the system that doesn't need to change.

The three iterations reduce to a single rule: design state as a set of channels with explicit merge semantics, not a shared mutable object. Everything else — conditional edge functions, interrupt patterns, checkpoint granularity — follows from having that foundation correct.

If you're building or auditing a LangGraph-based system and running into state conflicts, fragile routing, or interrupt issues, write me. That's the class of problem I find most interesting.
