State as the API: LangGraph After Three Rewrites
The state schema is the most consequential decision in LangGraph. Three iterations on modeling it, and why channels with reducers are the right primitive.
Why this matters
LangGraph tutorials usually show the graph. They often do not show what happens weeks into a production multi-agent system, after the state dict has grown, multiple agents write to the same key, and a checkpoint replays a value that was valid earlier but is invalid now.
I rewrote the state model for the same system three times in four months. LangGraph did not change its API. My first two designs were wrong in ways I only saw under real load.
The third design has held through additional agents without changing the schema contract. The lesson was direct: the state schema is the API between agents. If I treat it as a mutable bag of values, I create migration and recovery problems later.
1. The first iteration: flat dict and why it collapses
The first design looked reasonable. I used a plain TypedDict with a dozen keys. Each agent read what it needed and wrote what it produced. It was clear, simple, and common in tutorials.
from typing import TypedDict

class AgentState(TypedDict):
    messages: list[str]
    research_results: list[dict]
    draft: str
    critique: str
    iteration_count: int
    should_continue: bool
What broke: two agents wrote to messages in parallel, one appending tool results and one appending user messages, and the ordering was not explicit. should_continue was set to True by the planner and then overwritten to False by the critic running one step later, because a state key without a reducer is replaced by each new write, so the last write wins.
The deeper issue was the implicit write model. A flat dict makes every shared field a possible race condition. It works on the happy path. It fails when two nodes touch the same field in the same step, or when a node assumes a field has not changed since it last read it.
I tried to patch the flat design with a last_writer field, node-level locks, and a convention where a node cleared a field before writing to it. Those patches made the graph harder to reason about without fixing the contract. I needed field-level merge semantics instead of overwrite-by-default.
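The overwrite-by-default behavior is easy to reproduce without LangGraph. The sketch below is plain Python, not LangGraph's implementation: apply_overwrite is a hypothetical helper that applies each node's partial update the way a reducer-less key behaves across steps, and the planner and critic updates are made-up values.

```python
from typing import TypedDict


class AgentState(TypedDict):
    messages: list[str]
    should_continue: bool


def apply_overwrite(state: dict, update: dict) -> dict:
    """Hypothetical helper: every written key replaces the old value."""
    return {**state, **update}


state = {"messages": ["user: draft a summary"], "should_continue": False}

# Planner and critic each return a partial update that touches the same keys.
planner_update = {"messages": ["planner: plan ready"], "should_continue": True}
critic_update = {"messages": ["critic: needs work"], "should_continue": False}

state = apply_overwrite(state, planner_update)
state = apply_overwrite(state, critic_update)

print(state["messages"])         # only the critic's message is left
print(state["should_continue"])  # the planner's True has been replaced
```

Nothing in the schema warns a reader that the user message and the planner message were dropped. That silence is what the patches could not fix.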
2. The second iteration: nested Pydantic and the rigidity problem
My second attempt moved in the opposite direction. I wrapped the state in nested Pydantic models and added validation at each boundary.
from typing import TypedDict

from pydantic import BaseModel

# Source and Critique are the project's own Pydantic models (not shown).

class ResearchState(BaseModel):
    query: str
    sources: list[Source]
    confidence: float

class DraftState(BaseModel):
    content: str
    word_count: int
    critique_history: list[Critique]

class AgentState(TypedDict):
    research: ResearchState | None
    draft: DraftState | None
    final_output: str | None
This solved some write-conflict problems because each agent owned a section of the schema. It also moved complexity into conditional edges. Routing functions had to inspect nested state such as research is not None and draft is not None and draft.critique_history[-1].approved before deciding where to go next.
Those edge predicates became load-bearing business logic. They were hard to test in isolation and hard for reviewers to reason about.
The other problem: None is an ambiguous sentinel in LangGraph state. Does research: None mean "not started", "failed", or "intentionally skipped"? I was embedding workflow state into data state. Changing the workflow meant changing the schema, which made migrations more expensive.
When I added another agent that needed to partially override research results without discarding sources, I added research_override: ResearchState | None because I could not safely mutate the existing field. The schema was growing because the update model was too rigid, not because the domain required more concepts.
Nested Pydantic is a good fit for tool I/O contracts. For LangGraph state in a graph whose structure is still evolving, I found it too rigid.
3. The third iteration: channels with reducers
The design that worked is the one LangGraph documents but many tutorials skip: Annotated fields with reducer functions.
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph

# ToolCall is the project's own tool-call record type (not shown).

class AgentState(TypedDict):
    messages: Annotated[list[str], operator.add]         # append-only
    tool_calls: Annotated[list[ToolCall], operator.add]  # append-only
    current_draft: str                                   # last-write-wins (intentional)
    iteration: Annotated[int, lambda a, b: b]            # always take latest
    approved: bool | None                                # routing signal, write-once per step
What changed: messages and tool_calls use operator.add as their reducer, so node outputs are appended instead of replacing the existing list. If two nodes add a message in the same step, both messages are preserved in graph-traversal order. There is no silent overwrite.
current_draft is still last-write-wins because only the writer node touches it, and I want the latest value. The point is not to ban overwrites. The point is to make each field's write behavior intentional.
That changed my mental model. I was not designing a data structure; I was designing a message-passing contract. Each field became a channel with a rule for combining writes.
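A reducer is just a function that takes the channel's current value and the incoming write and returns the merged value, so the rule does not have to be operator.add. As a sketch of a custom rule, assuming research results are stored as a plain dict, the hypothetical merge_research reducer below lets a later agent override individual keys while extending sources instead of replacing it. The field names and sample values are illustrative only.

```python
from typing import Annotated, TypedDict


def merge_research(current: dict, update: dict) -> dict:
    """Hypothetical reducer: shallow-merge keys, but extend `sources` instead of replacing it."""
    merged = {**current, **update}
    merged["sources"] = current.get("sources", []) + update.get("sources", [])
    return merged


class AgentState(TypedDict):
    # The reducer is attached to the field, so every writer gets the same merge rule.
    research: Annotated[dict, merge_research]


existing = {"query": "state models", "confidence": 0.6, "sources": ["doc-a"]}
override = {"confidence": 0.9, "sources": ["doc-b"]}

result = merge_research(existing, override)
print(result)
```

With a rule like this on the channel, a partial override is an ordinary write to the existing field, and a separate override field is no longer needed.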
This gave me three properties the earlier designs lacked. Parallel node execution was safer because reducers handled merge behavior without coordination logic inside nodes. The schema communicated intent: Annotated[list, operator.add] tells a reader that the field accumulates. The schema also became more stable under growth, because adding an agent often meant adding message types to an append channel instead of adding new top-level fields.
In that project, the migration from the Pydantic design was shorter than the earlier rewrites. Code review also became more concrete: instead of broad questions like "is this safe?", reviewers could ask "is this the right reducer?"
4. Conditional edges are not business logic
After fixing the state model, the next mistake I often see is putting business logic inside conditional edge functions.
A conditional edge in LangGraph takes state and returns the name of the next node. I keep that responsibility narrow: given this state snapshot, which node runs next?
# Wrong: business logic inside the router
def should_continue(state: AgentState) -> str:
    if state["iteration"] > 5:
        return "end"
    last_message = state["messages"][-1] if state["messages"] else ""
    if "approved" in last_message.lower() or "looks good" in last_message.lower():
        return "end"
    return "generate"

# Right: routing reads an explicit signal written by a node
def should_continue(state: AgentState) -> str:
    if state.get("approved") or state["iteration"] >= state["max_iterations"]:
        return "end"
    return "generate"
The failure mode: edge functions that parse strings, perform database lookups, or contain multi-step logic are hard to test and hard to debug in traces. When routing goes wrong, the decision is difficult to replay if the relevant logic lived only inside the edge and was not written into state.
The rule I follow is simple: if a routing condition needs more than reading a field and comparing a value, I put that condition in a node. The node writes an explicit routing signal to state, and the edge reads that signal. In my designs, state is the source of truth, not the edge function.
I also avoid routing directly on message content when I can. I route on a field written by a node after it interprets the message. The interpreter node is where the ambiguity belongs, and it is the part I can unit-test. The edge function stays close to a lookup table.
5. Interrupts and the dual-write problem
Human-in-the-loop is one of LangGraph's useful features. It is also where I have seen production incidents. The issue is usually not the API. It is the design pattern around side effects and checkpoints.
The naive pattern is straightforward: add an interrupt_before to the "send email" node, pause the graph, show the pending action in the UI, collect approval, call graph.update_state() with the approval, and resume execution.
The dual-write problem: a node's external side effect and the checkpoint that records the node's result are two separate writes. If a node sends an email and the run fails, or is replayed, before that result is checkpointed, resuming from the previous checkpoint runs the node again and can send the email again. The human-approval case has the same shape: send an email, checkpoint, human rejects, retry. The email was already sent before the rejection was processed.
# Wrong: side effect before interrupt boundary
def send_draft_node(state: AgentState) -> dict:
    email_client.send(state["draft"], to=state["recipient"])  # side effect happens here
    # The checkpoint records only this return value; replaying the node sends the email again.
    return {"email_status": "sent"}

# Right: separate intent from execution
def prepare_send_node(state: AgentState) -> dict:
    # Write the intent; no external call yet.
    return {"pending_action": {"type": "email", "to": state["recipient"], "body": state["draft"]}}

# graph interrupts here; human sees and approves pending_action

def execute_send_node(state: AgentState) -> dict:
    action = state.get("pending_action")
    if action and state.get("human_approved"):
        email_client.send(action["body"], to=action["to"])
    # Clear the intent and the approval so a later step cannot reuse them.
    return {"pending_action": None, "human_approved": None}
The pattern I use is: write the intent to state, interrupt, let the human approve or modify the intent, then execute. Replaying from the pre-execution checkpoint reruns the intent node, which has no external side effects. The executor makes the external call only if human_approved is set in state.
This is a specific case of a broader rule: checkpoints capture state, not side effects. I design nodes so replaying from a checkpoint is safe. If a node has external side effects, those effects need idempotency, an approval gate before execution, or both. If neither is true, I do not treat that side effect as safe to run inside the graph.
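The idempotency half of that rule can be sketched as follows. Everything here is an assumption for illustration: FakeEmailClient stands in for a real client, executed_keys stands in for a durable store, and idempotency_key is a hypothetical helper. A production key would probably also include a run or thread identifier, so that two legitimate sends with identical content are not collapsed into one.

```python
import hashlib
import json


class FakeEmailClient:
    """Stand-in for a real client; records calls instead of sending."""

    def __init__(self) -> None:
        self.sent: list[dict] = []

    def send(self, body: str, to: str) -> None:
        self.sent.append({"to": to, "body": body})


email_client = FakeEmailClient()
executed_keys: set[str] = set()  # assumption: a durable store in a real system


def idempotency_key(action: dict) -> str:
    """Stable hash of the approved action, used to detect replays."""
    payload = json.dumps(action, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def execute_send_node(state: dict) -> dict:
    action = state.get("pending_action")
    if action and state.get("human_approved"):
        key = idempotency_key(action)
        if key not in executed_keys:  # skip the external call on replay
            email_client.send(action["body"], to=action["to"])
            executed_keys.add(key)
    return {"pending_action": None, "human_approved": None}


state = {
    "pending_action": {"type": "email", "to": "a@example.com", "body": "Draft v3"},
    "human_approved": True,
}
first = execute_send_node(state)
second = execute_send_node(state)  # simulated replay from the pre-execution checkpoint
print(len(email_client.sent))
```

The approval gate decides whether the side effect may happen at all. The key decides whether it has already happened.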
6. What I checkpoint and what I don't
With a checkpointer attached, LangGraph persists state after every super-step, which in a sequential graph means at every node boundary. I treat that as the safe starting point. It is also a cost model I want to understand before scaling a graph.
In a graph that calls an LLM at every node, each step writes a checkpoint. In one graph I worked on, remote checkpoint writes added overhead that I could see in traces. Under higher concurrency, that overhead became easier to notice. More importantly, it was not always necessary. I did not need durable state around every compute-only node. I needed it around side effects and expensive, non-deterministic calls.
When I review checkpoint placement, I usually end up with four classes:
- Has external side effects (API call, DB write, email): I normally checkpoint before and after. Replay must be safe, so I design for idempotency or put an approval gate before the first execution.
- Calls an LLM: I checkpoint the inputs before the call. LLM calls are expensive and non-deterministic, so if the node fails mid-call, I want to replay with the same inputs instead of re-deriving them.
- Pure transform (parsing, formatting, filtering): I usually skip the checkpoint. Replaying a JSON-to-TypedDict conversion is deterministic and cheap.
- Routing node: I usually do not add a checkpoint. The routing decision is reproducible from state, which is already checkpointed.
The difference matters. Checkpointing a formatting node and replaying it on failure is usually harmless, but it wastes storage and can add latency. Not checkpointing a DB write node and replaying it on failure can create a duplicate write.
The checkpoint audit is also a correctness audit. Before I decide what needs durable state, I identify which nodes have side effects, which nodes are non-deterministic, and which nodes are pure transforms.
When I audit a LangGraph system, I list all nodes and classify each one: side effect, LLM call, pure transform, or routing. In my practice, nodes in the first two classes need checkpoint coverage. Nodes in the last two classes are candidates to skip. If I cannot classify a node, I fix that first.
The category frame
Three rewrites of the same system are not a story about LangGraph being complex. LangGraph gives me the primitives I need for durable state and replay. The story is about the cost of treating state as an implementation detail instead of the primary interface.
In a multi-agent system, the state schema is the contract between agents. It determines how agents evolve independently, how I add a new agent without breaking existing ones, and how I debug a trajectory later. If the state contract is unstable, every new agent becomes a migration risk. If the contract is explicit, it can become one of the least volatile parts of the system.
The three iterations reduce to one rule I now use: design state as channels with explicit merge semantics, not as a shared mutable object. Conditional edge functions, interrupt patterns, deterministic recovery, and checkpoint granularity are easier to reason about from that foundation.
If you're building or auditing a LangGraph system and hitting state conflicts, fragile routing, or interrupt issues, write me. That's the class of problem I find most interesting.
FAQ
How should I model shared LangGraph state between agents?
I model shared state as channels with explicit merge semantics, not as a shared mutable object. In practice, that means using reducers where needed, such as append-only lists, and keeping overwrite behavior only where it is deliberate and owned by a specific writer.
Why can a flat TypedDict state fail under parallel node execution?
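A flat dict makes every shared field a possible race condition. A key without a reducer keeps only its most recent write, so a later node silently replaces what an earlier node wrote. In current LangGraph releases, two writes to the same reducer-less key within a single step raise an InvalidUpdateError instead of being merged. I saw this with shared messages and routing flags that were silently overwritten.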
When are nested Pydantic models too rigid for LangGraph state?
I found nested Pydantic useful for tool I/O contracts, but too rigid for evolving graph state. Conditional edges had to inspect nested data, None became an ambiguous sentinel, and partial overrides created new schema fields because the update model was not flexible enough.
What belongs in a LangGraph conditional edge function?
I keep conditional edges narrow: given this state snapshot, which node runs next? If a condition requires parsing, lookups, or multi-step logic, I put that logic in a node, write an explicit routing signal to state, and let the edge read that signal.
How do I avoid duplicate side effects with LangGraph interrupts?
I separate intent from execution. A node writes the pending action to state without making the external call, the graph interrupts for human approval or modification, and a later executor runs only if approval is present. Checkpoints capture state, not side effects.