Why Shared State Breaks Multi-Agent Systems Past Three Agents
Shared blackboards work in demos and fail under coordination load — the failure modes of shared state, and why message-passing with a supervisor wins.
Why this matters
The first multi-agent system I built used a shared state dict. Three agents — a planner, a researcher, and a writer — all read from and wrote to the same LangGraph AgentState. It worked in demos. Two weeks later, with five agents and concurrent execution in production, it failed in ways that were difficult to debug.
Not crashes. Worse: subtly wrong outputs. The planner would overwrite a research result while the writer was still consuming it. An agent would read stale state from a previous session because I had not isolated the checkpoint. The supervisor would route to an agent that had already completed its task and was waiting for its own output, which had been overwritten by a parallel agent.
Shared state is not wrong. I stop trusting it when several agents coordinate through overlapping writes. The important question is where that model starts to break, and what I replace it with when I need durable state, deterministic recovery, reproducible evals, and cost under control.
1. The three failure modes of shared blackboards
A shared blackboard is any design where multiple agents read from and write to the same state object without explicit ownership. This includes LangGraph's flat TypedDict without reducer annotations, shared Python dicts passed by reference, and database tables without row-level locking.
Failure mode 1: write conflicts. When two agents write the same field, the merge rule decides the outcome, and the default rarely matches what I want. With no reducer, two writes to the same key inside one parallel super-step do not silently merge: LangGraph raises InvalidUpdateError, which shows up in production as a crash rather than a wrong answer. The quieter failure is cross-step: a later node overwrites a field an earlier node set, and unless I inspect state at each checkpoint that lost write is invisible. Either way, the behavior follows from the coordination model rather than from a bug: every field with more than one writer needs an explicit merge rule, and shared state makes me choose one.
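The two variants of this failure can be reproduced with a toy merge function. This is a minimal sketch of the idea, not LangGraph's implementation: `merge_step`, `ConflictingWriteError`, and the field names are hypothetical and exist only for illustration.

```python
import operator
from typing import Any, Callable

Reducer = Callable[[Any, Any], Any]


class ConflictingWriteError(Exception):
    """Two writers in one step targeted a key that has no reducer."""


def merge_step(state: dict, updates: list[dict], reducers: dict[str, Reducer]) -> dict:
    """Apply all updates produced in one step and return the new state."""
    merged = dict(state)
    written: set[str] = set()
    for update in updates:
        for key, value in update.items():
            reducer = reducers.get(key)
            if reducer is not None and key in merged:
                # A reducer defines how concurrent writes combine.
                merged[key] = reducer(merged[key], value)
            elif reducer is None and key in written:
                # No merge rule and a second writer in the same step: fail loudly.
                raise ConflictingWriteError(key)
            else:
                merged[key] = value
            written.add(key)
    return merged


state = {"notes": ["a"], "summary": "v1"}

# Same step, reducer present: both appends survive.
appended = merge_step(state, [{"notes": ["b"]}, {"notes": ["c"]}], {"notes": operator.add})

# Across steps, no reducer: the second write silently replaces the first.
step1 = merge_step(state, [{"summary": "from researcher"}], {})
step2 = merge_step(step1, [{"summary": "from planner"}], {})
```

In this sketch the same-step conflict is loud and the cross-step overwrite is silent, which is why the second one tends to be found later.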
Failure mode 2: stale reads. Agent B reads state that Agent A wrote two steps ago. This can be valid in a sequential pipeline. In a parallel pipeline, it is a correctness bug. If Agent B assumes the researcher's output is current but the planner has since changed the research direction, Agent B answers a question that is no longer being asked.
Failure mode 3: phantom state. Checkpointing a multi-agent system means saving shared state at a point in time. Suppose Agent C reads state, does its work, and a checkpoint is saved; Agent A then overwrites the same fields. Replaying from that checkpoint restores the version Agent C saw, which Agent A has already moved past. The replay is internally consistent but externally wrong.
I do not treat these as LangGraph bugs or checkpointer bugs. They follow from the shared-state coordination model. If I add locks, versioned fields, and read-your-writes guarantees around shared state, I start to rebuild parts of a distributed database. That is usually the wrong primitive for an agent system.
2. The message bus alternative
A message bus changes the coordination model. Instead of reading from a shared object, agents send typed messages to one another through an explicit channel. No agent reads another agent's "current output" directly. It receives a message that was deliberately sent to it.
In LangGraph, I usually model this with Annotated[list[Message], operator.add] as the primary coordination channel, combined with a supervisor node that reads the message queue and routes based on message type and content rather than on global state fields.
The practical difference is ownership. Each agent produces a Message with an explicit to, from_, type, and payload. The supervisor reads the queue, dispatches messages to the appropriate agent, and the agent processes its inbox. State is still shared — the messages list is global — but writes are constrained. No agent writes to "its" fields; it only appends to the shared channel.
from typing import Annotated, Literal, TypedDict
import operator

class Message(TypedDict):
    id: str
    from_: str
    to: str
    type: Literal["request", "result", "error", "status"]
    payload: dict

class AgentState(TypedDict):
    messages: Annotated[list[Message], operator.add]  # append-only
    session_id: str
    completed_agents: Annotated[set[str], operator.or_]  # set union
What changes: the researcher does not write to research_results. It sends Message(from_="researcher", to="writer", type="result", payload={"findings": ...}). The writer node filters state["messages"] for messages addressed to it and processes its inbox. There is no shared field both agents write to; there is only a channel both append to.
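The inbox filter described above can be sketched as plain functions. This sketch makes one assumption that is not in the schema above: a reply carries the id of the message it answers under a hypothetical `reply_to` key in its payload, so an agent can tell which inbox items it has already handled from the log alone. The function names and the draft text are also illustrative.

```python
from typing import Literal, TypedDict


class Message(TypedDict):
    id: str
    from_: str
    to: str
    type: Literal["request", "result", "error", "status"]
    payload: dict


def unanswered_inbox(messages: list[Message], agent: str) -> list[Message]:
    """Messages addressed to `agent` that it has not yet replied to."""
    # Assumption: every reply records the id it answers in payload["reply_to"].
    answered = {m["payload"].get("reply_to") for m in messages if m["from_"] == agent}
    return [m for m in messages if m["to"] == agent and m["id"] not in answered]


def writer_node(state: dict) -> dict:
    """Process research results addressed to the writer; only append new messages."""
    replies: list[Message] = []
    for msg in unanswered_inbox(state["messages"], "writer"):
        if msg["type"] != "result":
            continue
        draft = f"Draft based on: {msg['payload']['findings']}"
        replies.append(Message(
            id=f"writer-{msg['id']}",
            from_="writer",
            to="supervisor",
            type="result",
            payload={"reply_to": msg["id"], "draft": draft},
        ))
    # Returning only the new messages lets an append reducer extend the channel.
    return {"messages": replies}


log = [Message(id="m1", from_="researcher", to="writer", type="result",
               payload={"findings": "three sources agree"})]
first = writer_node({"messages": log})
second = writer_node({"messages": log + first["messages"]})  # nothing left to handle
```

The node never edits or removes an existing message, so running it twice over the same log does not produce a duplicate reply.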
3. The supervisor pattern and when it earns its cost
A supervisor agent is a router with memory. It receives messages, decides which agent runs next, and maintains global task status. It also adds a model call to routing steps. In a 10-step pipeline, that can mean 10 additional LLM calls.
The cost is real. I use a supervisor only when the control it adds is worth that cost.
Without a supervisor, adding a new agent requires updating every agent that might hand off to it. Changing routing logic requires updating edge functions. Debugging why a task stalled means reading through the message queue manually.
With a supervisor, routing logic is centralized. Adding a new agent means adding it to the supervisor's tool list. The supervisor's routing decision is visible in its logged decision or emitted message. Debugging a stalled task starts with the supervisor's last observable decision.
As a rule of thumb, I start considering a supervisor around three agents. Below that, peer-to-peer handoffs with conditional edges are often cheaper and easier to inspect. At three or more agents, centralized routing can be worth the extra calls, especially when the agent set is still changing.
The supervisor anti-pattern is using the supervisor as a glorified if-else router, with explicit rules like "if the user mentions Python, route to the coder". A supervisor adds value when it needs to track task state across multiple steps. It adds little when it only pattern-matches on one field. If the routing logic fits in a switch statement, I remove the supervisor and use conditional edges.
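For comparison, routing that fits in a switch statement can be a plain function with no model call. The sketch below assumes a hypothetical `plan_status` field and hypothetical node names; in LangGraph a function of this shape can be passed to `add_conditional_edges`, with the returned labels mapped to nodes.

```python
def route_after_planner(state: dict) -> str:
    """Pick the next node from a single status field, without an LLM call."""
    status = state.get("plan_status")  # hypothetical field name
    if status == "needs_research":
        return "researcher"
    if status == "ready_to_write":
        return "writer"
    # Anything else ends the run; map this label to the graph's end node.
    return "end"
```

If the function starts needing history across several steps to decide, that is the point where a supervisor with task state becomes easier to justify.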
4. Tool governance in multi-agent systems
When I move from one agent to many, tool access becomes a correctness concern, not just a security concern. An agent with access to tools it should not have may use them, especially if its context includes instructions from another agent that suggest doing so.
I use a simple principle: tool access should match agent role, not agent capability. The researcher agent should not have write access to the file system even if the underlying model can generate valid file-write commands. The writer should not have search tools even if it could use them to "verify" its output — that is the researcher's job.
from langgraph.prebuilt import create_react_agent

# llm and the tool functions below are assumed to be defined elsewhere.

# Researcher: read-only tools
researcher_agent = create_react_agent(
    llm,
    tools=[search_web, fetch_url, read_document],
    prompt="You retrieve and synthesize information. Do not write or modify files."
)

# Writer: write tools only, no search
writer_agent = create_react_agent(
    llm,
    tools=[write_draft, format_output],
    prompt="You write and format content based on researcher findings. Do not search."
)

# Supervisor: routing tools only, no domain tools
supervisor_agent = create_react_agent(
    llm,
    tools=[route_to_researcher, route_to_writer, mark_complete],
    prompt="You coordinate the research and writing workflow."
)
I do not treat this primarily as a security measure. I use it to reduce the agent's action space to what is appropriate for its role. The researcher cannot accidentally write a file; the writer cannot accidentally search instead of writing; the supervisor cannot accidentally execute domain actions that should go through a worker.
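One way to keep role boundaries from drifting as tools are added is to check assignments against an explicit allowlist before building each agent. This is a sketch under assumptions: `ROLE_TOOLS`, `check_tool_assignment`, and `shared_tools` are hypothetical helpers, and tools are identified by name only.

```python
ROLE_TOOLS: dict[str, set[str]] = {
    "researcher": {"search_web", "fetch_url", "read_document"},
    "writer": {"write_draft", "format_output"},
    "supervisor": {"route_to_researcher", "route_to_writer", "mark_complete"},
}


def check_tool_assignment(role: str, tool_names: list[str]) -> list[str]:
    """Return tool_names unchanged if every tool is allowed for the role."""
    extra = set(tool_names) - ROLE_TOOLS[role]
    if extra:
        raise ValueError(f"{role} may not use: {sorted(extra)}")
    return list(tool_names)


def shared_tools() -> set[str]:
    """Tools that appear in more than one role's allowlist."""
    seen: set[str] = set()
    shared: set[str] = set()
    for tools in ROLE_TOOLS.values():
        shared |= seen & tools
        seen |= tools
    return shared
```

A check like `shared_tools() == set()` in a unit test would flag the moment two roles start overlapping.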
5. Failure detection and recovery
Shared-state systems often fail with little local evidence. Message-passing systems make failures easier to observe when delivery and logging are explicit: I can inspect the queue, compare sent messages with received messages, and identify requests with no response.
The recovery primitive is straightforward: if an agent has not responded within a timeout, the supervisor can re-route the request or escalate to a human-in-the-loop (HITL) reviewer. This requires that every request message has an id and that the supervisor maintains a registry of outstanding requests.
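import time

# Message, TypedDict, Annotated, and operator come from the earlier block.
# new_id() is assumed to be a helper that returns a unique message id.

class SupervisorState(TypedDict):
    messages: Annotated[list[Message], operator.add]
    outstanding: dict[str, float]  # message_id -> sent_at timestamp

def supervisor_node(state: SupervisorState) -> dict:
    now = time.time()
    for msg_id, sent_at in list(state["outstanding"].items()):
        if now - sent_at > 30:  # 30-second timeout
            # Drop the timed-out entry so the same request is not escalated again.
            remaining = {k: v for k, v in state["outstanding"].items() if k != msg_id}
            return {
                "messages": [Message(
                    id=new_id(),
                    from_="supervisor",
                    to="hitl",
                    type="request",
                    payload={"reason": f"Agent timeout on message {msg_id}"}
                )],
                "outstanding": remaining,
            }
    pending = [m for m in state["messages"] if m["to"] == "supervisor" and m["type"] == "result"]
    # route based on pending results ...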
I treat this pattern — outstanding request registry with timeout and escalation — as the agent equivalent of a circuit breaker. It replaces "wait indefinitely for an agent that is stuck" with "detect the stuck agent and route to recovery". Without it, a single slow or failing agent can stall the pipeline with no clear signal.
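The registry only works if entries are cleared when a reply arrives. A minimal sketch of that bookkeeping follows, assuming replies carry the id of the request they answer under a hypothetical `reply_to` payload key. `reconcile_outstanding` is an illustrative helper, and the current time is passed in so the logic stays testable.

```python
def reconcile_outstanding(
    outstanding: dict[str, float],
    messages: list[dict],
    now: float,
    timeout_s: float = 30.0,
) -> tuple[dict[str, float], list[str]]:
    """Return (requests still waiting, ids that have exceeded the timeout)."""
    # Assumption: a result or error message names the request it answers.
    answered = {
        m["payload"].get("reply_to")
        for m in messages
        if m["type"] in ("result", "error")
    }
    still_waiting = {mid: t for mid, t in outstanding.items() if mid not in answered}
    timed_out = [mid for mid, t in still_waiting.items() if now - t > timeout_s]
    return still_waiting, timed_out


registry = {"req-1": 100.0, "req-2": 100.0, "req-3": 125.0}
log = [{"from_": "researcher", "to": "supervisor", "type": "result",
        "payload": {"reply_to": "req-1"}}]
waiting, expired = reconcile_outstanding(registry, log, now=140.0)
```

Here `req-1` is cleared by its reply, `req-2` has waited 40 seconds and is reported as timed out, and `req-3` is still within the window.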
6. Coordination patterns by system size
I choose the coordination model based on the number of agents and how they interact:
2 agents: Direct handoff. Agent A runs, writes output to a named field with a clear reducer, and Agent B reads that field. No supervisor is needed. A conditional edge routes from A to B based on A's output status. This is the simplest correct design I use for this problem shape. I avoid adding coordination machinery before I need it.
3–5 agents: Supervisor with message bus. The supervisor routes between agents. Agents communicate through the append-only message channel, not through named fields. Each agent filters its inbox. The supervisor decides which message should trigger which agent. This is the range where I usually start to prefer explicit routing and durable message history.
More than 5 agents or dynamic sets: Hierarchical supervisor. A top-level supervisor delegates to sub-supervisors. Each sub-supervisor manages a team of specialized agents. The top-level supervisor talks to sub-supervisors, not directly to domain agents. This adds coordination overhead, but it keeps boundaries clearer. Each sub-supervisor can be developed and tested independently.
I do not treat the transition between these tiers as arbitrary. I look for the failure modes. With 2 agents, write conflicts are easier to reason about because there are only two writers. As the agent count grows and parallelism increases, append-only channels become more important. When a single supervisor becomes a routing bottleneck, I consider hierarchical delegation.
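The boundary rule in the hierarchical tier, where the top level addresses sub-supervisors and never domain agents, can be sketched as two routing functions. The team names, agent names, and the `_supervisor` naming convention are hypothetical.

```python
TEAMS: dict[str, set[str]] = {
    "research_team": {"searcher", "summarizer"},
    "writing_team": {"writer", "editor"},
}


def top_level_route(target_agent: str) -> str:
    """The top-level supervisor resolves an agent to its team's sub-supervisor."""
    for team, members in TEAMS.items():
        if target_agent in members:
            return f"{team}_supervisor"
    raise KeyError(f"no team owns agent {target_agent!r}")


def sub_supervisor_route(team: str, target_agent: str) -> str:
    """A sub-supervisor dispatches only to agents inside its own team."""
    if target_agent not in TEAMS[team]:
        raise ValueError(f"{target_agent!r} is not in {team}")
    return target_agent
```

Because each sub-supervisor sees only its own team, a team can be tested with a stubbed top level that sends it messages directly.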
The category frame
Shared state is the right model for a single agent with a clear schema. I do not use it as the primary coordination model for a team of agents with overlapping write access. The three failure modes (write conflicts, stale reads, and phantom state) are not edge cases. They are properties of the coordination model that become visible when I add parallelism.
Message-passing with a supervisor is not architecturally pure; it is a pragmatic trade. Debugging becomes more direct when I can inspect the message queue. Recovery becomes tractable when I can requeue timed-out messages. Growth becomes more controlled when I add an agent by adding it to the supervisor's tool list. In a 10-step system, the extra routing calls may be justified when debugging and recovery matter more than latency.
If you are debugging state corruption, routing instability, or checkpoint replay bugs in a multi-agent system, you can write me with the failure mode and trace shape. I treat these as solvable problems with known patterns.
FAQ
When does shared state start to break in multi-agent systems?
I stop trusting shared state when several agents coordinate through overlapping writes. Around three agents, I start considering a supervisor; with 3–5 agents, I usually prefer a supervisor with an append-only message bus and durable message history.
What are the main failure modes of a shared blackboard?
I see three recurring failure modes: write conflicts, stale reads, and phantom state. These are not LangGraph or checkpointer bugs; they follow from a coordination model where multiple agents read from and write to the same state object without explicit ownership.
How does a message bus reduce shared-state coordination bugs?
A message bus changes ownership. Agents do not write to each other's fields or read another agent's current output directly. Each agent appends typed messages with explicit to, from_, type, and payload, then processes messages addressed to its inbox.
When is a supervisor worth the extra LLM calls?
I use a supervisor only when centralized routing, task-state tracking, and observable decisions are worth the added calls. It earns its cost when routing logic changes often, the agent set is still changing, or debugging should start from the supervisor's last decision.
How should tool access be assigned across agents?
I assign tool access by agent role, not by model capability. The researcher gets read-only tools, the writer gets write and formatting tools, and the supervisor gets routing tools. This reduces each agent's action space to what is appropriate for its role.