Gianluca Mazza — Why Shared State Breaks Multi-Agent Systems Past Three Agents

Why this matters

The first multi-agent system I built used a shared state dict. Three agents — a planner, a researcher, and a writer — all reading and writing to the same LangGraph AgentState. It worked beautifully at demo time. Two weeks later in production, with five agents and concurrent execution, it failed in ways that were nearly impossible to debug.

Not crashes. Worse: subtly wrong outputs. The planner would overwrite a research result the writer was mid-way through consuming. An agent would read stale state from a previous session because we hadn't isolated the checkpoint. The supervisor would route to an agent that had already completed its task and was waiting for its own output, which had been overwritten by a parallel agent.

Shared state is not wrong in itself. It stops working past a certain scale of agent coordination. Knowing where it breaks, and what to replace it with, is the difference between a system that works in testing and one that works in production.

1. The three failure modes of shared blackboards

A shared blackboard is any design where multiple agents read from and write to the same state object without explicit ownership. This includes LangGraph's flat TypedDict without reducer annotations, shared Python dicts passed by reference, and database tables without row-level locking.

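Failure mode 1: write conflicts. Two agents produce outputs for the same field and the last write wins. If both agents write to messages, you lose one message. If both write to current_plan, one plan is silently discarded. With a plain shared dict, or with writes that land in consecutive steps, nothing flags the loss. (LangGraph does raise an InvalidUpdateError when two nodes write the same unannotated key in a single step, which is the easy case to catch.) When updates are applied in a fixed order the outcome is deterministic: given the same inputs, the same agent always loses. But the loss is invisible without inspecting state at every checkpoint.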
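A minimal sketch of the difference, in plain Python rather than LangGraph itself. The two merge helpers are hypothetical stand-ins for how a state container might apply updates; they are not LangGraph APIs, and the message strings are made up for illustration.

```python
import operator


def merge_last_write_wins(state: dict, updates: list[dict]) -> dict:
    """Apply each update in order; later writes replace earlier ones."""
    merged = dict(state)
    for update in updates:
        merged.update(update)
    return merged


def merge_with_reducers(state: dict, updates: list[dict], reducers: dict) -> dict:
    """Apply updates, combining values for keys that declare a reducer."""
    merged = dict(state)
    for update in updates:
        for key, value in update.items():
            if key in reducers and key in merged:
                merged[key] = reducers[key](merged[key], value)
            else:
                merged[key] = value
    return merged


start = {"messages": []}
same_step_updates = [
    {"messages": ["researcher: found 3 sources"]},
    {"messages": ["planner: revised outline"]},
]

# Without a reducer the researcher's message is silently dropped.
lost = merge_last_write_wins(start, same_step_updates)
# With an append reducer both messages survive.
kept = merge_with_reducers(start, same_step_updates, {"messages": operator.add})
```

The only thing that changes between the two calls is whether the field declares how concurrent writes combine.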

Failure mode 2: stale reads. Agent B reads state that Agent A wrote two steps ago. This is valid in sequential pipelines; it's a correctness bug in parallel ones. When Agent B assumes the researcher's output is current but the planner has since revised the research direction, Agent B produces an answer to a question that's no longer being asked.

Failure mode 3: phantom state. Checkpointing a multi-agent system means saving shared state at a point in time. Suppose Agent C reads state, does its work, and a checkpoint is saved; Agent A then overwrites the same fields. Replaying from Agent C's checkpoint restores a version of state that Agent A has already moved past. The replay is internally consistent but externally wrong.
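The replay problem can be reproduced without any framework. This is a toy sketch: the field names research_direction and draft are invented for the example, and copy.deepcopy stands in for whatever a real checkpointer does.

```python
import copy

state = {"research_direction": "battery chemistry", "draft": None}

# Agent C reads state, does its work, and a checkpoint is saved.
state["draft"] = f"Draft about {state['research_direction']}"
checkpoint_after_c = copy.deepcopy(state)

# Agent A then revises the field Agent C depended on.
state["research_direction"] = "grid storage economics"

# Replaying from Agent C's checkpoint brings back the superseded direction.
replayed = copy.deepcopy(checkpoint_after_c)
```

Nothing in replayed is corrupt. It simply describes a world the planner has already left.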

These failures are not bugs in LangGraph or in your checkpointer. They are properties of the shared-state coordination model itself. Adding infrastructure around shared state — locks, versioned fields, read-your-writes guarantees — reconstructs a distributed database, which is the wrong primitive for an agent system.

2. The message bus alternative

A message bus inverts the coordination model: instead of agents reading from a shared object, agents send typed messages to one another through an explicit channel. No agent ever reads another agent's "current output" — it receives a message deliberately sent to it.

In LangGraph, this means using Annotated[list[Message], operator.add] as the primary coordination channel, combined with a supervisor node that reads the message queue and routes based on message type and content rather than on global state fields.

The practical difference: each agent produces a Message with an explicit to, from_, type, and payload. The supervisor reads the queue, dispatches messages to the appropriate agent, and the agent processes its inbox. State is still shared — the messages list is global — but ownership is explicit. No agent writes to "its" fields; it only appends to the shared channel.

import operator
from typing import Annotated, Literal, TypedDict

class Message(TypedDict):
    id: str
    from_: str
    to: str
    type: Literal["request", "result", "error", "status"]
    payload: dict

class AgentState(TypedDict):
    messages: Annotated[list[Message], operator.add]  # append-only
    session_id: str
    completed_agents: Annotated[set[str], lambda a, b: a | b]

What changes: the researcher doesn't write to research_results. It sends Message(from_="researcher", to="writer", type="result", payload={"findings": ...}). The writer node filters state["messages"] for messages addressed to it and processes its inbox. There is no shared field both agents write to; there is only a channel both append to.
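A sketch of what the inbox filter might look like. Because the channel is append-only, an agent needs some way to tell which messages it has already handled; here that is derived from an in_reply_to key in the payload, which is an assumption of this example and not part of the Message definition above. The inbox helper and writer_node body are likewise hypothetical.

```python
from typing import Literal, TypedDict


class Message(TypedDict):
    id: str
    from_: str
    to: str
    type: Literal["request", "result", "error", "status"]
    payload: dict


def inbox(messages: list[Message], agent: str, handled_ids: set[str]) -> list[Message]:
    """Return messages addressed to `agent` that it has not processed yet."""
    return [m for m in messages if m["to"] == agent and m["id"] not in handled_ids]


def writer_node(state: dict) -> dict:
    """Hypothetical writer node: answer each unread result with a status message."""
    # Ids the writer has already replied to, recovered from its own past messages.
    handled = {
        m["payload"]["in_reply_to"]
        for m in state["messages"]
        if m["from_"] == "writer" and "in_reply_to" in m["payload"]
    }
    replies = [
        Message(
            id=f"writer-{m['id']}",
            from_="writer",
            to="supervisor",
            type="status",
            payload={"in_reply_to": m["id"], "summary": f"drafted from {m['id']}"},
        )
        for m in inbox(state["messages"], "writer", handled)
        if m["type"] == "result"
    ]
    return {"messages": replies}  # appended to the channel by the operator.add reducer


state = {"messages": [
    Message(id="m1", from_="researcher", to="writer", type="result",
            payload={"findings": ["a"]}),
    Message(id="m2", from_="researcher", to="supervisor", type="status", payload={}),
]}
first = writer_node(state)
state["messages"] = state["messages"] + first["messages"]  # simulate the reducer
second = writer_node(state)  # nothing new addressed to the writer
```

The writer never touches a field owned by the researcher. It reads what was sent to it and appends its own reply.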

3. The supervisor pattern and when it earns its cost

A supervisor agent is a router with memory. It receives all messages, decides which agent runs next, and maintains global task status. It adds a model call to every routing step — in a 10-step pipeline, that's 10 additional LLM calls.

The cost is real. The question is what it buys.

Without a supervisor: adding a new agent requires updating every agent that might hand off to it. Changing routing logic requires updating edge functions. Debugging why a task stalled means reading through the message queue manually.

With a supervisor: routing logic is centralized. Adding a new agent means adding it to the supervisor's tool list. The supervisor's reasoning about "who handles this next" is visible in its chain-of-thought. Debugging a stalled task means looking at the supervisor's last decision.

The break-even point: three agents. Below three, peer-to-peer handoffs with conditional edges are cheaper and simpler. At three or more, the supervisor's centralized routing is worth the cost — especially if the agent set is evolving.

The supervisor anti-pattern: using the supervisor as a glorified if-else router, with explicit rules like "if the user mentions Python, route to the coder". A supervisor adds value when it needs to reason about task state across multiple steps, not when it's pattern-matching on a single field. If your supervisor's routing logic fits in a switch statement, remove it and use conditional edges.
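For contrast, here is roughly what switch-statement routing looks like as a plain function. The state fields research_status and retries and the node name human_review are invented for this sketch.

```python
from typing import TypedDict


class PipelineState(TypedDict):
    research_status: str  # hypothetical field: "ok", "empty", or "error"
    retries: int


def route_after_research(state: PipelineState) -> str:
    """Return the name of the next node based on a single status field."""
    if state["research_status"] == "ok":
        return "writer"
    if state["research_status"] == "empty" and state["retries"] < 2:
        return "researcher"
    return "human_review"
```

In LangGraph a function of this shape is what you would hand to add_conditional_edges as the path function. If all of your routing looks like this, no model call is needed to make the decision.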

4. Tool governance in multi-agent systems

When you move from one agent to many, tool access becomes a correctness concern, not just a security concern. An agent with access to tools it shouldn't have will use them — especially if its context includes instructions from another agent that suggest it.

The principle: tool access should match agent role, not agent capability. The researcher agent should not have write access to the file system even if the underlying model could generate valid file-write commands. The writer should not have search tools even if it could use them to "verify" its output — that's the researcher's job.

from langgraph.prebuilt import create_react_agent

# llm and the tool functions below are assumed to be defined elsewhere.
# Recent langgraph.prebuilt releases take the system prompt as prompt=;
# older releases used state_modifier=.

# Researcher: read-only tools
researcher_agent = create_react_agent(
    llm,
    tools=[search_web, fetch_url, read_document],
    prompt="You retrieve and synthesize information. Do not write or modify files."
)

# Writer: write tools only, no search
writer_agent = create_react_agent(
    llm,
    tools=[write_draft, format_output],
    prompt="You write and format content based on researcher findings. Do not search."
)

# Supervisor: routing tools only, no domain tools
supervisor_agent = create_react_agent(
    llm,
    tools=[route_to_researcher, route_to_writer, mark_complete],
    prompt="You coordinate the research and writing workflow."
)

This is not primarily about security — it's about reducing the agent's action space to what's appropriate for its role. A smaller action space produces more reliable decisions. The researcher can't accidentally write a file; the writer can't accidentally search instead of writing; the supervisor can't accidentally execute domain actions that should go through a worker.
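One way to keep the role-to-tool mapping from drifting is to declare it once and derive each agent's tool list from it. This is a sketch under assumptions: ROLE_TOOLS, tools_for, check_call, and ToolNotAllowedError are hypothetical names, and the stub tools exist only so the example runs on its own.

```python
from typing import Callable

# Single source of truth for which role may use which tool.
ROLE_TOOLS: dict[str, frozenset[str]] = {
    "researcher": frozenset({"search_web", "fetch_url", "read_document"}),
    "writer": frozenset({"write_draft", "format_output"}),
    "supervisor": frozenset({"route_to_researcher", "route_to_writer", "mark_complete"}),
}


class ToolNotAllowedError(PermissionError):
    """Raised when a role asks for a tool outside its allowlist."""


def tools_for(role: str, available: dict[str, Callable]) -> list[Callable]:
    """Return only the tool callables that `role` is allowed to use."""
    allowed = ROLE_TOOLS[role]
    missing = allowed - set(available)
    if missing:
        raise KeyError(f"tools not registered: {sorted(missing)}")
    return [available[name] for name in sorted(allowed)]


def check_call(role: str, tool_name: str) -> None:
    """Reject a tool call that falls outside the role's allowlist."""
    if tool_name not in ROLE_TOOLS.get(role, frozenset()):
        raise ToolNotAllowedError(f"{role} may not call {tool_name}")


def _stub(name: str) -> Callable:
    """Build a placeholder tool so the example is self-contained."""
    def tool(*args, **kwargs):
        return name
    tool.__name__ = name
    return tool


available = {name: _stub(name) for names in ROLE_TOOLS.values() for name in names}
writer_tools = tools_for("writer", available)
```

The list returned by tools_for is what would be passed as tools= when constructing the agent, and check_call can guard a dispatcher if tool calls are executed outside the agent.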

5. Failure detection and recovery

Shared-state systems fail silently. Message-passing systems fail explicitly — a message either arrives, or it doesn't, and you can inspect the queue to see which.

The recovery primitive: if an agent hasn't responded within a timeout, the supervisor can re-route the request or escalate to a human in the loop (HITL). This requires that every request message has an id and that the supervisor maintains a registry of outstanding requests.

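import operator
import time
import uuid
from typing import Annotated, TypedDict

TIMEOUT_SECONDS = 30

class SupervisorState(TypedDict):
    messages: Annotated[list[Message], operator.add]  # Message as defined in section 2
    outstanding: dict[str, float]  # message_id -> sent_at timestamp

def supervisor_node(state: SupervisorState) -> dict:
    now = time.time()
    for msg_id, sent_at in state["outstanding"].items():
        if now - sent_at > TIMEOUT_SECONDS:
            # Remove the timed-out entry so it is escalated once, not on every pass.
            remaining = {k: v for k, v in state["outstanding"].items() if k != msg_id}
            return {
                "messages": [Message(
                    id=str(uuid.uuid4()),
                    from_="supervisor",
                    to="hitl",
                    type="request",
                    payload={"reason": f"Agent timeout on message {msg_id}"}
                )],
                "outstanding": remaining,
            }
    pending = [m for m in state["messages"] if m["to"] == "supervisor" and m["type"] == "result"]
    # route based on pending results ...
    return {"messages": []}  # placeholder: nothing to append until routing is filled in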

This pattern — outstanding request registry with timeout and escalation — is the agent equivalent of a circuit breaker. It replaces "wait indefinitely for an agent that's stuck" with "detect the stuck agent and route to recovery". Without it, a single slow or failing agent stalls the entire pipeline with no signal.
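The registry itself needs three operations: record a request when it is sent, clear it when an answer arrives, and list what has expired. A minimal sketch follows, assuming replies carry an in_reply_to key in their payload (not part of the Message definition above). The helper names are hypothetical, and the clock is injectable so the logic can be tested without sleeping.

```python
import time
from typing import Callable


def track_request(outstanding: dict[str, float], message_id: str,
                  clock: Callable[[], float] = time.time) -> dict[str, float]:
    """Return a new registry with `message_id` stamped at the current time."""
    return {**outstanding, message_id: clock()}


def settle_results(outstanding: dict[str, float], messages: list[dict]) -> dict[str, float]:
    """Drop registry entries that a result or error message has answered."""
    answered = {
        m["payload"].get("in_reply_to")
        for m in messages
        if m["type"] in ("result", "error")
    }
    return {k: v for k, v in outstanding.items() if k not in answered}


def timed_out(outstanding: dict[str, float], timeout_s: float,
              clock: Callable[[], float] = time.time) -> list[str]:
    """Return ids of requests that have waited longer than `timeout_s`."""
    now = clock()
    return sorted(k for k, sent in outstanding.items() if now - sent > timeout_s)


registry = track_request({}, "m1", clock=lambda: 100.0)
registry = track_request(registry, "m2", clock=lambda: 120.0)
registry = settle_results(registry, [{"type": "result", "payload": {"in_reply_to": "m1"}}])
```

Each helper returns a new dict instead of mutating its input, which suits a state field that is replaced wholesale on every update.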

6. Coordination patterns by system size

The right coordination model depends on the number of agents and how they interact:

2 agents: Direct handoff. Agent A runs, writes output to a named field (with a clear reducer), Agent B reads that field. No supervisor needed. A conditional edge routes from A to B based on A's output status. This is the simplest correct design; don't add complexity you don't need.

3–5 agents: Supervisor with message bus. The supervisor routes between agents; agents communicate via the append-only message channel, not via named fields. Each agent has an inbox filter; the supervisor ensures the right message reaches the right agent. This is where most production multi-agent systems should live.

More than 5 agents, or dynamic sets: Hierarchical supervisor. A top-level supervisor delegates to sub-supervisors, each of which manages a team of specialized agents. The top-level supervisor never talks directly to domain agents, only to sub-supervisors, which route to their agents. This adds coordination overhead but makes the system modular: each sub-supervisor can be developed and tested independently.

The transition between these tiers is not arbitrary — it's driven by when the failure modes appear. At 2 agents, write conflicts are manageable because there are only two writers. At 5+, they're nearly guaranteed without append-only channels. At 10+, a single supervisor is a routing bottleneck; hierarchical delegation becomes necessary.
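The hierarchical tier's routing rule (the top level addresses teams, never domain agents) can be sketched as two small functions. The team and agent names here are invented, and a real system would have the supervisors reason with a model where this sketch uses a lookup.

```python
# Hypothetical team layout: each sub-supervisor owns a set of domain agents.
TEAMS: dict[str, set[str]] = {
    "research_team": {"web_researcher", "paper_researcher"},
    "writing_team": {"drafter", "editor"},
}


def team_for(agent: str) -> str:
    """Find the team whose sub-supervisor manages `agent`."""
    for team, members in TEAMS.items():
        if agent in members:
            return team
    raise KeyError(f"no team manages agent {agent!r}")


def top_level_route(message: dict) -> str:
    """Top-level supervisor: address the sub-supervisor, never the domain agent."""
    return f"{team_for(message['to'])}_supervisor"


def sub_supervisor_route(team: str, message: dict) -> str:
    """Sub-supervisor: deliver only to agents on its own team."""
    if message["to"] not in TEAMS[team]:
        raise ValueError(f"{message['to']!r} is not on {team}")
    return message["to"]


msg = {"to": "editor", "type": "request", "payload": {}}
first_hop = top_level_route(msg)
second_hop = sub_supervisor_route("writing_team", msg)
```

Adding a domain agent then means touching one team's membership, not the top-level supervisor.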

The category frame

Shared state is the right model for a single agent with a clear schema. It's the wrong model for a team of agents with overlapping write access. The failure modes (write conflicts, stale reads, phantom state) are not edge cases. They're properties of the coordination model that become visible the moment you add parallelism.

Message-passing with a supervisor is not architecturally pure — it's a pragmatic trade. Debugging is explicit (inspect the message queue). Recovery is tractable (requeue timed-out messages). Growth is additive (add an agent by adding it to the supervisor's tool list). The 10 additional LLM calls per pipeline in a 10-step system are worth it.

If you're scaling a multi-agent system and hitting state corruption, routing instability, or checkpoint replay bugs, write to me. These are solvable problems with known solutions.