Why Autonomous Research Agents Hallucinate — and How a Critic Loop Fixes It
Planner-executor agents fail on verifiability. Adding a critic agent with independent source access is the structural fix that survives adversarial queries.
Why this matters
An autonomous research agent that hallucinates is worse than no agent at all. It produces outputs that look authoritative, include plausible-sounding citations, and are partially or entirely wrong. The researcher who receives the output has no efficient way to verify it — that's why they delegated to an agent in the first place.
I've built research agents that failed quietly in exactly this way. The planner generated good sub-questions. The executor retrieved real documents. Then the synthesis step, under context pressure and trying to produce a coherent report from partial results, would fill gaps with plausible-sounding claims that had no source. Those claims weren't in any retrieved document. They were in the model's training data, laundered through the appearance of research.
The failure is architectural, not a prompt-engineering problem. A planner-executor pair with no verification step is structurally incapable of distinguishing "we found this in a source" from "the model is confident about this". Adding a critic with independent source access is the structural fix. Here's why, and what it looks like in practice.
1. The planner-executor pattern and its limits
The planner-executor pair is the standard blueprint for autonomous research: a planner decomposes a high-level topic into sub-questions, an executor retrieves and summarizes answers to each sub-question, and the planner synthesizes the executor's outputs into a final report.
This works well when the corpus is well-defined, the sub-questions are independently answerable, and the retrieved documents are dense with relevant information. It breaks when the corpus is sparse, the sub-questions are ambiguous, or the synthesis step is trying to reconcile partial and contradictory sources.
The synthesis failure mode is the dangerous one. The planner receives executor outputs like "Source A says X" and "Source B is ambiguous about Y" and must produce a coherent report. Under context pressure — limited tokens, many sources — the synthesis model fills the ambiguous parts with what it expects the answer to be, based on training data. The output sounds coherent. The claims are unverifiable.
A critic loop doesn't fix root causes (sparse corpora, ambiguous questions) but it surfaces the failure: it flags claims that can't be mapped to retrieved sources, distinguishes "found in documents" from "inferred from training", and gives the planner enough signal to re-query or explicitly mark uncertainty.
2. What a critic agent actually does
The critic is not a second synthesizer. It's a verification agent with two specific capabilities: access to the same source documents the executor retrieved, and a strict grounding constraint.
The critic's task: given the executor's summary and the source documents, mark every claim as grounded (found verbatim or paraphrased in a source), inferred (logically follows from sources but not stated directly), or unsupported (not found in any retrieved source).
from pydantic import BaseModel
from typing import Literal

class ClaimVerification(BaseModel):
    claim: str
    status: Literal["grounded", "inferred", "unsupported"]
    source_url: str | None  # required when status == "grounded"
    confidence: float  # 0.0 to 1.0

class CriticOutput(BaseModel):
    verified_claims: list[ClaimVerification]
    overall_groundedness: float  # fraction of claims that are "grounded"
    flags: list[str]  # specific issues for the planner to act on
The critic returns this structured output to the planner — not as free text and not to the executor. The planner reads overall_groundedness and the flags list to decide what happens next: if groundedness is above threshold (e.g., 0.8), the report is approved. If below, the planner re-queues the flagged sub-questions with explicit instructions to find sources for the unsupported claims.
This loop has a max-iterations guard. If groundedness is still below the threshold after two re-query cycles, the planner marks the low-confidence claims explicitly in the final report rather than silently passing them through.
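The approve-or-re-query decision and the iteration guard can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation behind this article: `run_critic` and `requery` are hypothetical callables standing in for the critic and executor calls, plain dataclasses stand in for the pydantic models, and "above threshold" is read as greater than or equal to. The 0.8 threshold and two-cycle limit come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Claim:
    text: str
    status: str  # "grounded", "inferred", or "unsupported"
    source_url: Optional[str] = None


@dataclass
class Verdict:
    claims: list[Claim]
    flags: list[str] = field(default_factory=list)

    @property
    def groundedness(self) -> float:
        # Fraction of claims marked "grounded".
        # Assumption: a verdict with no claims scores 0.0.
        if not self.claims:
            return 0.0
        return sum(c.status == "grounded" for c in self.claims) / len(self.claims)


def critic_loop(
    summary: str,
    run_critic: Callable[[str], Verdict],        # hypothetical critic call
    requery: Callable[[str, list[str]], str],    # hypothetical re-query call
    threshold: float = 0.8,
    max_requeries: int = 2,
) -> tuple[str, Verdict, bool]:
    """Return (summary, verdict, approved) after at most max_requeries re-queries."""
    verdict = run_critic(summary)
    for _ in range(max_requeries):
        if verdict.groundedness >= threshold:
            return summary, verdict, True
        # Below threshold: re-run retrieval for the flagged issues, then re-verify.
        summary = requery(summary, verdict.flags)
        verdict = run_critic(summary)
    # Guard reached: the caller marks low-confidence claims if still not approved.
    return summary, verdict, verdict.groundedness >= threshold
```

When `approved` comes back false, the planner still has the last `verdict` in hand, so it can label each non-grounded claim in the final report instead of dropping or hiding it.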
3. Independent source access for the critic
The critic must have access to the same source documents the executor retrieved — not the executor's summary of those documents. This is the key structural requirement.
If the critic only sees the executor's summary, it can verify internal consistency ("do these claims contradict each other?") but not external grounding ("is this claim actually in the source?"). A summary can be internally consistent and completely wrong.
# Wrong: critic sees only the summary
critic_input = {
    "summary": executor_output.summary,
    "task": "Verify the claims in this summary."
}
# Critic can check consistency but not source grounding

# Right: critic sees summary and original documents
critic_input = {
    "summary": executor_output.summary,
    "source_documents": executor_output.retrieved_docs,  # original text, not summaries
    "task": "For each claim in the summary, verify it against the source documents."
}
# Critic can map claims to specific passages
The practical consequence: the critic's context window must hold the summary and the relevant source excerpts. For long-form research tasks with many sources, this means either running the critic on one sub-question at a time, or using a large-context model for the critic pass only.
Running per-sub-question is cheaper but misses multi-source claims ("Sources A and B both confirm that...") unless you pass the full claim context explicitly. Running the critic on the full corpus at once is more expensive but catches cross-source contradictions — the critic can see whether two sources say conflicting things about the same claim and flag both.
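One way to choose between the two modes is to estimate whether the summary plus all original documents fits the critic's context budget. The sketch below is illustrative only: `estimate_tokens` uses a crude four-characters-per-token heuristic (an assumption, not a real tokenizer), and `plan_critic_passes` and `context_budget` are hypothetical names.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)


def plan_critic_passes(
    results: dict[str, list[str]],
    summary: str,
    context_budget: int,
) -> list[list[str]]:
    """Group sub-questions into critic passes.

    results maps each sub-question to its original source texts.
    Returns a single group (one full-corpus pass) when everything fits
    the budget, otherwise one group per sub-question.
    """
    total = estimate_tokens(summary) + sum(
        estimate_tokens(doc) for docs in results.values() for doc in docs
    )
    if total <= context_budget:
        return [list(results)]
    return [[question] for question in results]
```

A per-sub-question plan would still need the multi-source claims passed explicitly, as noted above; this sketch only decides the grouping.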
4. Recursive summarization and citation tracing
Autonomous research agents frequently face more retrieved content than fits in a single context window. The standard response is recursive summarization: summarize document A, summarize document B, synthesize the summaries.
Recursive summarization is the right tool for compression. It's problematic for citation tracing. When you summarize document A, you produce a condensed version that loses specific passages, so the critic can no longer map claims to exact text in document A. You've severed the evidentiary chain.
The design that preserves it: store the original document alongside the summary, and pass both to the critic.
class ExecutorResult(BaseModel):
    sub_question: str
    summary: str  # compressed, used for planning
    source_docs: list[str]  # original text, used for critic verification
    source_urls: list[str]
The critic reads the summary to understand claim context, then searches the original for the specific passage. This more than doubles the data stored per executor result, since the originals are longer than the summary. That is acceptable when the corpus is measured in thousands of tokens. If corpus size makes this infeasible, the alternative is embedding the originals and running citation search at critic time rather than passing them inline.
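The "search the original for the specific passage" step might look something like the sketch below. To stay dependency-free it scores sentences by token overlap, which is a stand-in for embedding similarity and not the same technique: it catches near-verbatim support but will miss paraphrases. The function name `find_supporting_passage` and the `min_overlap` cutoff are assumptions for illustration.

```python
import re


def _tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_supporting_passage(claim: str, source_docs: list[str], min_overlap: float = 0.5):
    """Return (doc_index, passage, score) for the best-matching sentence, or None.

    score is the fraction of the claim's tokens that appear in the passage.
    """
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return None
    best = None
    for i, doc in enumerate(source_docs):
        # Naive sentence split on terminal punctuation followed by whitespace.
        for passage in re.split(r"(?<=[.!?])\s+", doc):
            score = len(claim_tokens & _tokens(passage)) / len(claim_tokens)
            if best is None or score > best[2]:
                best = (i, passage, score)
    if best is None or best[2] < min_overlap:
        return None
    return best
```

A claim with no passage above the cutoff returns `None`, which maps naturally onto the "unsupported" status. Swapping in real embeddings would change the scoring line, not the overall shape.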
5. Map-reduce for independent sub-questions
When the research task decomposes into independent sub-questions (market trends, competitive landscape, regulatory environment), the planner can dispatch the executor in parallel (map) and aggregate the results (reduce).
The map step dispatches all sub-questions simultaneously. The reduce step synthesizes all executor outputs into a final report. The critic runs once, on the final synthesis against the full corpus, not on each individual executor result.
This is faster than sequential execution and maintains verifiability — the reduce step passes all source documents to the critic, which can verify the final synthesis against everything retrieved.
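The map, reduce, and critic stages described above can be sketched with `asyncio`. The three callables `execute`, `synthesize`, and `critique` are hypothetical async functions standing in for the executor, planner synthesis, and critic; the dict keys mirror the ExecutorResult fields but are assumed here rather than taken from a real implementation.

```python
import asyncio


async def run_map_reduce(sub_questions, execute, synthesize, critique):
    """Run executors concurrently, synthesize once, then verify once."""
    # Map: dispatch every sub-question concurrently; gather preserves input order.
    results = await asyncio.gather(*(execute(q) for q in sub_questions))
    # Reduce: synthesize one report from all executor summaries.
    report = await synthesize([r["summary"] for r in results])
    # Critic: verify the synthesis against every original document retrieved.
    corpus = [doc for r in results for doc in r["source_docs"]]
    return await critique(report, corpus)
```

Note that the critic receives the original documents collected from every executor result, not the summaries, which is what keeps the final pass grounded.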
The failure mode to avoid in the reduce step: the planner synthesizes across sub-questions without checking whether the same claim appeared in multiple executor results with conflicting values. "The market size is $5B" from one sub-question and "The market size is $3B" from another need to be flagged, not averaged. The critic's flags list surfaces these conflicts.
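A conflict check of this kind can be sketched as below. It assumes the hard part is already done: claims have been normalized into `(sub_question, metric, value)` tuples so that "market size" from two executors is recognizably the same metric. That normalization, the tuple shape, and the name `find_conflicts` are all assumptions for illustration.

```python
from collections import defaultdict


def find_conflicts(findings):
    """findings: iterable of (sub_question, metric, value) tuples.

    Returns one flag string per metric reported with more than one distinct value.
    """
    seen = defaultdict(dict)  # metric -> {value: [sub_questions reporting it]}
    for sub_question, metric, value in findings:
        seen[metric].setdefault(value, []).append(sub_question)
    flags = []
    for metric, by_value in seen.items():
        if len(by_value) > 1:
            detail = "; ".join(
                f"{value} (from {', '.join(questions)})"
                for value, questions in by_value.items()
            )
            flags.append(f"Conflicting values for {metric}: {detail}")
    return flags
```

The function only reports; it deliberately never picks a winner or averages, leaving resolution to a re-query or to an explicit note in the report.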
6. Handling coverage gaps explicitly
A research agent that only succeeds when good sources exist is half-built. The critic loop must also handle the case where retrieved sources are insufficient for the question.
The signal: if the critic consistently marks claims as unsupported across two re-query cycles with different search terms, the failure is likely corpus coverage, not query quality. The right response is to explicitly report the coverage gap rather than producing a plausible-sounding answer.
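That signal can be computed from the critic's history. In this sketch, `history[i]` is the set of claims marked unsupported after critic pass `i`, where pass 0 is the initial run and each later pass follows a re-query with different search terms. It assumes claims carry a stable identifier or text across passes, which a real system would have to ensure; `persistent_gaps` is a hypothetical name.

```python
def persistent_gaps(history: list[set[str]], cycles: int = 2) -> list[str]:
    """Return claims that stayed unsupported through `cycles` re-query cycles.

    Needs the initial pass plus `cycles` follow-up passes; with less history
    there is not yet enough evidence to call anything a coverage gap.
    """
    needed = cycles + 1
    if len(history) < needed:
        return []
    # A claim counts as a gap only if every one of the last `needed` passes flagged it.
    persistent = set.intersection(*history[-needed:])
    return sorted(persistent)
```

Claims returned here are candidates for the report's coverage gap list; claims that dropped out of the set along the way were resolved by re-querying.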
class ResearchReport(BaseModel):
    findings: list[ClaimVerification]
    coverage_gaps: list[str]  # topics where sources were insufficient
    confidence: float  # overall groundedness across all findings
    generated_at: str  # ISO timestamp
The coverage_gaps field is required: the planner must fill it, even if only with an empty list. A report schema without a required coverage gap field lets the planner silently omit gaps.
Explicit coverage gaps are more useful than confidently wrong answers. A user who sees "Coverage gap: regulatory landscape in EU after 2023 — sources available only through Q2 2023" knows exactly what additional research is needed. A user who receives a hallucinated answer has no signal that anything is wrong until downstream validation fails.
The category frame
Autonomous research agents that produce verifiable outputs need three components: a planner that decomposes and synthesizes, an executor that retrieves and summarizes with source preservation, and a critic that verifies grounding with access to original documents.
The critic loop doesn't eliminate hallucination; models will always have priors from training data. It makes hallucination visible: the critic separates "found in sources", "inferred", and "unsupported", and the planner reports that distinction rather than flattening it into false confidence.
The structural guarantee: every claim in the final report is either traced to a source URL, explicitly marked as inferred, or surfaced as a coverage gap. There is no fourth category.
If you're building research automation that needs to meet a verifiability standard (compliance, legal, financial), write to me. That's where this pattern earns its complexity cost.