Why Autonomous Research Agents Hallucinate — and How a Critic Loop Fixes It
Planner-executor agents fail on verifiability. Adding a critic agent with independent source access is the structural fix that survives adversarial queries.
Why this matters
An autonomous research agent that hallucinates can be worse than no agent at all. It produces outputs that look authoritative, include plausible citations, and may be partly or entirely wrong. The researcher receiving the output has no efficient way to verify it. That is often why the work was delegated to an agent in the first place.
I've built research agents that failed quietly in exactly this way. The planner generated useful sub-questions. The executor retrieved real documents. Then the synthesis step tried to turn partial results into a coherent report under context pressure. It filled gaps with claims that had no source.
Those claims were not in any retrieved document. They came from the model's training data, then appeared in the report as if they had been researched.
The failure is architectural, not a prompt-engineering problem. A planner-executor pair with no verification step cannot reliably distinguish "found in a source" from "the model is confident about this". Adding a critic with independent source access is a structural fix. Here's why, and what it looks like in practice.
1. The planner-executor pattern and its limits
The planner-executor pair is the standard blueprint for autonomous research. A planner decomposes a high-level topic into sub-questions. An executor retrieves and summarizes answers to each sub-question. The planner then synthesizes the executor's outputs into a final report.
This works when the corpus is well-defined, the sub-questions are independently answerable, and the retrieved documents contain enough relevant information. It breaks when the corpus is sparse, the sub-questions are ambiguous, or the synthesis step has to reconcile partial and contradictory sources.
The synthesis failure mode is the dangerous one. The planner receives executor outputs such as "Source A says X" and "Source B is ambiguous about Y". It still has to produce a coherent report. With limited tokens and many sources, the model may fill ambiguous parts with what it expects the answer to be, based on training data.
The output sounds coherent. The claims are unverifiable.
I do not expect a critic loop to fix sparse corpora or ambiguous questions. Its job is narrower: surface the failure. It flags claims that cannot be mapped to retrieved sources and gives the planner enough signal to re-query or explicitly mark uncertainty.
2. What a critic agent actually does
The critic is not a second synthesizer. I use it as a verification agent with two specific capabilities: access to the same source documents the executor retrieved, and a strict grounding constraint.
The task is intentionally narrow. Given the executor's summary and the source documents, the critic marks every claim as grounded (found verbatim or paraphrased in a source), inferred (logically follows from sources but is not stated directly), or unsupported (not found in any retrieved source).
from pydantic import BaseModel
from typing import Literal


class ClaimVerification(BaseModel):
    claim: str
    status: Literal["grounded", "inferred", "unsupported"]
    source_url: str | None  # required when status == "grounded"
    confidence: float  # 0.0 to 1.0


class CriticOutput(BaseModel):
    verified_claims: list[ClaimVerification]
    overall_groundedness: float  # fraction of claims that are "grounded"
    flags: list[str]  # specific issues for the planner to act on
I keep this output structured because the planner needs a stable control signal. The critic does not return free text, and it does not route the result back to the executor.
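The two comments in the schema (a source URL is required for grounded claims, and groundedness is a fraction) can be enforced in code instead of left to convention. Below is a minimal sketch, assuming plain dataclasses as a stand-in for the pydantic models so it runs on the standard library alone. The `Claim` class and the `overall_groundedness` helper are hypothetical names, and scoring an empty claim list as 0.0 is my assumption so that an empty report is never approved by default.

```python
from dataclasses import dataclass
from typing import Literal, Optional

Status = Literal["grounded", "inferred", "unsupported"]


@dataclass
class Claim:
    """Hypothetical stand-in for ClaimVerification."""
    claim: str
    status: Status
    source_url: Optional[str] = None
    confidence: float = 0.0

    def __post_init__(self) -> None:
        # A grounded claim without a source URL contradicts itself: fail early.
        if self.status == "grounded" and not self.source_url:
            raise ValueError(f"grounded claim has no source_url: {self.claim!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


def overall_groundedness(claims: list[Claim]) -> float:
    """Fraction of claims marked grounded (assumption: 0.0 for an empty list)."""
    if not claims:
        return 0.0
    grounded = sum(1 for c in claims if c.status == "grounded")
    return grounded / len(claims)
```

Computing the score from the claim list, instead of asking the model to report it, keeps the number consistent with the per-claim statuses.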
The planner reads overall_groundedness and the flags list, then chooses the next step. If groundedness exceeds a configured threshold, for example 0.8, the report can be approved. If it falls below that threshold, the planner re-queues the flagged sub-questions with explicit instructions to find sources for the unsupported claims.
This loop also needs a max-iterations guard. In my own implementations, I prefer a hard stop over an agent that keeps searching until it finds something plausible. If groundedness does not improve after two re-query cycles, the planner marks the low-confidence claims explicitly in the final report instead of silently passing them through.
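The threshold check and the hard stop can be sketched as one small control loop. This is an illustration under assumptions, not a definitive implementation: `run_critic` and `requery` are hypothetical callables supplied by the caller, `CriticResult` is a reduced view of the critic output, and treating a score exactly at the threshold as approved is my choice. The sketch also stops after a fixed number of re-queries regardless of whether the score improved, which is stricter than stopping only when it stalls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CriticResult:
    """Hypothetical minimal view of the critic output."""
    overall_groundedness: float
    flags: list[str] = field(default_factory=list)


def run_critic_loop(
    run_critic: Callable[[], CriticResult],
    requery: Callable[[list[str]], None],
    threshold: float = 0.8,
    max_requeries: int = 2,
) -> tuple[str, CriticResult]:
    """Return ("approved" or "marked_uncertain", last critic result)."""
    result = run_critic()
    for _ in range(max_requeries):
        if result.overall_groundedness >= threshold:
            return "approved", result
        # Re-queue the flagged sub-questions, then verify again.
        requery(result.flags)
        result = run_critic()
    if result.overall_groundedness >= threshold:
        return "approved", result
    # Hard stop: the caller marks low-confidence claims in the final report.
    return "marked_uncertain", result
```

Returning the last critic result alongside the status lets the planner label exactly which claims stayed unsupported.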
3. Independent source access for the critic
The critic must have access to the same source documents the executor retrieved. It cannot rely only on the executor's summary of those documents. This is a requirement I do not relax.
If the critic only sees the executor's summary, it can check whether the summary contradicts itself. It cannot check whether a claim is actually present in the source. A summary can be internally consistent and still be wrong.
# Wrong: critic sees only the summary
critic_input = {
    "summary": executor_output.summary,
    "task": "Verify the claims in this summary.",
}
# Critic can check consistency but not source grounding

# Right: critic sees summary and original documents
critic_input = {
    "summary": executor_output.summary,
    "source_documents": executor_output.source_docs,  # original text, not summaries
    "task": "For each claim in the summary, verify it against the source documents.",
}
# Critic can map claims to specific passages
The practical consequence is context size. The critic's context window must hold the summary and the relevant source excerpts.
For long-form research tasks with many sources, I usually handle this in one of two ways. I either run the critic on one sub-question at a time, or I use a large-context model for the critic pass only.
Running per sub-question is cheaper, but it can miss multi-source claims such as "Sources A and B both confirm that..." unless the full claim context is passed explicitly. Running the critic on the full corpus at once is more expensive, but it can catch cross-source contradictions. For example, it can see when two sources give different dates, different revenue figures, or conflicting regulatory interpretations for the same claim.
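The choice between the two modes can be made mechanically from a size estimate. The sketch below is a rough illustration: `estimate_tokens` uses a four-characters-per-token heuristic that is only an assumption, and `plan_critic_passes` and `token_budget` are hypothetical names. It returns one batch covering every sub-question when the corpus fits the budget, and one batch per sub-question otherwise.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def plan_critic_passes(
    docs_by_question: dict[str, list[str]], token_budget: int
) -> list[list[str]]:
    """Group sub-questions into critic passes that fit the token budget."""
    total = sum(
        estimate_tokens(doc)
        for docs in docs_by_question.values()
        for doc in docs
    )
    if total <= token_budget:
        # Everything fits: one full-corpus pass can catch cross-source conflicts.
        return [list(docs_by_question)]
    # Otherwise fall back to one critic pass per sub-question.
    return [[question] for question in docs_by_question]
```

A real tokenizer would give a tighter estimate. The heuristic is only there to make the trade-off explicit.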
4. Recursive summarization and citation tracing
Autonomous research agents frequently retrieve more content than fits in a single context window. The standard response is recursive summarization: summarize document A, summarize document B, then synthesize the summaries.
Recursive summarization is useful for compression. It is risky for citation tracing.
When document A is summarized, the condensed version loses specific passages. The critic can no longer map claims to exact text in document A. The evidentiary chain has been severed.
A simple way to preserve it is to store the original document alongside the summary, then pass both to the critic.
class ExecutorResult(BaseModel):
    sub_question: str
    summary: str  # compressed, used for planning
    source_docs: list[str]  # original text, used for critic verification
    source_urls: list[str]
The critic reads the summary to understand claim context. It then searches the original text for the specific passage.
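To make the passage lookup concrete, here is a deliberately crude sketch that scores each source sentence by how many of the claim's words it contains. Lexical overlap is a swapped-in stand-in for whatever the critic model actually does: it will miss paraphrases and should not be read as the verification method itself. `best_passage` is a hypothetical helper.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_passage(claim: str, source_docs: list[str]) -> tuple[float, str]:
    """Return (overlap score, sentence) for the best-matching source sentence.

    The score is the fraction of the claim's words found in the sentence.
    """
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0, ""
    best_score, best_sentence = 0.0, ""
    for doc in source_docs:
        # Split on whitespace that follows sentence-ending punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            score = len(claim_tokens & _tokens(sentence)) / len(claim_tokens)
            if score > best_score:
                best_score, best_sentence = score, sentence.strip()
    return best_score, best_sentence
```

A low best score is a cheap signal that a claim deserves a closer look. A high score still needs the critic to confirm that the passage says what the claim says.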
This adds storage to every executor result, because the original text travels alongside each summary. That is acceptable when the corpus is measured in thousands of tokens. If corpus size makes this infeasible, I would embed the originals and run citation search at critic time instead of passing them inline.
5. Map-reduce for independent sub-questions
When the research task decomposes into independent sub-questions — market trends, competitive landscape, regulatory environment — the planner can dispatch the executor in parallel (map) and aggregate verified results (reduce).
The map step dispatches all sub-questions simultaneously. The reduce step synthesizes all verified executor outputs into a final report. The critic runs on the final synthesis against the full corpus, not on each individual executor result.
This is faster than sequential execution and keeps verification attached to the final output. The reduce step passes all source documents to the critic, so the final synthesis can be checked against everything retrieved.
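The dispatch pattern can be sketched with `asyncio.gather`, which runs the coroutines concurrently and returns results in input order. The `execute` function here is a hypothetical stub. A real executor would call retrieval and a model, and the reduce step would be an LLM synthesis, not a string join. The point of the sketch is only that the original documents from every sub-question are collected for the critic.

```python
import asyncio


async def execute(sub_question: str) -> dict:
    """Hypothetical executor stub standing in for retrieval plus summarization."""
    await asyncio.sleep(0)  # placeholder for network-bound work
    return {
        "sub_question": sub_question,
        "summary": f"summary for {sub_question}",
        "source_docs": [f"original text for {sub_question}"],
    }


async def map_reduce(sub_questions: list[str]) -> dict:
    # Map: dispatch every sub-question concurrently.
    results = await asyncio.gather(*(execute(q) for q in sub_questions))
    # Reduce: join the summaries and keep every original document for the critic.
    synthesis = "\n".join(r["summary"] for r in results)
    corpus = [doc for r in results for doc in r["source_docs"]]
    return {"synthesis": synthesis, "critic_corpus": corpus}


if __name__ == "__main__":
    questions = ["market trends", "competitive landscape", "regulatory environment"]
    report = asyncio.run(map_reduce(questions))
    print(len(report["critic_corpus"]), "source documents passed to the critic")
```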
The failure mode to avoid in the reduce step is cross-question drift. The planner may synthesize across sub-questions without noticing that the same claim appeared in multiple executor results with conflicting values. "The market size is $5B" from one sub-question and "The market size is $3B" from another should be flagged, not averaged. The critic's flags list surfaces these conflicts.
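A conflict check of this kind can be sketched as a grouping step. The sketch assumes the claims have already been normalized into (metric, value, sub-question) triples, which is the hard part and is not shown. `find_conflicts` is a hypothetical helper whose output would feed the critic's flags list.

```python
from collections import defaultdict


def find_conflicts(findings: list[tuple[str, str, str]]) -> list[str]:
    """Flag metrics reported with more than one distinct value.

    Each finding is a (metric, value, sub_question) triple.
    """
    values_by_metric: dict[str, dict[str, list[str]]] = defaultdict(dict)
    for metric, value, sub_question in findings:
        values_by_metric[metric].setdefault(value, []).append(sub_question)

    flags = []
    for metric, values in values_by_metric.items():
        if len(values) > 1:
            # Report every value with its origin; never average or pick one.
            detail = "; ".join(
                f"{value} (from {', '.join(questions)})"
                for value, questions in values.items()
            )
            flags.append(f"conflict on {metric}: {detail}")
    return flags
```

With the market-size example above, the function would return one flag naming both values and the sub-questions that produced them.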
6. Handling coverage gaps explicitly
A research agent that works with sparse sources needs explicit coverage-gap handling. The critic loop also needs to handle cases where retrieved sources are insufficient for the question.
The signal is repeated unsupported claims. If the critic consistently marks claims as unsupported across two re-query cycles with different search terms, the failure is likely corpus coverage, not query quality. In that case, I prefer to report the coverage gap explicitly instead of producing a plausible-sounding answer.
class ResearchReport(BaseModel):
    findings: list[ClaimVerification]
    coverage_gaps: list[str]  # topics where sources were insufficient
    confidence: float  # overall groundedness across all findings
    generated_at: str  # ISO timestamp
The coverage_gaps field is required. I make the planner fill it, even if the value is an empty list. A report schema without a required coverage gap field can allow gaps to be omitted.
Explicit coverage gaps are more useful than confidently wrong answers. A user who sees "Coverage gap: regulatory landscape in EU after 2023 — sources available only through Q2 2023" knows what additional research is needed. A user who receives a hallucinated answer has no signal that anything is wrong until downstream validation fails.
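The repeated-unsupported signal can be turned into a small check. In this sketch, each re-query cycle contributes the set of topics whose claims the critic marked unsupported, and a topic counts as a coverage gap only if it appears in each of the most recent cycles. `detect_coverage_gaps` is a hypothetical helper, and it assumes the caller already varied the search terms between cycles, which the function does not verify.

```python
def detect_coverage_gaps(
    unsupported_by_cycle: list[set[str]], min_cycles: int = 2
) -> list[str]:
    """Return topics left unsupported in each of the last `min_cycles` cycles."""
    if min_cycles < 1:
        raise ValueError("min_cycles must be at least 1")
    if len(unsupported_by_cycle) < min_cycles:
        # Not enough re-query history to call anything a coverage gap yet.
        return []
    recent = unsupported_by_cycle[-min_cycles:]
    persistent = set.intersection(*recent)
    return sorted(persistent)
```

The returned list can populate `coverage_gaps` directly, and an empty list is still an explicit answer.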
The category frame
Autonomous research agents that produce verifiable outputs need three components: a planner that decomposes and synthesizes, an executor that retrieves and summarizes with source preservation, and a critic that checks claims against original documents.
The critic loop does not eliminate grounding errors. It makes unsupported claims visible and actionable. The planner can then preserve the distinction between sourced claims, justified inferences, and coverage gaps instead of flattening it into false confidence.
The goal is to make every claim either traceable to a source URL, explicitly marked as inferred, or surfaced as a coverage gap. The system should not silently create a fourth category.
If you're building research automation for compliance, legal, or financial workflows, write me and I can review the verification architecture against the checks described here. I can help examine where source preservation, critic access, and coverage-gap reporting fit into the system design.
FAQ
Why does a planner-executor research agent hallucinate during synthesis?
I see the dangerous failure in the synthesis step. The planner receives partial or ambiguous executor outputs and must produce a coherent report under context pressure. In that setting, the model may fill gaps with claims from training data. Those claims can then appear as if they were found in retrieved sources.
What does a critic agent verify in an autonomous research workflow?
I use the critic as a narrow verification agent, not a second synthesizer. It receives the executor summary and the source documents. For each claim, it returns a status: grounded, inferred, or unsupported. It also returns structured output with planner-facing flags and an overall groundedness score.
Why must the critic see the original source documents?
I do not let the critic rely only on the executor summary. A summary can be internally consistent and still be wrong. The critic needs the original retrieved documents to map claims to specific source passages.
How should recursive summarization preserve citation tracing?
I keep the original document alongside the compressed summary. The summary is useful for planning, but it can lose the exact passages needed for verification. The critic reads the summary for claim context, then checks the original text for the supporting passage.
What should happen when claims remain unsupported after re-querying?
I treat repeated unsupported claims across two re-query cycles with different search terms as a likely coverage gap, not just a query failure. The planner should report the gap explicitly. It should not pass through a plausible-sounding answer with false confidence.