The Fabricated Citation Problem in Multi-Agent Systems

In 2023, attorney Steven Schwartz used ChatGPT to prepare a legal brief in Mata v. Avianca that cited six cases. Every one was fabricated: realistic case names, plausible docket numbers, convincing judicial opinions, none of which existed. In June 2023, the court imposed a $5,000 sanction. The legal profession treated it as a cautionary tale about a solo practitioner who didn't know better.

Three years later, Sullivan & Cromwell — a firm with 1,000+ lawyers and a dedicated AI policy — apologized to a bankruptcy judge for an emergency motion containing 42 inaccuracies, including fabricated citations and misquoted statutory provisions. Damien Charlotin's AI Hallucination Cases Database now tracks 1,348 documented cases worldwide. A single day in March 2026 produced 17 separate court decisions noting suspected hallucinations.

These are not isolated mistakes by careless users. The fabricated citation is a structural failure mode of multi-agent systems that produce referenced output — and it extends far beyond the courtroom.

The anatomy of a fabricated citation

Consider a standard multi-agent document review pipeline. Three agents, three jobs:

  • Agent A (Retriever): Ingests a document corpus. Chunks, embeds, indexes. When queried, returns relevant passages with source metadata.
  • Agent B (Analyst): Receives retrieved passages. Synthesizes findings. Produces structured risk assessments with inline references.
  • Agent C (Drafter): Takes risk assessments and produces a final memo for human review, citing specific documents and sections.

The pipeline runs. Agent A retrieves 47 passages from 200 documents. Agent B produces 12 risk findings. Agent C writes a 4-page memo with 23 citations. Here is one of them:

The Share Purchase Agreement contains a change-of-control provision (Section 4.2) that would trigger accelerated vesting of all unvested equity upon completion of the proposed acquisition, creating an estimated $12.3M in additional dilution.

This looks authoritative. It is specific. It references a real document type that exists in the corpus. A human reviewer skimming the memo will likely accept it at face value. But three things might be wrong:

  1. Section 4.2 does not exist. The Share Purchase Agreement has 3 sections. The model interpolated a section number that pattern-matches what SPAs typically contain.
  2. Section 4.2 exists but says something different. It covers representations and warranties, not change-of-control. The model conflated content from two different documents.
  3. The passage was never retrieved. Agent A returned passages from the SPA, but none of them mentioned change-of-control. Agent B inferred the risk from a different document and Agent C misattributed it to the SPA.

Each of these is a distinct failure mode. Each is invisible to traditional observability. And each has materially different consequences for the human relying on the output.
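
The three cases can be told apart mechanically if the pipeline keeps a document index and the retrieved passages. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: `CitationStatus`, `Passage`, `classify_citation`, and the `document_index` layout (document name to section number to section text) are hypothetical names introduced here, and the keyword check is a crude stand-in for whatever content comparison a real system would use.

```python
from dataclasses import dataclass
from enum import Enum


class CitationStatus(Enum):
    SUPPORTED = "supported"
    SECTION_MISSING = "section_missing"    # case 1: cited section does not exist
    CONTENT_MISMATCH = "content_mismatch"  # case 2: section exists, says something else
    NEVER_RETRIEVED = "never_retrieved"    # case 3: cited passage was never retrieved


@dataclass(frozen=True)
class Passage:
    document: str
    section: str
    text: str


def classify_citation(
    document: str,
    section: str,
    claim_terms: list[str],
    document_index: dict[str, dict[str, str]],
    retrieved: list[Passage],
) -> CitationStatus:
    """Classify one citation against the corpus index and the retrieval results.

    Assumption: a claim "matches" a section when every term in claim_terms
    appears in the section text. Real systems would need something stronger.
    """
    sections = document_index.get(document, {})
    if section not in sections:
        return CitationStatus.SECTION_MISSING

    section_text = sections[section].lower()
    if not all(term.lower() in section_text for term in claim_terms):
        return CitationStatus.CONTENT_MISMATCH

    was_retrieved = any(
        p.document == document and p.section == section for p in retrieved
    )
    if not was_retrieved:
        return CitationStatus.NEVER_RETRIEVED

    return CitationStatus.SUPPORTED
```

Checking existence first, then content, then retrieval is a design choice: it reports the most basic failure when several apply at once.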

What your observability stack sees

If you are running any of the standard tools—LangSmith, Langfuse, Arize, Datadog LLM Observability, or a custom OpenTelemetry setup—here is roughly what you get for this run:

Run: doc-review-run-4821
Status: completed
Duration: 47.2s
Agents invoked: 3
Model calls: 18
Tokens (input): 142,847
Tokens (output): 8,291
Tool calls: 52 (47 retrieval, 3 write, 2 format)
Errors: 0
Latency p95: within SLA

Green across the board. The trace is structurally complete. Every span has a parent. Every tool call returned a result. The run started and the run finished. If you have cost monitoring, you see that you spent $0.89 in API calls. If you have latency alerts, nothing fired.

You might also have the raw trace—a sequence of events showing what happened:

event: run_started        run_id: 4821
event: agent_started      agent: retriever
event: tool_called        tool: vector_search  query: "change of control provisions"
event: tool_completed     tool: vector_search  results: 6 passages
event: agent_completed    agent: retriever
event: agent_started      agent: analyst
event: model_called       model: gpt-4o  tokens_in: 12847
event: model_called       model: gpt-4o  tokens_in: 9421
event: agent_completed    agent: analyst
event: agent_started      agent: drafter
event: model_called       model: gpt-4o  tokens_in: 8103
event: agent_completed    agent: drafter
event: run_completed      run_id: 4821  duration: 47.2s

This trace tells you what happened. It does not tell you whether what happened was correct. The distinction is the entire problem.

The structural gap

The gap is not a missing feature. It is a missing layer.

Traces record events. APMs measure performance. Neither answers the questions that actually matter for a workflow that produces consequential output:

  • Did the cited document actually appear in the retrieval results?
  • Does the cited section contain the claim attributed to it?
  • Was the information in the final output faithfully derived from the source material, or was it confabulated during synthesis?
  • If two agents disagree on a fact, which one's source material is authoritative?

These are not performance questions. They are investigation questions. They require walking the trace backward from output to source, comparing content at each handoff point, and rendering a judgment about whether the chain of custody held.

No amount of span metadata will answer them. You need an agent that reads the trace the way a reviewer reads a case file.

Walking the citation chain

Here is what investigation looks like for the fabricated citation above. The investigator receives the completed trace and the final output. It starts from the claim in the memo and works backward:

Investigation: citation-check / run-4821 / finding-7

Claim in output:
  "Section 4.2 of the Share Purchase Agreement contains
   a change-of-control provision..."

Step 1 — Source attribution
  Drafter (Agent C) attributes this to Analyst finding #4.
  Analyst finding #4 text: "SPA contains change-of-control
    trigger for equity acceleration"
  Analyst source: retriever passage batch #2

Step 2 — Retrieval verification
  Passage batch #2 returned 6 passages.
  Passages from SPA: 3 (sections 1.1, 2.3, 3.1)
  Passages mentioning change-of-control: 1
    → Source: Employment Agreement, Section 7(b)
  Passages from SPA Section 4.2: 0

Step 3 — Content verification
  SPA has 3 sections (verified against document index).
  Section 4.2 does not exist.
  Change-of-control language originates from Employment
    Agreement, not SPA.

Finding: FABRICATED CITATION
  Severity: critical
  Type: source_misattribution + section_hallucination
  Impact: Memo attributes Employment Agreement provisions
    to SPA. Cited section does not exist.
  Blame: Agent C (drafter) fabricated section reference.
    Agent B (analyst) conflated document sources.
    Agent A (retriever) operated correctly.

This is not a dashboard metric. It is a structured investigation that produces a finding with causal attribution—which agent introduced the error, what the error was, and why it matters.
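
The blame step in the investigation above can be illustrated with a small Python sketch. It assumes each agent's attribution for a claim has already been extracted into a `Hop` record (a hypothetical type, as is `attribute_blame`), and it only compares attributions between adjacent agents. It is not how any particular tool implements blame, and it does not check whether a section actually exists; that needs the document index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hop:
    agent: str
    document: str
    section: Optional[str]  # None when the agent named no section


def attribute_blame(chain: list[Hop]) -> list[tuple[str, str]]:
    """Walk the chain backward from the final output to the source.

    `chain` is ordered source-first (retriever, analyst, drafter).
    An agent is blamed when its attribution differs from what the
    agent immediately upstream handed it.
    """
    findings: list[tuple[str, str]] = []
    for i in range(len(chain) - 1, 0, -1):
        downstream, upstream = chain[i], chain[i - 1]
        if downstream.document != upstream.document:
            # The claim was moved to a different source document.
            findings.append((downstream.agent, "source_misattribution"))
        elif downstream.section != upstream.section:
            # Same document, but the section reference was added or altered.
            findings.append((downstream.agent, "section_changed"))
    return findings


# The run described above: the retriever found the language in the
# Employment Agreement, the analyst attributed it to the SPA, and the
# drafter added a section number.
chain = [
    Hop("retriever", "Employment Agreement", "7(b)"),
    Hop("analyst", "SPA", None),
    Hop("drafter", "SPA", "4.2"),
]
findings = attribute_blame(chain)
```

Under these assumptions the drafter and the analyst each receive a finding and the retriever receives none, which matches the blame lines in the investigation output.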

Why this is especially dangerous

Fabricated citations are not like other LLM failure modes. When a model hallucinates a fact with no citation, a careful reader may catch it. When a model produces garbled output, downstream systems often reject it. But a fabricated citation actively impedes verification. It tells the human reviewer: "I checked, and here is where I found it." The reviewer's natural inclination is to trust a specific reference more than a vague claim.

In legal diligence, this means a lawyer may rely on a cited section without re-reading the original document—which is the entire point of having the AI review system in the first place. In medical contexts, a fabricated citation to a specific study or guideline section could directly influence treatment decisions. In financial analysis, a misattributed data point changes the risk profile of a deal.

The failure mode is especially pernicious in multi-agent systems because the provenance chain is long. Agent A retrieves correctly. Agent B synthesizes mostly correctly but loses attribution precision. Agent C rounds the fuzzy attribution into a confident-sounding specific citation. Each agent's output is locally reasonable. The error is emergent.

What investigation requires

Catching fabricated citations requires three capabilities that traditional observability does not provide:

1. Causal blame across agent boundaries. You need to trace the lineage of a specific claim in the final output back through every agent that touched it. This is not span parenting. It is content-level attribution: which agent introduced this specific piece of text, and what was its source? This is what @galea/blame does—it walks the event graph and attributes each output claim to its originating agent and source material.

2. Replay with content verification. You need to re-execute parts of the pipeline to verify that the retrieval results actually support the downstream claims. Not re-running the whole pipeline (expensive, non-deterministic) but targeted replay of specific steps with content comparison. @galea/replay replays retrieval calls and compares the returned content against what downstream agents claimed was retrieved.

3. Priority-scoped severity. A fabricated citation about an immaterial clause in a low-value deal is different from a fabricated citation about a change-of-control provision in a $2B acquisition. The investigation system needs to understand the customer's priorities—what matters to this deployment, for this use case—and score findings accordingly.
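
As a rough illustration of priority-scoped severity, the Python sketch below scales a per-topic weight by deal size. Everything in it is an assumption made for the example: the `Priorities` type, the `score_severity` function, the default weight, the materiality threshold, and the cutoffs between levels. A real deployment would set these from the customer's own priorities.

```python
from dataclasses import dataclass, field


@dataclass
class Priorities:
    # Hypothetical weights: 0.0 (immaterial) to 1.0 (critical to this customer).
    topic_weights: dict[str, float] = field(default_factory=dict)
    deal_value_usd: float = 0.0
    # Hypothetical threshold: deals at or above this size count at full weight.
    material_deal_usd: float = 100_000_000.0


def score_severity(finding_topic: str, priorities: Priorities) -> str:
    """Map a finding's topic and the deal size to a severity label."""
    weight = priorities.topic_weights.get(finding_topic, 0.5)
    scale = min(priorities.deal_value_usd / priorities.material_deal_usd, 1.0)
    score = weight * scale
    # Cutoffs are illustrative, not calibrated.
    if score >= 0.75:
        return "critical"
    if score >= 0.4:
        return "high"
    if score >= 0.15:
        return "medium"
    return "low"


weights = {"change_of_control": 1.0, "notice_address": 0.1}
large_deal = Priorities(topic_weights=weights, deal_value_usd=2_000_000_000)
small_deal = Priorities(topic_weights=weights, deal_value_usd=5_000_000)
```

With these illustrative numbers, the same fabricated change-of-control citation scores as critical on the $2B acquisition and low on the small deal.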

These are not features you bolt onto an APM. They require a fundamentally different architecture: an investigator that reads traces as evidence, not as telemetry.

The scope of the problem

The scale of the problem is now documented, not anecdotal. In the legal field alone, Charlotin's database tracks 1,348 cases. Attorney James Martin Paul was sanctioned for using hallucinated citations across eight separate matters: not a one-time mistake but a systematic pattern. Beyond the courtroom, multi-agent systems across financial, healthcare, and enterprise search verticals show consistent patterns:

  • 5-15% of specific citations in multi-agent outputs contain some form of source misattribution, ranging from wrong section numbers to entirely fabricated references.
  • The rate increases with pipeline depth. Two-agent systems (retrieve + generate) fabricate less than three-agent systems (retrieve + analyze + draft), which fabricate less than four-agent systems with additional review or formatting stages.
  • Retrieval quality has limited impact. Even with high-recall retrieval (recall@10 above 0.9), downstream agents introduce attribution errors during synthesis. The retriever did its job. The citation still broke.
  • Confidence calibration does not help. Agents that express high confidence in their citations are not measurably more accurate. The fabrication is not a function of uncertainty—it is a function of the synthesis process itself.

Every one of these systems reported zero errors in their observability dashboards.

What to do about it

If you are building a multi-agent system that produces cited output, here is the minimum you should be doing:

  1. Log full retrieval content, not just metadata. If your traces only record that a retrieval call returned 6 results, you cannot verify citations after the fact. Log the actual content of retrieved passages alongside source identifiers.
  2. Instrument agent handoffs with content hashes. When Agent B receives output from Agent A, record a hash of what was received. This lets you detect cases where content was modified or lost between agents.
  3. Build citation verification into your eval suite. For every citation in the output, check: (a) does the cited source exist in the retrieval results, (b) does the cited section exist in the source, (c) does the content match the claim. Run these checks on every production trace, not only as an offline batch job.
  4. Treat citation accuracy as a first-class metric. Track it over time. Set alerts. A system that slowly drifts from 95% citation accuracy to 88% is degrading in a way that latency and error rate will never reveal.
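
Steps 1, 2, and 4 can be sketched in a few lines of Python using only the standard library. The function names, the event layout, and the alert threshold are hypothetical choices for this example; the sketch assumes handoff payloads are JSON-serializable and that per-citation pass/fail results come from a separate verification step.

```python
import hashlib
import json


def content_hash(payload) -> str:
    """Stable SHA-256 over a canonical JSON encoding of the payload."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_handoff(log: list, sender: str, receiver: str, payload) -> None:
    """Log the full content of a handoff together with its hash."""
    log.append({
        "event": "handoff",
        "from": sender,
        "to": receiver,
        "sha256": content_hash(payload),
        "payload": payload,  # full passage text, not just a result count
    })


def handoff_intact(sent_payload, received_payload) -> bool:
    """True when the receiver saw exactly what the sender produced."""
    return content_hash(sent_payload) == content_hash(received_payload)


def citation_accuracy(results: list[bool]) -> float:
    """Fraction of citations that passed verification in one run."""
    if not results:
        raise ValueError("no citations to score")
    return sum(results) / len(results)


def should_alert(accuracy: float, threshold: float) -> bool:
    """Fire when accuracy drops below the configured threshold."""
    return accuracy < threshold
```

Sorting keys before hashing keeps the hash stable when two agents serialize the same dictionary in a different key order, so a mismatch indicates a real content change rather than a formatting difference.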

Or you can use an investigation layer that does this automatically. That is what we are building.


Galea's investigator agent walks every trace, checks retrieval-to-citation chains, and produces structured findings with causal attribution. It does not replace your observability stack. It answers the questions your observability stack was never designed to ask.

The fabricated citation problem is not going away. As agent systems get more complex—more handoffs, more synthesis steps, more autonomy—the gap between "the run completed" and "the output is correct" will only widen. The systems that close that gap will be the ones worth trusting with consequential work.