Why Agent Workflows Need Investigation, Not Just Monitoring

May 8, 2026 · Galea Team

A legal AI reviews a $14.5M SaaS acquisition. Five agents coordinate: intake parses the deal memo, a data room agent extracts clauses, a contracts agent identifies risks, a partner review agent cross-checks findings, and a memo agent drafts the output. The run completes in 92 seconds. Every span returns 200. Latency is within the P95 baseline. Cost is $0.38.

The memo cites a most-favored-nation clause from a document called pacific_coast_charter_2023.pdf. That document was never in the data room. The citation is fabricated. The run succeeded. The output is wrong.

This is the gap that monitoring cannot close.

Agent workflows are becoming business workflows

Two years ago, agent workflows were demos. Today they are production systems. Harvey uses multi-agent pipelines for M&A diligence. Abridge deploys agents to summarize clinical visits. Decagon automates customer support with agents that issue refunds, change subscriptions, and escalate tickets. Coding tools like Cursor run agents that edit files, execute commands, and modify infrastructure.

These are not chatbots. They are business-critical processes where the output has consequences: a missed liability clause, a hallucinated medication, an unauthorized refund, a deleted production config. In April 2026, a Cursor coding agent powered by Claude Opus deleted PocketOS's entire production database and backups in nine seconds during a routine staging task. In July 2025, Replit's agent wiped SaaStr founder Jason Lemkin's production database, then generated 4,000 fake records to cover the gap. In February 2024, a British Columbia tribunal held Air Canada liable for a refund policy its chatbot had fabricated. The stakes have changed, but the observability tooling has not kept up.

The monitoring gap

The current generation of agent observability tools (LangSmith, Braintrust, Arize, Latitude) is useful. These tools capture traces, display span trees, store prompt/completion pairs, and run evals. If you need to see what happened during a run, they show you.

But showing what happened is not the same as explaining whether it was good.

Consider what these tools tell you about a completed run:

Tracing tools show a span tree: agent A called tool B, which returned in 2.3 seconds. Useful for debugging latency. Says nothing about whether the tool should have been called at all.
Eval frameworks run test cases against expected outputs. Useful for regression testing. Says nothing about production runs that fall outside your eval suite.
Prompt loggers store the system prompt and the messages exchanged with the model. Useful for prompt iteration. Says nothing about whether the final output was correct, safe, or authorized.

None of these answer the question a team lead actually asks after an incident: should we trust this run?

Completion does not equal correctness

The fundamental problem is that agent workflows can complete successfully while being substantively wrong. This is not a rare edge case. It is the default failure mode. In April 2026, Sullivan & Cromwell, one of the most prestigious law firms in the world, submitted an emergency motion in a bankruptcy case containing 42 inaccuracies, including fabricated case citations and misquoted sections of the Bankruptcy Code. The firm had AI safeguards. The safeguards were not followed. The filing looked professional, well-formatted, and legally coherent. It was substantively wrong.

A monitoring tool that watches for errors will see a clean run. An investigation that walks the workflow with context will see something different:

What monitoring sees

✓ ALL CLEAR

run_id: a8f2c...
status: completed
duration: 92.4s
spans: 23/23 OK
errors: 0
cost: $0.38

What investigation finds

✗ 1 ERROR · 2 CONCERNS

fabricated_citation (contracts_agent)
anomalous_token_usage (2.8× baseline)
unreviewed_tool_call (scope="all")

Full investigation output

[ERROR] fabricated_citation
  agent:    contracts_agent
  claim:    "Pacific Coast Lines MFN clause (§4.2)"
  cited:    pacific_coast_charter_2023.pdf#page=4
  verdict:  DOC_MISSING — not in data room snapshot
  priority: correctness × 0.9 = 0.9 (threshold: 0.7)

[CONCERN] anomalous_token_usage
  agent:    partner_review_agent
  baseline: 4,200 tokens
  actual:   11,800 tokens (2.8×)
  note:     3 retries before completion

[CONCERN] unreviewed_tool_call
  agent:    data_room_agent
  tool:     dataroom.extract_clause(scope="all")
  note:     scope="all" bypasses per-matter filtering

Same run. Same events. One view says clean. The other names a fabricated citation, flags anomalous token usage, and identifies a tool call that bypassed per-matter filtering.

This is not a matter of having more dashboards. It is a different kind of analysis. Monitoring answers "did it run?" Investigation answers "should we trust it?"
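The fabricated_citation finding comes down to a lookup: does every document an agent cites exist in the set of documents that was actually available to the run? A minimal Python sketch of that check follows. The Citation type, the verify_citations function, and the snapshot contents are hypothetical names and data chosen for illustration, not a description of any product's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    agent: str   # agent that made the claim
    claim: str   # text of the claim
    source: str  # cited document, optionally with a "#page=N" fragment


def verify_citations(citations, snapshot):
    """Return one finding per citation whose document is not in the snapshot."""
    findings = []
    for c in citations:
        doc = c.source.split("#", 1)[0]  # drop the page fragment
        if doc not in snapshot:
            findings.append({
                "type": "fabricated_citation",
                "agent": c.agent,
                "claim": c.claim,
                "cited": c.source,
                "verdict": "DOC_MISSING",
            })
    return findings


# Hypothetical data room snapshot: file names captured when the run started.
snapshot = {"deal_memo.pdf", "master_services_agreement.pdf"}

citations = [
    Citation("contracts_agent", "Limitation of liability (§9.1)",
             "master_services_agreement.pdf#page=12"),
    Citation("contracts_agent", "Pacific Coast Lines MFN clause (§4.2)",
             "pacific_coast_charter_2023.pdf#page=4"),
]

findings = verify_citations(citations, snapshot)
for f in findings:
    print(f"[ERROR] {f['type']}: {f['cited']} -> {f['verdict']}")
```

A document being present is necessary but not sufficient. A fuller check would also confirm that the cited page contains the quoted clause.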

Investigation must be customer-specific

Generic analysis is not enough because different companies have different definitions of failure.

A legal AI company deploying M&A diligence workflows has a specific priority order: citation accuracy above all, then audit trail integrity, then regulatory compliance. Latency and cost are secondary. A fabricated citation in a memo that goes to a partner is a potential malpractice event, not a minor quality issue.

A customer support automation company has a different order: tool safety (did the agent issue an unauthorized refund?), then cost (is the agent burning tokens on retries?), then latency (is the customer waiting too long?). A hallucinated product name in a support reply is bad but recoverable. An unauthorized $500 refund is a direct financial loss.

A clinical AI scribe cares about PHI exposure and medical accuracy above everything. A coding agent operator cares about whether the agent edited a protected file or leaked a secret.

The same trace data produces different investigations depending on who is looking at it. A finding that scores as a critical error for Harvey might score as an informational note for a team that optimizes primarily for cost. This is not a cosmetic difference in dashboard layout. It requires a priority model that shapes how every finding is scored, ranked, and surfaced.
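One way to picture such a priority model is a per-company table of axis weights, multiplied by each finding's severity. The Python sketch below is an assumption about how scoring could work, not a description of a specific product: the Finding type, the weights, the severity values, and the 0.7 threshold are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    name: str
    axis: str        # which priority axis the finding belongs to
    severity: float  # 0.0 (harmless) to 1.0 (worst case)


# Hypothetical per-company weights: how much each axis matters, 0.0 to 1.0.
LEGAL_PRIORITIES = {"correctness": 1.0, "tool_safety": 0.6, "cost": 0.2}
SUPPORT_PRIORITIES = {"correctness": 0.5, "tool_safety": 1.0, "cost": 0.8}


def rank(findings, priorities, threshold=0.7):
    """Score each finding as weight * severity and sort highest first."""
    scored = []
    for f in findings:
        score = round(priorities.get(f.axis, 0.0) * f.severity, 2)
        level = "ERROR" if score >= threshold else "CONCERN"
        scored.append((score, level, f.name))
    return sorted(scored, reverse=True)


findings = [
    Finding("fabricated_citation", "correctness", 0.9),
    Finding("anomalous_token_usage", "cost", 0.5),
    Finding("unreviewed_tool_call", "tool_safety", 0.8),
]

legal = rank(findings, LEGAL_PRIORITIES)
support = rank(findings, SUPPORT_PRIORITIES)
print(legal)
print(support)
```

With these made-up weights, the same three findings rank differently: the fabricated citation is the top item under the legal priorities, while the unreviewed tool call is the top item under the support priorities.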

The investigation layer must be runtime-agnostic

There is a practical reason why investigation cannot be built into any single agent framework: teams are not standardizing.

One company uses LangGraph for its support bot, the OpenAI Agents SDK for an internal tool, and a custom Python pipeline for its data processing agents. Another builds on the Claude Agent SDK with MCP tool calls. A third uses CrewAI for some workflows and Temporal for others. Some teams use Mercury. Many use custom code with no framework at all.

A framework-specific investigation tool helps one team. A runtime-agnostic investigation layer helps the organization. It normalizes events from any source into a common trace model, applies the same priority framework, and produces investigations that are comparable across runtimes.
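As a rough illustration of normalization, the Python sketch below maps events from two made-up runtime formats onto a shared event type such as tool_called. The raw payload shapes, the adapter functions, and the TraceEvent fields are invented for the example. They do not match the actual event formats of LangGraph, the OpenAI Agents SDK, or any other framework.

```python
from dataclasses import dataclass, field


@dataclass
class TraceEvent:
    run_id: str
    type: str   # e.g. "model_called", "tool_called"
    agent: str
    attrs: dict = field(default_factory=dict)


def from_runtime_a(raw):
    """Hypothetical runtime A: flat dicts with a 'kind' field."""
    kinds = {"llm": "model_called", "tool": "tool_called"}
    return TraceEvent(
        run_id=raw["run"],
        type=kinds[raw["kind"]],
        agent=raw["node"],
        attrs={"name": raw["name"]},
    )


def from_runtime_b(raw):
    """Hypothetical runtime B: nested dicts with an 'event' field."""
    events = {"model.request": "model_called", "tool.invoke": "tool_called"}
    return TraceEvent(
        run_id=raw["trace"]["id"],
        type=events[raw["event"]],
        agent=raw["actor"],
        attrs={"name": raw["payload"]["target"]},
    )


# One adapter per source; adding a runtime means adding one function here.
ADAPTERS = {"runtime_a": from_runtime_a, "runtime_b": from_runtime_b}


def normalize(source, raw):
    """Convert one raw event from a named source into a TraceEvent."""
    return ADAPTERS[source](raw)


a = normalize("runtime_a", {"run": "r1", "kind": "tool",
                            "node": "data_room_agent",
                            "name": "dataroom.extract_clause"})
b = normalize("runtime_b", {"trace": {"id": "r2"}, "event": "tool.invoke",
                            "actor": "support_agent",
                            "payload": {"target": "billing.refund"}})
print(a.type, b.type)
```

Once both events share a type and field layout, the same downstream checks can run on either one without knowing which runtime produced it.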

This is the same pattern that made Datadog valuable for infrastructure: it did not matter whether you ran on AWS, GCP, or bare metal. The monitoring layer sat above the runtime and gave you a unified view. Agent workflows need the same thing, except the analysis is harder because "the server is up" is a much simpler question than "the legal memo is correct."

What investigation actually requires

Building an investigation layer is more than wrapping an LLM around trace data. It requires several components working together:

Normalized trace model. Events from different runtimes need to map to a common schema: run_started, agent_started, model_called, tool_called, handoff_created, approval_required. Without normalization, you cannot compare runs across frameworks.
Company context and priorities. The investigation needs to know what matters to this company. A priority model with axes like correctness, audit, regulatory compliance, cost, latency, tool safety, memory safety, and privacy gives the investigator a framework for scoring findings.
Causal analysis. When something goes wrong, the system needs to attribute blame: which agent caused the failure, which upstream event contributed, and what the confidence is. A span tree shows sequence. Blame analysis shows causation.
Evidence verification. For correctness-critical workflows, the system needs to verify claims against sources. If an agent cites a document, was that document actually retrieved? If it quotes a number, does the number appear in the context? This is replay, not logging.
Baselines and anomaly detection. A single run in isolation has limited signal. The same run compared against hundreds of prior runs reveals anomalies: unusual token usage, unexpected tool calls, missing steps that normally occur.
Durable fixes. An investigation that names a problem without recommending a fix is incomplete. The system should propose evals that catch the same failure class, alerts that fire on recurrence, and guardrails that prevent it.
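Of the components above, the baseline comparison is the easiest to sketch. The Python example below flags a run whose token usage exceeds a multiple of the median of prior runs. The token_anomaly function, the 2.0× cutoff, and the history values are assumptions for illustration. A production system would likely use a more robust statistic and per-agent baselines.

```python
from statistics import median


def token_anomaly(history, actual, max_ratio=2.0):
    """Compare a run's token count with the median of prior runs.

    Returns (ratio, is_anomalous). max_ratio is an illustrative cutoff.
    """
    if not history:
        raise ValueError("need at least one prior run to form a baseline")
    baseline = median(history)
    ratio = actual / baseline
    return round(ratio, 1), ratio > max_ratio


# Hypothetical token counts from prior partner_review_agent runs.
history = [3900, 4100, 4200, 4300, 4600]

ratio, anomalous = token_anomaly(history, 11800)
print(f"{ratio}x baseline, anomalous={anomalous}")
```

The median is used instead of the mean so that one earlier outlier run does not inflate the baseline and hide the next anomaly.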

The category is investigation

We think investigation is a distinct category from monitoring, tracing, and eval. It sits above all three. It consumes their data. But it asks a different question.

Monitoring asks: did it run?
Tracing asks: what happened?
Evals ask: did it match the expected output?
Investigation asks: was this run correct, allowed, efficient, and safe, given what this company cares about?

Agent workflows are becoming the substrate for business operations. The teams deploying them need more than green checkmarks and span trees. They need an independent layer that watches every run, explains what mattered, and turns incidents into durable improvements.

That is what we are building at Galea. If your agents make decisions that matter, we would like to show you what investigation looks like on your workflows. Reach out at [email protected].