Here is a question that sounds simple but breaks most observability tools: was this agent workflow run actually good?
Not "did it finish." Not "were there errors." Good. As in: did it do the right thing, at acceptable cost, without creating risk that the business cares about?
The answer depends entirely on who is asking. And that is the problem with every generic dashboard on the market today.
Four companies, four definitions of failure
Consider four real companies deploying AI agents into production workflows. Each has agents making decisions that affect their business. Each has a completely different definition of what "going wrong" looks like.
Harvey builds AI for legal M&A diligence. Their agents read data rooms, extract clauses, and draft memos that attorneys rely on. A run that completes in 90 seconds and costs $0.40 in tokens is worthless if the memo cites a document that was never in the data room. For Harvey, correctness of citations is existential — ask Sullivan & Cromwell, who in April 2026 had to apologize to a bankruptcy judge after an emergency motion contained 42 inaccuracies including fabricated case citations. Audit trail completeness matters because regulators may review the work. Latency is a nice-to-have.
Decagon automates customer support. Their agents resolve tickets, issue refunds, update accounts. A run that produces a perfectly accurate response but takes 45 seconds is a failure because the customer already left. A run that responds instantly but issues a $500 refund the customer was not entitled to is a different kind of failure entirely — in February 2024, Air Canada was held legally liable for a refund its chatbot fabricated, after a tribunal ruled that companies are bound by what their AI agents promise. For Decagon, tool safety and latency are where the risk concentrates. Citation accuracy barely registers.
Cursor ships an agentic coding IDE. Their agents edit files, run terminal commands, install packages. A run where the agent writes correct code but deletes a production config file along the way is catastrophic — exactly what happened to PocketOS in April 2026, when a Cursor agent powered by Claude Opus deleted their entire production database and backups in nine seconds during a routine staging task. For Cursor, tool safety is the priority that keeps the team up at night, followed by correctness of code suggestions.
Abridge builds clinical AI that generates medical visit summaries. Their agents listen to doctor-patient conversations and produce notes that go into the medical record. A summary that invents a medication the patient never mentioned is a patient safety issue. A summary that leaks protected health information to an unauthorized system is a HIPAA violation. For Abridge, correctness and PHI exposure are the axes that matter. Cost per summary is a distant concern.
Same technology. Same trace format. Completely different definitions of what counts as a problem.
What generic dashboards show you
Open LangSmith, Datadog LLM Observability, Braintrust, or any general-purpose observability tool. You get a competent set of metrics: latency distributions, token counts, error rates, span trees, cost per run. These are useful. They are also the same for every customer.
The dashboard does not know that Harvey needs every citation verified against the data room. It does not know that Decagon needs refund actions flagged for review above a threshold. It does not know that Cursor needs file-deletion events treated as critical. It does not know that Abridge needs PHI in model outputs detected and blocked.
These tools answer "what happened" with admirable precision. They do not answer "should we be worried" because that question requires context they do not have.
A "successful" run for Harvey that misses a material risk in a $500M acquisition is catastrophically different from a "successful" run for Decagon that takes 200ms longer than expected. Generic dashboards treat both the same way: green checkmark, run completed, move on.
The 10-axis priority model
Galea gives every customer a priority configuration with ten axes. Each axis represents a distinct dimension of agent workflow quality:
- correctness: output accuracy, factual grounding, citation validity
- audit: completeness of decision trail, reproducibility
- regulatory_compliance: adherence to domain-specific regulations
- cost: token spend, API calls, compute relative to baseline
- latency: end-to-end response time, time-to-first-token
- throughput: concurrent capacity, queue depth, batch efficiency
- tool_safety: risk profile of tool calls, unauthorized actions
- memory_safety: context window management, retrieval hygiene
- privacy_phi: protected health information exposure
- privacy_pii: personally identifiable information exposure
Each customer sets weights for these axes. The weights are not just filters on a dashboard. They change what the investigator agent looks for when it walks a trace.
Here is what the weight configuration looks like across the four companies we discussed:
| Priority axis | Harvey | Decagon | Cursor | Abridge |
|---|---|---|---|---|
| correctness | 0.95 | 0.60 | 0.85 | 0.95 |
| audit | 0.90 | 0.30 | 0.10 | 0.85 |
| regulatory_compliance | 0.85 | 0.40 | 0.05 | 0.90 |
| cost | 0.30 | 0.80 | 0.50 | 0.25 |
| latency | 0.15 | 0.85 | 0.80 | 0.30 |
| throughput | 0.10 | 0.70 | 0.60 | 0.20 |
| tool_safety | 0.50 | 0.90 | 0.95 | 0.50 |
| memory_safety | 0.40 | 0.35 | 0.55 | 0.40 |
| privacy_phi | 0.10 | 0.10 | 0.05 | 0.95 |
| privacy_pii | 0.45 | 0.65 | 0.20 | 0.70 |
Read the columns, not the rows. Each column tells a story about what kind of company this is and what keeps its team awake.
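To make the table concrete, here is a minimal sketch of how a per-customer priority configuration could be represented and validated. The `PriorityConfig` class, the `column` helper, and the `top_axes` method are hypothetical names for illustration, not Galea's actual schema, and the assumption that weights fall in the range 0 to 1 is inferred only from the values in the table.

```python
from dataclasses import dataclass

# Axis names exactly as listed in the article.
AXES = (
    "correctness", "audit", "regulatory_compliance", "cost", "latency",
    "throughput", "tool_safety", "memory_safety", "privacy_phi", "privacy_pii",
)


@dataclass(frozen=True)
class PriorityConfig:
    """Hypothetical container for one customer's weights (not Galea's real schema)."""
    customer: str
    weights: dict

    def __post_init__(self):
        missing = set(AXES) - set(self.weights)
        unknown = set(self.weights) - set(AXES)
        if missing or unknown:
            raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
        for axis, weight in self.weights.items():
            # Assumption: weights live in [0, 1], as every value in the table does.
            if not 0.0 <= weight <= 1.0:
                raise ValueError(f"{axis} weight {weight} is outside [0, 1]")

    def top_axes(self, n=3):
        """Return the n highest-weighted axes; ties keep the order of AXES."""
        return sorted(AXES, key=lambda a: self.weights[a], reverse=True)[:n]


def column(customer, values):
    """Build a config from one table column, with values given in AXES order."""
    return PriorityConfig(customer, dict(zip(AXES, values)))


# Values copied from the table, one column per customer.
HARVEY = column("Harvey", [0.95, 0.90, 0.85, 0.30, 0.15, 0.10, 0.50, 0.40, 0.10, 0.45])
DECAGON = column("Decagon", [0.60, 0.30, 0.40, 0.80, 0.85, 0.70, 0.90, 0.35, 0.10, 0.65])
CURSOR = column("Cursor", [0.85, 0.10, 0.05, 0.50, 0.80, 0.60, 0.95, 0.55, 0.05, 0.20])
ABRIDGE = column("Abridge", [0.95, 0.85, 0.90, 0.25, 0.30, 0.20, 0.50, 0.40, 0.95, 0.70])

for cfg in (HARVEY, DECAGON, CURSOR, ABRIDGE):
    print(cfg.customer, cfg.top_axes())
```

Printing the top three axes per customer is the "read the columns" exercise in code: Harvey leads with correctness, audit, and regulatory compliance, while Decagon leads with tool safety, latency, and cost.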
Same trace, different investigation
Here is where this gets concrete. Suppose the investigator finds a single anomaly in a trace: the agent called a tool that modified external state without explicit user approval. In a generic dashboard, this gets flagged either for everyone or for no one.
In Galea, the finding gets scored against the customer's priority weights. The same raw finding produces different severity levels for different customers:
Harvey: Unreviewed state mutation
Agent modified external document without approval. Weights: audit 0.90, tool_safety 0.50.
SEVERITY: INFO
Flagged for the audit trail but not blocking. Harvey's investigator cares more about whether the content was correct than whether the tool needed approval.
Decagon: Unreviewed state mutation
Agent modified external account state without approval. Weights: tool_safety 0.90, cost 0.80.
SEVERITY: ERROR
Blocked. The agent issued an account change without human review. For Decagon, this is the exact failure mode that costs real money.
Same finding. Same trace event. Completely different response. This is not filtering after the fact. The priority weights influence what the investigator looks for in the first place, what evidence it gathers to support or dismiss a concern, and how it frames the finding for the human reviewer.
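One way to picture the scoring step is the sketch below. Everything in it is an assumption made for illustration: the `Finding` type, the per-axis relevance values, the rule of taking the maximum of relevance times weight, and the `ERROR_AT` and `WARN_AT` cut-offs are all hypothetical, not Galea's actual scoring logic. Only the customer weights come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical raw finding: how strongly it bears on each axis (0 to 1)."""
    title: str
    relevance: dict


# Hypothetical cut-offs, chosen only for this example.
ERROR_AT = 0.85
WARN_AT = 0.60


def severity(finding, weights):
    """Score a finding by its strongest (relevance * customer weight) pairing."""
    score = max(
        (rel * weights.get(axis, 0.0) for axis, rel in finding.relevance.items()),
        default=0.0,
    )
    if score >= ERROR_AT:
        return "ERROR"
    if score >= WARN_AT:
        return "WARN"
    return "INFO"


# Assumed relevance: primarily a tool-safety issue, secondarily an audit-trail one.
mutation = Finding(
    title="Unreviewed state mutation",
    relevance={"tool_safety": 1.0, "audit": 0.5},
)

# Only the axes this finding touches; weights are taken from the table.
harvey = {"tool_safety": 0.50, "audit": 0.90}
decagon = {"tool_safety": 0.90, "audit": 0.30}

print("Harvey:", severity(mutation, harvey))
print("Decagon:", severity(mutation, decagon))
```

Under these assumed numbers the same finding lands as INFO for Harvey and ERROR for Decagon, matching the two cards above. A real investigator would do more than multiply numbers, since the weights also shape which evidence it gathers, but the sketch shows why a single global severity cannot serve both customers.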
This changes everything downstream
Priority weights are not just a scoring mechanism. They propagate through the entire Galea loop:
- Investigation scope. The investigator agent spends more time verifying findings on high-weight axes. For Harvey, it cross-references every citation against retrieved documents. For Decagon, it checks every tool call against the authorized action list.
- Anomaly detection baselines. What counts as "anomalous" changes per customer. A 3x token spike is noise for a legal M&A workflow that routinely processes 200-page contracts. The same spike is a red flag for a support bot that should resolve tickets in under 2,000 tokens.
- Eval recommendations. When Galea recommends converting a finding into a durable eval, the eval type matches the priority. Harvey gets citation-verification evals. Decagon gets tool-authorization evals. Abridge gets PHI-detection evals.
- Audit reports. Signed audit exports emphasize different findings for different customers. A compliance team reviewing Harvey's audit sees correctness and regulatory findings front and center. An ops team reviewing Decagon's audit sees cost efficiency and tool safety.
- Alerting thresholds. A correctness concern at 0.95 weight triggers immediately for Harvey. The same correctness concern at 0.60 weight for Decagon gets batched into a weekly digest.
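Two of these bullets, anomaly baselines and alerting thresholds, reduce to small per-customer decisions that are easy to sketch. The function names, the `immediate_at` cut-off, the baseline token counts, and the tolerated spike ratios below are all hypothetical values picked for illustration. Only the correctness weights (0.95 and 0.60) and the 2,000-token support budget come from the text.

```python
def is_token_anomaly(tokens_used, baseline_tokens, tolerated_ratio):
    """Flag a run only if it exceeds the customer's own baseline by more
    than the spike ratio that customer is assumed to tolerate."""
    return tokens_used > baseline_tokens * tolerated_ratio


def route_alert(axis, weights, immediate_at=0.90):
    """Page immediately for high-weight axes; batch everything else.
    The 0.90 cut-off is an assumption, not a documented Galea default."""
    return "immediate" if weights[axis] >= immediate_at else "weekly_digest"


# Hypothetical baselines: long contracts for legal work, short tickets for support.
harvey_baseline, harvey_ratio = 60_000, 5.0
decagon_baseline, decagon_ratio = 2_000, 1.5

# The same 3x spike, judged against each customer's baseline.
print(is_token_anomaly(3 * harvey_baseline, harvey_baseline, harvey_ratio))
print(is_token_anomaly(3 * decagon_baseline, decagon_baseline, decagon_ratio))

# The same correctness concern, routed by each customer's weight.
print(route_alert("correctness", {"correctness": 0.95}))  # Harvey
print(route_alert("correctness", {"correctness": 0.60}))  # Decagon
```

With these assumed parameters, the 3x spike is ignored for the legal workflow and flagged for the support bot, and the correctness concern pages Harvey immediately while going into Decagon's weekly digest.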
Why this can't be bolted on
You might think: just add priority filters to an existing dashboard. Let customers toggle which metrics they see. Problem solved.
It is not that simple. Filtering changes what you display. Priorities change what you investigate. The distinction matters because investigation is not a query over static data. It is an active process where an agent walks a trace, applies judgment, and produces narrative findings. If the agent does not know that Harvey cares about citation validity at 0.95, it will not spend the extra cycles verifying that every cited document actually exists in the retrieved context.
Observability tools that bolt on "custom views" give you a different lens on the same data. Galea gives you different data, because the investigation itself is different.
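The difference between filtering and investigating can be shown in a few lines. In this sketch, `filter_view` stands in for a bolted-on custom view and `plan_investigation` stands in for a priority-aware investigator. The check names, the `CHECKS` mapping, and the `deep_check_at` threshold are hypothetical. The weights are from the table above.

```python
# Hypothetical mapping from an axis to the expensive check it unlocks.
CHECKS = {
    "correctness": "verify_citations_against_retrieved_docs",
    "tool_safety": "check_tool_calls_against_authorized_actions",
    "privacy_phi": "scan_outputs_for_phi",
}


def filter_view(findings, visible_axes):
    """Bolt-on approach: choose which existing findings to display."""
    return [f for f in findings if f["axis"] in visible_axes]


def plan_investigation(weights, deep_check_at=0.85):
    """Priority-aware approach: decide which checks run at all.
    The 0.85 threshold is an assumption for this example."""
    return [
        check for axis, check in CHECKS.items()
        if weights.get(axis, 0.0) >= deep_check_at
    ]


harvey = {"correctness": 0.95, "tool_safety": 0.50, "privacy_phi": 0.10}
decagon = {"correctness": 0.60, "tool_safety": 0.90, "privacy_phi": 0.10}
abridge = {"correctness": 0.95, "tool_safety": 0.50, "privacy_phi": 0.95}

# A generic pipeline that never ran citation checks has nothing to filter.
generic_findings = [{"axis": "latency", "detail": "p95 above target"}]
print(filter_view(generic_findings, {"correctness"}))

for name, weights in (("Harvey", harvey), ("Decagon", decagon), ("Abridge", abridge)):
    print(name, plan_investigation(weights))
```

Filtering the generic findings for correctness returns an empty list, because no citation check was ever run. The planner, by contrast, schedules citation verification for Harvey, tool-authorization checks for Decagon, and both citation and PHI checks for Abridge.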
The product insight
Agent workflows are becoming business workflows. And businesses are not generic. A legal AI company and a customer support automation company have almost no overlap in what "risk" means, even though both run on the same underlying technology: LLMs calling tools.
The observability layer that wins this market will not be the one with the best span viewer or the prettiest latency chart. It will be the one that understands each customer's priorities deeply enough to tell them, after every run, whether the thing they actually care about went well or went wrong.
That is what we are building at Galea. Not a dashboard that shows everything to everyone. An investigation layer that knows what matters to you.
Galea is in private design partnership. If you are deploying agent workflows and your failure modes are specific to your business, we want to talk.