How many golden traces do you need before a harness is useful?

Twenty well-curated cases covering your highest-stakes decision types will tell you more than 500 randomly sampled ones. The quality of curation matters more than the count. A case belongs in the golden set if a wrong answer has a real operational or financial cost and if a domain expert has reviewed and confirmed what a correct answer looks like. Start there and grow the set as the agent encounters new failure modes in production.

Who should own the evaluation harness in an enterprise AI team?

The engineering team owns the infrastructure, the runner, and the assertion framework. The domain experts own the case selection and the plain-language definitions of correct behavior. Neither group can do the other's job well. The failure mode I see most often is engineers writing all the assertions without domain input, which produces a harness that catches engineering errors but misses the judgment calls that actually matter to the business.

Can you evaluate an agent that produces open-ended outputs like summaries or recommendations?

Yes, but you have to be honest about what you are measuring. For open-ended outputs, behavioral assertions are more useful than exact-match checks. You are asking whether the agent cited the right sources, stayed within its sanctioned scope, flagged uncertainty when it should have, and avoided categories of output that domain experts have defined as problematic. You can also use a separate model as a judge on specific dimensions, but that introduces its own reliability questions and should be validated against human review before you rely on it.
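One way to validate a model judge against human review is to measure agreement on a labeled sample before you rely on it. The sketch below computes Cohen's kappa, which corrects raw agreement for chance. It assumes categorical labels such as pass and fail; the function name is illustrative, and the agreement level you accept is a call for your team, not something this snippet decides.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between two label lists, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Share of cases where the judge and the human reviewer gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Raw agreement alone can flatter a judge when most cases pass, which is why the chance correction is worth the extra lines.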

What is the right cadence for running the evaluation harness?

Regression against golden traces should run on every meaningful change: every prompt edit, every retrieval configuration change, every model version update. That is the gate. Separately, a drift monitor on live production traffic should run continuously and alert when the distribution of incoming queries is moving outside the range your golden traces cover. The weekly eval report that goes to the operations team is a summary of both, not a replacement for either.

Evals Are Why Anyone Trusts Your Agent in Production, by Alexey Shurov

Replay regression, not gut feel, is what separates an agent a team will actually use from one they quietly route around.

The Agent Nobody Uses

The most dangerous outcome in enterprise agent work is not a spectacular failure. It is a quiet workaround. Someone on the floor, or in the back office, decides the agent is unpredictable and starts doing the task by hand again. They do not file a ticket. They just stop using it. Six months later you have an agent running in production that nobody trusts, a vendor relationship that looks fine on paper, and a team that has learned to smile and nod in demos.

I have seen this in distribution operations, in financial reconciliation, in field service dispatch. The pattern is always the same. The agent worked well enough in the pilot. Then something changed (a data format, a policy, a seasonal edge case) and the agent started producing outputs that felt wrong even when they were technically within spec. The team had no way to know if the agent had gotten worse or if they were just more anxious. And because nobody had built an evaluation harness before go-live, there was no way to answer that question with evidence.

Evals are the thing that makes the difference. Not the model, not the prompt, not the architecture. The evaluation harness is what lets you say, with a straight face, that the agent can be trusted.

What a Real Evaluation Harness Actually Contains

When I say evaluation harness I do not mean a spreadsheet of test cases someone wrote in a sprint and never touched again. I mean a system that runs continuously, that is version-controlled, and that produces a number you would be willing to show a skeptical operations director.

A production harness has four layers. First, a curated set of golden traces. These are real past interactions, selected because they represent decisions that mattered. In a manufacturing context that might be a work order routing decision that a senior planner reviewed and signed off on. In financial operations it might be a flagged transaction that a compliance analyst adjudicated. The key word is curated. You are not capturing everything. You are capturing the cases where a wrong answer had a real cost.

Second, a set of behavioral assertions. These check not just whether the agent produced the right output, but whether it reasoned in a way that is auditable. Did it cite the right source document? Did it stay within its sanctioned scope? Did it escalate when it should have escalated? These assertions are written by the people who understand the domain, not by the engineers who built the agent.
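A minimal sketch of how the first two layers might look in code, under stated assumptions: the field names (required_sources, must_escalate, reviewed_by) and the two assertion functions are hypothetical, chosen to mirror the questions above rather than taken from any specific framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentOutput:
    answer: str
    cited_sources: list[str] = field(default_factory=list)
    escalated: bool = False


@dataclass
class GoldenTrace:
    trace_id: str
    query: str
    required_sources: set[str]  # documents a correct answer must cite
    must_escalate: bool         # whether the expert expected an escalation
    reviewed_by: str            # domain expert who signed off on the case


# A behavioral assertion takes the curated case and the agent's output.
Assertion = Callable[[GoldenTrace, AgentOutput], bool]


def cites_required_sources(trace: GoldenTrace, output: AgentOutput) -> bool:
    return trace.required_sources.issubset(output.cited_sources)


def escalates_when_required(trace: GoldenTrace, output: AgentOutput) -> bool:
    return output.escalated or not trace.must_escalate


def failed_assertions(
    trace: GoldenTrace, output: AgentOutput, assertions: list[Assertion]
) -> list[str]:
    """Names of the assertions this output failed, for an auditable record."""
    return [a.__name__ for a in assertions if not a(trace, output)]
```

Returning the names of failed assertions, rather than a single boolean, keeps the result readable for the domain expert who wrote the expectation.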

Third, a regression runner that executes on every meaningful change. Every prompt edit, every retrieval config change, every model version bump triggers a full run against the golden traces. If the pass rate drops by more than a defined threshold, the change does not go forward. This is not optional. It is the gate.
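The gate itself can be very small. This sketch assumes each golden trace has already been scored as pass or fail; the function names and the default max_drop of 0.02 are placeholders, since the right threshold depends on your risk tolerance.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of golden traces that passed all of their assertions."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


def change_may_proceed(
    baseline_rate: float, results: list[bool], max_drop: float = 0.02
) -> bool:
    """Block the change if the pass rate fell by more than max_drop."""
    return baseline_rate - pass_rate(results) <= max_drop
```

In a CI pipeline, a False return would typically fail the build, so the change cannot merge without someone looking at the failing traces.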

Fourth, a drift monitor for production traffic. Because the world changes even when you do not. A field operations agent that handled equipment fault codes correctly in March may start seeing fault code formats that did not exist in March. The harness needs to flag when live traffic is diverging from the distribution your golden traces cover.
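A rough sketch of one way to quantify that divergence, assuming each query can be tagged with a category label such as a fault code format. The function names are illustrative, and the alert thresholds are left to the team. It reports two numbers: the share of live traffic in categories the golden set never saw, and the total variation distance between the two category distributions.

```python
from collections import Counter


def category_shares(categories: list[str]) -> dict[str, float]:
    """Share of traffic per category; requires a non-empty list."""
    if not categories:
        raise ValueError("need at least one category label")
    counts = Counter(categories)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def uncovered_share(golden: list[str], live: list[str]) -> float:
    """Fraction of live queries whose category has no golden trace at all."""
    if not live:
        raise ValueError("need at least one live query")
    covered = set(golden)
    return sum(category not in covered for category in live) / len(live)


def total_variation_distance(golden: list[str], live: list[str]) -> float:
    """0.0 means identical category mixes; 1.0 means no overlap."""
    g = category_shares(golden)
    v = category_shares(live)
    return 0.5 * sum(abs(g.get(c, 0.0) - v.get(c, 0.0)) for c in set(g) | set(v))
```

The uncovered share answers the March question directly: it rises the moment formats appear that the golden traces do not cover.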

Why Replay Regression Beats Vibes Every Time

The alternative to replay regression is what I call vibes-based QA. Someone senior uses the agent for a while, forms an impression, and either blesses it or does not. This is how most enterprise AI pilots get evaluated. It is also why most of them fail to scale.

Vibes-based QA has two fatal problems. The first is that it is not reproducible. If your senior analyst thought the agent was performing well in October and a junior analyst thinks it is performing poorly in February, you have no way to know who is right. You cannot go back and run October's traffic against the current agent. You have no baseline.

The second problem is that it selects for the wrong cases. Humans are good at noticing dramatic failures and bad at noticing systematic drift. An agent that is wrong 30 percent of the time on a specific subcategory of queries will feel fine to a casual user who mostly sends queries from other subcategories. The harness catches it. The vibe does not.

In a finance sector engagement I worked on, the agent was handling a document extraction task. Vibes-based review from the team was positive. When we ran replay regression against a curated set of 400 historical cases, we found that accuracy on one document subtype, a specific amendment format, had dropped from 91 percent to 67 percent after a retrieval config change three weeks earlier. Nobody had noticed because that subtype was only about 12 percent of volume. The harness caught it on the first regression run after we built it. That is the difference between a system you can defend and a system you are hoping holds together.
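The arithmetic shows why nobody felt it. If accuracy on the other subtypes held steady (an assumption, not something measured above), a 24-point drop on 12 percent of volume moves the aggregate by only about 2.9 points. A small sketch, with illustrative function names, of the per-subtype breakdown that surfaces this:

```python
from collections import defaultdict


def accuracy_by_subtype(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Accuracy per document subtype from (subtype, passed) pairs."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for subtype, ok in results:
        total[subtype] += 1
        passed[subtype] += int(ok)
    return {subtype: passed[subtype] / total[subtype] for subtype in total}


def aggregate_drop(volume_share: float, before: float, after: float) -> float:
    """Change in overall accuracy if only this subtype's accuracy moved."""
    return volume_share * (before - after)
```

Calling aggregate_drop(0.12, 0.91, 0.67) gives roughly 0.029, small enough to hide in an overall number and large enough to matter for anyone handling that amendment format.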

How Evals Change Team Willingness to Adopt

This is the part that does not get written about enough. Evals are not just an engineering concern. They are a trust artifact for the people who have to stake their professional judgment on what the agent produces.

A compliance analyst in a financial institution is not going to hand off a judgment call to an agent because you told them the model is good. They are going to hand it off when they can see, in a format they understand, what the agent's track record looks like on cases similar to the one in front of them. When they know that the system has been tested against real historical decisions and that there is a defined process for catching regressions, their posture changes. They stop treating the agent as a black box they are responsible for supervising and start treating it as a tool they can use with appropriate confidence.

I have watched this shift happen in manufacturing operations. A planning team that was routing around an agent for three months started using it consistently within six weeks of us standing up a proper harness and publishing a weekly eval report they could read. The report was not technical. It showed pass rates by task category, flagged any cases where the agent had been overridden by a human and why, and noted what had changed since the previous week. That transparency was worth more than any model improvement we made during the same period.
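A report like that does not need tooling beyond plain text. The sketch below is a guess at a workable shape, not the report we actually shipped; the class names, field names, and layout are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class CategoryResult:
    category: str
    passed: int
    total: int


@dataclass
class Override:
    case_id: str
    reason: str  # why the human overrode the agent, in their own words


def render_weekly_report(
    results: list[CategoryResult], overrides: list[Override], changes: list[str]
) -> str:
    """Plain-text summary: pass rates, human overrides, and what changed."""
    lines = ["Pass rates by task category"]
    for r in results:
        rate = r.passed / r.total if r.total else 0.0
        lines.append(f"  {r.category}: {r.passed}/{r.total} ({rate:.0%})")
    lines.append("Human overrides")
    lines.extend(f"  {o.case_id}: {o.reason}" for o in overrides)
    if not overrides:
        lines.append("  none")
    lines.append("Changes since last week")
    lines.extend(f"  {c}" for c in changes)
    if not changes:
        lines.append("  none")
    return "\n".join(lines)
```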

The practical implication is that evals are a communication tool as much as a quality tool. When you can show an operations team a number that is stable over time, and show them the process that keeps it stable, you have given them a reason to commit. Without that, you are asking them to trust something they cannot see.

What Goes Wrong When You Skip This

Skipping the eval harness is a rational short-term decision and a costly long-term one. The short-term logic is real. Building a good harness takes time. Curating golden traces requires domain expert involvement. Writing behavioral assertions requires someone who understands both the domain and the agent's failure modes. None of this is fast.

But the cost of skipping it compounds. Every time you make a change to the agent without regression coverage, you are accumulating technical debt in the form of unknown behavioral drift. Every time you ship a prompt change that felt like an improvement in informal testing, you are gambling on whether it degraded performance on cases you did not think to check.

In distribution operations I have seen this play out over a nine-month period. An agent that started with strong performance on order exception handling gradually drifted as the team iterated on prompts to handle new edge cases. By month nine the agent was handling new edge cases adequately but had regressed significantly on the original core cases. Nobody knew because there was no baseline. The team thought they were improving the agent. They were actually trading known reliability for unknown reliability.

The rebuild cost, including the time to curate golden traces retroactively and re-establish a baseline, was substantially higher than building the harness would have been at the start. This is the standard trajectory when evals get treated as something you add later.

Building the Harness Without Slowing Down the Build

The objection I hear most often is that building a proper eval harness slows down development. This is true in the same way that writing tests slows down software development. It is true in the short run and false in any timeframe that matters.

The way to make it practical is to start small and make it mandatory from the first week. You do not need 400 golden traces to start. You need 20 cases that represent the most consequential decisions the agent will make. You need one or two behavioral assertions per case. You need a runner that executes on every meaningful change. That is achievable in a week of focused work.

The harness grows with the agent. Every time the agent fails in production in a way that matters, that case goes into the golden set. Every time a domain expert reviews an output and has a strong opinion about it, that case is a candidate for the set. Over time you build coverage that reflects the actual risk surface of the system, not a hypothetical one.

The other thing that helps is separating the harness infrastructure from the case curation. Engineers can build the runner and the assertion framework. Domain experts curate the cases and write the assertions in plain language. The translation between plain language assertions and executable checks is a small amount of engineering work that pays back immediately in the form of assertions that actually reflect what the domain expert cares about.
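One possible shape for that translation layer is a registry that keeps the expert's sentence attached to the check that implements it. Everything here is hypothetical: the decorator name, the example sentences, and the dictionary keys on the agent output.

```python
from typing import Callable

Check = Callable[[dict], bool]

# Maps the domain expert's plain-language sentence to its executable check.
REGISTRY: dict[str, Check] = {}


def check(plain_language: str) -> Callable[[Check], Check]:
    """Register a function as the implementation of an expert's assertion."""
    def register(fn: Check) -> Check:
        REGISTRY[plain_language] = fn
        return fn
    return register


@check("Every recommendation cites at least one source document.")
def cites_a_source(output: dict) -> bool:
    return len(output.get("cited_sources", [])) > 0


@check("The agent only takes actions within its sanctioned scope.")
def stays_in_scope(output: dict) -> bool:
    return output.get("action") in output.get("sanctioned_actions", [])


def run_checks(output: dict) -> dict[str, bool]:
    """Results keyed by the expert's own wording, so reports stay readable."""
    return {text: fn(output) for text, fn in REGISTRY.items()}
```

Keying results by the plain-language sentence means a failing check reads back to the domain expert in the words they wrote.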

The Short Version for Anyone Who Needs to Act on This

If you are running an AI agent in production and you do not have a regression harness with golden traces and behavioral assertions, you do not actually know if your agent is getting better or worse over time. You have a feeling. Feelings are not defensible when something goes wrong, and they are not persuasive when you are trying to get a skeptical operations team to commit.

Start with the cases that cost the most when the agent gets them wrong. Get a domain expert to review 20 of them and document what a good response looks like and what a bad one looks like. Build a runner that executes those cases on every change. Publish the results somewhere the operations team can see them.

That is the minimum. It is not glamorous. It will not make a good demo slide. But it is the thing that determines whether your agent is still being used in 18 months or whether someone quietly built a workaround and stopped filing tickets about it.
