Agent eval is the practice of running an AI agent against test cases to measure whether it does the job correctly — so you can ship with confidence and catch regressions.
Agent eval — short for evaluation — is the practice of testing your agent the way you'd test any other piece of software: against a set of inputs you know the right answers to. You measure how often the agent gets it right, where it gets it wrong, and whether changes you make help or hurt.
An eval has three parts. A dataset of test cases (real or representative inputs you've curated). A scoring function (how you decide whether the agent's output was correct). And a harness that runs the agent against every case and reports the results. For some kinds of output, scoring is easy — did the agent book the right meeting? For others, scoring is harder — was the email reply good? — and you often use another LLM to judge.
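To make those three parts concrete, here is a minimal sketch in Python. Everything in it is illustrative: the agent is just a callable, the dataset is three toy cases, and the exact-match scorer is the simplest possible scoring function, not what you would use for free-form text.

```python
# Dataset: each test case pairs an input with the answer we expect.
dataset = [
    {"input": "Can we meet Tuesday at 3pm?", "expected": "book"},
    {"input": "Just browsing, thanks", "expected": "discard"},
    {"input": "Send me pricing first", "expected": "nurture"},
]

def score(output: str, expected: str) -> bool:
    """Scoring function: exact match works when outputs are simple labels."""
    return output.strip().lower() == expected

def run_eval(agent, dataset):
    """Harness: run the agent on every case and report the pass rate."""
    passed, failures = 0, []
    for case in dataset:
        output = agent(case["input"])  # the agent under test is just a callable here
        if score(output, case["expected"]):
            passed += 1
        else:
            failures.append((case["input"], output, case["expected"]))
    return passed / len(dataset), failures
```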
Without eval, shipping changes to an agent is guesswork. You think the latest tweak made it better, but you have no way to prove it didn't quietly make it worse on something you weren't watching. With eval, you have a number that goes up or down — and a regression alarm when it drops.
A lead-qualifier agent has 50 sample conversations from last quarter, each labelled with whether the lead was a real fit (yes / no) and what should have happened (book / nurture / discard). When you change the agent's qualifying logic, you re-run the eval. If it correctly classified 47/50 before and 49/50 after, you ship. If it dropped to 42/50, you don't.
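Sketched in code, that ship-or-hold decision could look like the snippet below. The names current_agent, updated_agent, and lead_dataset are placeholders, and run_eval is the illustrative harness from the earlier sketch, not a real Squidgy API.

```python
# Hypothetical ship gate: same 50 labelled conversations, two versions of the agent.
baseline_rate, _ = run_eval(current_agent, lead_dataset)        # e.g. 47/50 = 94%
candidate_rate, failures = run_eval(updated_agent, lead_dataset)

if candidate_rate >= baseline_rate:
    print(f"Ship: {candidate_rate:.0%} (was {baseline_rate:.0%})")
else:
    print(f"Hold: pass rate dropped to {candidate_rate:.0%}")
    for case_input, got, expected in failures:
        print(f"  {case_input!r}: got {got!r}, expected {expected!r}")
```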
Eval is what separates production agents from demo agents. Demo agents look great in cherry-picked scenarios. Production agents prove they work across the messy variety of real inputs. The only way to know which one you have is to measure.
For non-technical operators, the practical implication is that you should pick a platform with eval baked in. Building your own eval harness is a significant engineering project. A platform that ships with eval lets you focus on improving the agent itself.
The risk in not running eval is silent regression. You make a change to fix one bug and break two others. By the time customers notice, you've lost trust. Eval prevents that, but only if you actually use it — having it and not running it is the worst of both worlds.
Every Squidgy agent ships with an eval suite. You can build your test cases by hand, import them from real conversation history, or have Ace generate them from your agent description. Every change to the agent triggers the suite — you see results before you ship, not after.
For marketplace listings, eval pass rates are a transparency requirement. Buyers see the rate before they install. Sellers see what's failing and can improve. Quality compounds because the platform makes quality measurable.
Eval and software testing are closely related. Software testing usually checks that code does what you specified; eval checks that an AI agent's behaviour is good across realistic inputs, which is harder because the right answer often isn't a fixed string. Eval techniques include LLM-as-judge, rubric scoring, and human review.
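For outputs with no single right answer, an LLM-as-judge scorer can stand in for exact matching. The sketch below is illustrative: call_llm is a placeholder for whichever model API you use, and the rubric is an example, not a recommendation.

```python
JUDGE_PROMPT = """You are grading an email reply written by an AI agent.
Rubric: it must answer the customer's question, keep a polite tone,
and not promise anything the company does not offer.
Reply with exactly PASS or FAIL.

Customer message: {case_input}
Agent reply: {agent_output}"""

def llm_judge(case_input: str, agent_output: str) -> bool:
    """Use a second model to grade outputs that have no single right answer."""
    verdict = call_llm(JUDGE_PROMPT.format(case_input=case_input, agent_output=agent_output))
    return verdict.strip().upper().startswith("PASS")
```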
How many test cases you need depends on the variation your real users will throw at the agent. For a narrow agent, 30–50 well-chosen cases is often enough to start; for broader agents, hundreds. The trick is variety: many similar cases only give you false confidence.
On Squidgy you can build an eval without writing code. You describe what good looks like in plain English (with optional rules) and the platform builds the eval from that. Power users can also import test sets from CSVs or past conversations.
How high a pass rate needs to be depends on the cost of being wrong. For an agent that drafts emails for human approval, 80% is fine; humans catch the rest. For an agent that books appointments unattended, you want 95%+ before going live, and tight escalation when it's unsure.
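One way to encode that judgement is a per-agent threshold the harness checks before anything goes live. The numbers below mirror the ones in this answer and are illustrative starting points, not rules.

```python
# Minimum pass rates, chosen by how costly a wrong answer is.
THRESHOLDS = {
    "email_drafter": 0.80,       # every draft gets human review
    "appointment_booker": 0.95,  # runs unattended, mistakes reach customers
}

def ready_to_ship(agent_name: str, pass_rate: float) -> bool:
    """Block deployment when the eval pass rate is below the agent's bar."""
    return pass_rate >= THRESHOLDS[agent_name]
```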
Glossary
What is an AI agent?
An AI agent is software that takes a goal, decides what steps to take, uses tools to do them, and carries the work out with little or no human prodding between steps.
What is an agent builder?
An agent builder is a tool for creating AI agents — defining what they do, what tools they can use, and how they decide — without writing all the code yourself.
What is an agent marketplace?
An AI agent marketplace is a catalog where you can buy, install, or hire pre-built AI agents — like an app store, but for agents that work for you.
What is a multi-agent system?
A multi-agent system uses several specialised AI agents that each handle part of a task, coordinated by an orchestrator — like a team rather than a soloist.