Agent eval is the practice of running an AI agent against test cases to measure whether it does the job correctly — so you can ship with confidence and catch regressions.
Agent eval — short for evaluation — is the practice of testing your agent the way you'd test any other piece of software: against a set of inputs you know the right answers to. You measure how often the agent gets it right, where it gets it wrong, and whether changes you make help or hurt.
An eval has three parts. A dataset of test cases (real or representative inputs you've curated). A scoring function (how you decide whether the agent's output was correct). And a harness that runs the agent against every case and reports the results. For some kinds of output, scoring is easy — did the agent book the right meeting? For others, scoring is harder — was the email reply good? — and you often use another LLM to judge.
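To make those three parts concrete, here is a minimal sketch in Python. Everything in it is illustrative: the agent is just a callable, the dataset is three toy cases, and the exact-match scorer is the simplest possible scoring function, not what you would use for free-form text.

```python
# Dataset: each test case pairs an input with the answer we expect.
dataset = [
    {"input": "Can we meet Tuesday at 3pm?", "expected": "book"},
    {"input": "Just browsing, thanks", "expected": "discard"},
    {"input": "Send me pricing first", "expected": "nurture"},
]

def score(output: str, expected: str) -> bool:
    """Scoring function: exact match works when outputs are simple labels."""
    return output.strip().lower() == expected

def run_eval(agent, dataset):
    """Harness: run the agent on every case and report the pass rate."""
    passed, failures = 0, []
    for case in dataset:
        output = agent(case["input"])  # the agent under test is just a callable here
        if score(output, case["expected"]):
            passed += 1
        else:
            failures.append((case["input"], output, case["expected"]))
    return passed / len(dataset), failures
```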
Without eval, shipping changes to an agent is guesswork. You think the latest tweak made it better, but you have no way to prove it didn't quietly make it worse on something you weren't watching. With eval, you have a number that goes up or down — and a regression alarm when it drops.
A lead-qualifier agent has 50 sample conversations from last quarter, each labelled with whether the lead was a real fit (yes / no) and what should have happened (book / nurture / discard). When you change the agent's qualifying logic, you re-run the eval. If it correctly classified 47/50 before and 49/50 after, you ship. If it dropped to 42/50, you don't.
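Sketched in code, that ship-or-hold decision could look like the snippet below. The names current_agent, updated_agent, and lead_dataset are placeholders, and run_eval is the illustrative harness from the earlier sketch, not a real Squidgy API.

```python
# Hypothetical ship gate: same 50 labelled conversations, two versions of the agent.
baseline_rate, _ = run_eval(current_agent, lead_dataset)        # e.g. 47/50 = 94%
candidate_rate, failures = run_eval(updated_agent, lead_dataset)

if candidate_rate >= baseline_rate:
    print(f"Ship: {candidate_rate:.0%} (was {baseline_rate:.0%})")
else:
    print(f"Hold: pass rate dropped to {candidate_rate:.0%}")
    for case_input, got, expected in failures:
        print(f"  {case_input!r}: got {got!r}, expected {expected!r}")
```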
Eval is what separates production agents from demo agents. Demo agents look great in cherry-picked scenarios. Production agents prove they work across the messy variety of real inputs. The only way to know which one you have is to measure.
For non-technical operators, the practical implication is that you should pick a platform with eval baked in. Building your own eval harness is a significant engineering project. A platform that ships with eval lets you focus on improving the agent itself.
The risk in not running eval is silent regression. You make a change to fix one bug and break two others. By the time customers notice, you've lost trust. Eval prevents that, but only if you actually use it — having it and not running it is the worst of both worlds.
Every Squidgy agent ships with an eval suite. You can build your test cases by hand, import them from real conversation history, or have Ace generate them from your agent description. Every change to the agent triggers the suite — you see results before you ship, not after.
For marketplace listings, eval pass rates are a transparency requirement. Buyers see the rate before they install. Sellers see what's failing and can improve. Quality compounds because the platform makes quality measurable.
Eval and software testing are closely related. Software testing usually checks that code does what you specified; eval checks that an AI agent's behaviour is good across realistic inputs, which is harder because the right answer often isn't a fixed string. Eval techniques include LLM-as-judge, rubric scoring, and human review.
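For outputs with no single right answer, an LLM-as-judge scorer can stand in for exact matching. The sketch below is illustrative: call_llm is a placeholder for whichever model API you use, and the rubric is an example, not a recommendation.

```python
JUDGE_PROMPT = """You are grading an email reply written by an AI agent.
Rubric: it must answer the customer's question, keep a polite tone,
and not promise anything the company does not offer.
Reply with exactly PASS or FAIL.

Customer message: {case_input}
Agent reply: {agent_output}"""

def llm_judge(case_input: str, agent_output: str) -> bool:
    """Use a second model to grade outputs that have no single right answer."""
    verdict = call_llm(JUDGE_PROMPT.format(case_input=case_input, agent_output=agent_output))
    return verdict.strip().upper().startswith("PASS")
```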
How many test cases you need depends on the variation your real users will throw at the agent. For a narrow agent, 30–50 well-chosen cases is often enough to start; for broader agents, hundreds. The trick is variety: many similar cases only give you false confidence.
On Squidgy you can build an eval without writing code. You describe what good looks like in plain English (with optional rules) and the platform builds the eval from that. Power users can also import test sets from CSVs or past conversations.
How high a pass rate needs to be depends on the cost of being wrong. For an agent that drafts emails for human approval, 80% is fine; humans catch the rest. For an agent that books appointments unattended, you want 95%+ before going live, and tight escalation when it's unsure.
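One way to encode that judgement is a per-agent threshold the harness checks before anything goes live. The numbers below mirror the ones in this answer and are illustrative starting points, not rules.

```python
# Minimum pass rates, chosen by how costly a wrong answer is.
THRESHOLDS = {
    "email_drafter": 0.80,       # every draft gets human review
    "appointment_booker": 0.95,  # runs unattended, mistakes reach customers
}

def ready_to_ship(agent_name: str, pass_rate: float) -> bool:
    """Block deployment when the eval pass rate is below the agent's bar."""
    return pass_rate >= THRESHOLDS[agent_name]
```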
Glossary
What is an AI agent?
An AI agent is software that takes a goal, decides what steps to take, uses tools to do them, and carries the work out with little or no human prodding between steps.
What is an agent builder?
An agent builder is a tool for creating AI agents — defining what they do, what tools they can use, and how they decide — without writing all the code yourself.
What is an agent marketplace?
An AI agent marketplace is a catalog where you can buy, install, or hire pre-built AI agents — like an app store, but for agents that work for you.
What is a multi-agent system?
A multi-agent system uses several specialised AI agents that each handle part of a task, coordinated by an orchestrator — like a team rather than a soloist.