Describe any AI agent and get 15 test cases — happy path, edge cases, failure modes, and adversarial — each with pass criteria and fail indicators. Test before you launch.
Normal, representative uses. If your agent fails these, it's not ready. Varied inputs — not five rewrites of the same case.
Valid but unusual inputs: ambiguous phrasing, very short or very long messages, multi-part requests, unexpected-but-legitimate asks.
Out-of-scope requests, inputs that conflict with the agent's purpose, and empty or nonsense input. The agent should decline gracefully.
Prompt injection, jailbreak attempts, and authority claims. Common attack patterns that every public-facing agent should be tested against.
Describe what your agent does, what good outputs look like, and any known failure modes. The more specific, the better.
The 15 test cases span 4 categories: happy path, edge cases, failure modes, and adversarial. Filter by category to focus.
Send each input to your agent and score the response against the pass criteria. No special tooling needed.
Use the results to fix gaps before launch, then build with Ace once you've validated the design.
15 test cases split across 4 categories: 5 happy-path cases (normal use), 4 edge cases (valid but unusual inputs), 3 failure-mode cases (out-of-scope requests), and 3 adversarial cases (prompt injection, jailbreak attempts). Each case includes the exact input, expected behaviour, pass criteria, and fail indicators.
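As a rough illustration of that structure, here is a minimal Python sketch of how one case and the 5/4/3/3 split could be represented once you copy the suite into your own code. The class name `AgentTestCase`, the category keys, and both helper functions are assumptions made for this example; they are not the tool's export format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

# Hypothetical category keys; the split mirrors the 5/4/3/3 breakdown above.
EXPECTED_COUNTS = {
    "happy_path": 5,
    "edge_case": 4,
    "failure_mode": 3,
    "adversarial": 3,
}


@dataclass
class AgentTestCase:
    """One generated case: the input plus how to judge the response."""
    category: str
    input: str
    expected_behaviour: str
    pass_criteria: List[str]
    fail_indicators: List[str]


def validate_suite(cases: List[AgentTestCase]) -> bool:
    """Return True if the suite has exactly the expected count per category."""
    counts = Counter(case.category for case in cases)
    return dict(counts) == EXPECTED_COUNTS


def filter_by_category(cases: List[AgentTestCase], category: str) -> List[AgentTestCase]:
    """Return only the cases in one category, e.g. to focus on adversarial inputs."""
    return [case for case in cases if case.category == category]
```

Keeping pass criteria and fail indicators as separate lists makes it easy to tick them off one by one when reviewing a response.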
Copy each input, send it to your agent, then compare the response against the pass criteria and fail indicators. No special tooling needed: you can run the cases manually and record the results in a spreadsheet, or paste them into an automated eval framework.
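If you would rather script the manual loop, the sketch below shows one way to do it in Python. It assumes each case is a plain dict with `category`, `input`, `pass_criteria`, and `fail_indicators` keys, and that `call_agent` is a function you supply that sends a string to your agent and returns its reply. All of these names are hypothetical. The sketch does not score anything automatically; it leaves a blank `verdict` column for a human reviewer.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List

FIELDS = ["category", "input", "pass_criteria", "fail_indicators", "response", "verdict"]


def run_suite(cases: Iterable[Dict[str, str]], call_agent: Callable[[str], str]) -> List[Dict[str, str]]:
    """Send every input to the agent and collect one row per case."""
    rows = []
    for case in cases:
        response = call_agent(case["input"])
        rows.append({
            "category": case["category"],
            "input": case["input"],
            "pass_criteria": case["pass_criteria"],
            "fail_indicators": case["fail_indicators"],
            "response": response,
            "verdict": "",  # filled in by a reviewer: pass or fail
        })
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Render the rows as CSV text that opens in any spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A typical use would be `to_csv(run_suite(cases, call_agent))` written to a file, then opened in a spreadsheet so each response sits next to its pass criteria and fail indicators.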
The cases are generated from your description. The more specific you are about what the agent does, its constraints, and known failure modes, the more targeted the test cases will be. A vague description produces generic cases; a specific one produces cases that expose your actual risks.
Every agent deployed publicly will eventually receive prompt injection attempts and jailbreak requests. Testing for these before launch, not after, is the difference between an embarrassing incident and a non-event. The adversarial cases cover common attack patterns: prompt injection, jailbreak attempts, and authority claims.
Yes — click 'Generate a suite for a different agent' after seeing your results. You can refine your description to generate more targeted cases, up to 5 times per hour for free.
No — 15 cases are a starting point, not a complete regression suite. The coverage notes field in your results tells you what categories this suite doesn't cover. As the agent evolves, add cases for new capabilities and discovered failure modes.