Evaluations help you systematically test your agent. Create test cases with expected outcomes, run them against your agent, and track quality over time.
Why Evaluate?
Manual testing in the Playground is useful for exploration, but evaluations provide:
- Consistency - Same tests run the same way every time
- Regression detection - Know immediately if changes break something
- Quality metrics - Track scores over time
- Documentation - Test cases describe expected behavior
The Evaluate Page
Open any agent and go to the Evaluate section. You’ll see:
- Test Cases - Your collection of input/expected output pairs
- Run History - Past evaluation runs and their results
- Current Run - Real-time progress when an evaluation is running
Creating Test Cases
A test case defines:

| Field | Description |
|---|---|
| Input | The user message to send |
| Expected Output | What the agent should say or do |
| Criteria | Specific things to check (optional) |
| Tags | Categories for organizing tests |
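
Expressed as a JSON record, a test case might look roughly like the sketch below. The field names (input, expectedOutput, criteria, tags) are illustrative assumptions rather than a documented schema; the values reuse criteria and tags described later on this page.

```json
{
  "input": "How do I reset my password?",
  "expectedOutput": "Explains the reset steps and includes a link to the reset page",
  "criteria": ["contains_link", "polite_tone"],
  "tags": ["password-reset"]
}
```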
From Scratch
- Click Create Test Case
- Enter the user input
- Describe the expected output
- Add any specific criteria
- Save
From Playground
When you have a good conversation in the Playground:
- Note the user message and agent response
- Go to Evaluate and create a test case
- Use the conversation as your baseline
Bulk Import
To add many test cases at once:
- Click Import
- Upload a JSON or CSV file with your test cases (a sample file is sketched below)
- Map the columns to fields
- Import
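
As a rough illustration, a JSON import file could be an array of such records. The field names here are assumptions; the Map the columns to fields step is where you reconcile them with the evaluation's own fields.

```json
[
  {
    "input": "How do I reset my password?",
    "expectedOutput": "Step-by-step reset instructions with a link",
    "tags": ["password-reset"]
  },
  {
    "input": "How do I cancel my subscription?",
    "expectedOutput": "Explains the cancellation steps",
    "tags": ["billing"]
  }
]
```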
Writing Good Test Cases
Be specific about expectations
Instead of “should be helpful”, spell out what a good response looks like: the information it must contain, any link it should include, and the tone it should take.

Cover different scenarios
Build a diverse test suite:
- Happy paths (common, expected use)
- Edge cases (unusual inputs)
- Error handling (what happens when things go wrong)
- Out of scope (things the agent shouldn’t do)
Use criteria for precision
Criteria let you check specific aspects:

| Criterion | What It Checks |
|---|---|
| contains_link | Response includes a URL |
| mentions_product | Specific product name appears |
| polite_tone | Response is professional |
| under_100_words | Response is concise |
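
For example, a test that cares about linking and brevity might attach two of the criteria above. This is a hedged sketch; the exact way criteria are attached to a test case may differ.

```json
{
  "input": "Where do I download my invoice?",
  "expectedOutput": "Points to the billing page with a direct link",
  "criteria": ["contains_link", "under_100_words"],
  "tags": ["billing"]
}
```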
Tag for organization
Use tags to group related tests:
- password-reset - All password-related tests
- billing - Payment and subscription tests
- edge-case - Unusual scenarios
- regression - Tests for fixed bugs
Running Evaluations
Run All Tests
- Click Run Evaluation
- Wait for all tests to complete
- Review results
Run Single Test
To test one case:
- Find the test case
- Click the play button
- See the result inline
Run with Different Models
Compare how different models perform:
- Click Run Evaluation
- Select a different model from the dropdown
- Compare results to previous runs
Understanding Results
After a run completes, you’ll see:

Summary
- Score - Overall pass rate (0-100%)
- Passed/Failed - Count of each
- Duration - How long the run took
Per-Test Results
For each test case:

| Field | Description |
|---|---|
| Status | Pass or Fail |
| Score | How well it matched expectations (0-100) |
| Actual Output | What the agent actually said |
| Feedback | Explanation of scoring |
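
Put together, a single test result might look like this hedged JSON sketch; the field names are illustrative and simply mirror the table above.

```json
{
  "status": "fail",
  "score": 60,
  "actualOutput": "You can change your password from the settings page.",
  "feedback": "The response describes the right page but does not include the expected reset link."
}
```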
Regression Detection
When you run multiple evaluations, the system detects:
- Improvements - Tests that now pass
- Regressions - Tests that used to pass but now fail
- Score changes - How individual test scores changed
Working with History
Compare Runs
Select two runs to see a side-by-side comparison:
- Which tests changed status
- Score differences
- What’s improved vs regressed
Delete Runs
To remove old evaluation data:
- Find the run in history
- Click the delete button
- Confirm
Export Results
Export evaluation data for reporting:
- Select a run
- Click Export
- Download as JSON or CSV
Best Practices
Run before publishing
Always run your full test suite before publishing changes. Catch regressions before users do.
Add tests for bugs
When you find and fix a bug, add a test case to prevent it from returning.
Review failures carefully
A failing test might mean the agent is wrong, or it might mean the test expectation needs updating.
Keep tests maintainable
Overly specific expectations break easily. Focus on what matters, not exact wording.
Test tool usage
Include tests that verify the agent uses the right tools for the right tasks.
Advanced: Tool Expectations
For agents with tools, you can specify expected tool usage (see the sketch after this list). The evaluation then checks that the agent:
- Calls the expected tools with correct arguments
- Doesn’t call forbidden tools
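
A tool expectation might be written along these lines. The field names, the get_order_status tool, and its arguments are purely hypothetical, shown only to illustrate the idea of expected and forbidden tools.

```json
{
  "input": "What is the status of order 1042?",
  "expectedTools": [
    { "name": "get_order_status", "arguments": { "orderId": "1042" } }
  ],
  "forbiddenTools": ["cancel_order"]
}
```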
Next Steps
Configure settings
Set up retention, sharing, and safety guardrails
View analytics
See how users interact with your published agent