Evaluations help you systematically test your agent. Create test cases with expected outcomes, run them against your agent, and track quality over time.

Why Evaluate?

Manual testing in the Playground is useful for exploration, but evaluations provide:
  • Consistency - Same tests run the same way every time
  • Regression detection - Know immediately if changes break something
  • Quality metrics - Track scores over time
  • Documentation - Test cases describe expected behavior

The Evaluate Page

Open any agent and go to the Evaluate section. You’ll see:
  • Test Cases - Your collection of input/expected output pairs
  • Run History - Past evaluation runs and their results
  • Current Run - Real-time progress when an evaluation is running

Creating Test Cases

A test case defines:
Field              Description
Input              The user message to send
Expected Output    What the agent should say or do
Criteria           Specific things to check (optional)
Tags               Categories for organizing tests
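
If you keep test cases as data (for example, to import them later), a single case can be sketched as JSON. The keys below simply mirror the fields above and are illustrative, not a guaranteed schema:

{
  "input": "How do I reset my password?",
  "expected_output": "A step-by-step reset guide that links to the self-service portal",
  "tags": ["password-reset"]
}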

From Scratch

  1. Click Create Test Case
  2. Enter the user input
  3. Describe the expected output
  4. Add any specific criteria
  5. Save

From Playground

When you have a good conversation in the Playground:
  1. Note the user message and agent response
  2. Go to Evaluate and create a test case
  3. Use the conversation as your baseline

Bulk Import

To add many test cases at once (an example import file follows these steps):
  1. Click Import
  2. Upload a JSON or CSV file with your test cases
  3. Map the columns to fields
  4. Import
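
As a rough illustration, a JSON upload might look like the snippet below. The exact keys or columns you map in step 3 depend on your own file, so treat these names as assumptions:

[
  {
    "input": "How do I reset my password?",
    "expected_output": "Step-by-step reset guide with a link to the self-service portal",
    "tags": ["password-reset"]
  },
  {
    "input": "Cancel my subscription",
    "expected_output": "Explains the cancellation steps and points to the billing portal",
    "tags": ["billing"]
  }
]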

Writing Good Test Cases

Be specific about expectations

Instead of “should be helpful”, specify:
Expected: Agent should search the knowledge base and provide 
a step-by-step guide for resetting the password. Should include 
the link to the self-service portal.

Cover different scenarios

Build a diverse test suite:
  • Happy paths (common, expected use)
  • Edge cases (unusual inputs)
  • Error handling (what happens when things go wrong)
  • Out of scope (things the agent shouldn’t do)

Use criteria for precision

Criteria let you check specific aspects:
Criterion          What It Checks
contains_link      Response includes a URL
mentions_product   Specific product name appears
polite_tone        Response is professional
under_100_words    Response is concise
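
Attached to a test case, criteria might look like this (the shape is shown for illustration only; the table above lists the kinds of checks you can express):

{
  "input": "What plans do you offer?",
  "expected_output": "Summarizes the available plans and links to the pricing page",
  "criteria": ["contains_link", "mentions_product", "under_100_words"]
}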

Tag for organization

Use tags to group related tests:
  • password-reset - All password-related tests
  • billing - Payment and subscription tests
  • edge-case - Unusual scenarios
  • regression - Tests for fixed bugs

Running Evaluations

Run All Tests

  1. Click Run Evaluation
  2. Wait for all tests to complete
  3. Review results

Run Single Test

To test one case:
  1. Find the test case
  2. Click the play button
  3. See the result inline

Run with Different Models

Compare how different models perform:
  1. Click Run Evaluation
  2. Select a different model from the dropdown
  3. Compare results to previous runs

Understanding Results

After a run completes, you’ll see:

Summary

  • Score - Overall pass rate (0-100%)
  • Passed/Failed - Count of each
  • Duration - How long the run took

Per-Test Results

For each test case:
Field           Description
Status          Pass or Fail
Score           How well it matched expectations (0-100)
Actual Output   What the agent actually said
Feedback        Explanation of the scoring
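
If you read results programmatically (for example, from an exported run), a single per-test record could look roughly like this. The keys are illustrative, not the platform's exact schema:

{
  "test_case": "password-reset-basic",
  "status": "fail",
  "score": 45,
  "actual_output": "You can reset your password from the login page.",
  "feedback": "Missing the link to the self-service portal required by the criteria."
}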

Regression Detection

When you run multiple evaluations, the system detects:
  • Improvements - Tests that now pass
  • Regressions - Tests that used to pass but now fail
  • Score changes - How individual test scores changed

Working with History

Compare Runs

Select two runs to see a side-by-side comparison:
  • Which tests changed status
  • Score differences
  • What’s improved vs regressed

Delete Runs

To remove old evaluation data:
  1. Find the run in history
  2. Click the delete button
  3. Confirm

Export Results

Export evaluation data for reporting:
  1. Select a run
  2. Click Export
  3. Download as JSON or CSV
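
The exact export format depends on the run, but as a rough sketch, a CSV export could carry one row per test case for spreadsheets or BI tools (the column names here are assumptions):

test_case,status,score,duration_ms
password-reset-basic,pass,92,1840
billing-refund-edge-case,fail,41,2310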

Best Practices

  • Always run your full test suite before publishing changes. Catch regressions before users do.
  • When you find and fix a bug, add a test case to prevent it from returning.
  • A failing test might mean the agent is wrong, or it might mean the test expectation needs updating.
  • Overly specific expectations break easily. Focus on what matters, not exact wording.
  • Include tests that verify the agent uses the right tools for the right tasks.

Advanced: Tool Expectations

For agents with tools, you can specify expected tool usage:
{
  "input": "What's the weather in Paris?",
  "expected_tools": [
    { "name": "weather_lookup", "arguments": { "city": "Paris" } }
  ],
  "forbidden_tools": ["calendar"]
}

This checks that the agent:
  • Calls the expected tools with correct arguments
  • Doesn’t call forbidden tools

Next Steps

Configure settings

Set up retention, sharing, and safety guardrails

View analytics

See how users interact with your published agent