Evaluations help you systematically test your agent. Create test cases with expected outcomes, run them against your agent, and track quality over time.

Why Evaluate?

Manual testing in the Playground is useful for exploration, but evaluations provide:
  • Consistency - Same tests run the same way every time
  • Regression detection - Know immediately if changes break something
  • Quality metrics - Track scores over time
  • Documentation - Test cases describe expected behavior

The Evaluate Page

Open any agent and go to the Evaluate section. You’ll see:
  • Test Cases - Your collection of input/expected output pairs
  • Run History - Past evaluation runs and their results
  • Current Run - Real-time progress when an evaluation is running

Creating Test Cases

A test case defines:
Field              Description
Input              The user message to send
Expected Output    What the agent should say or do
Criteria           Specific things to check (optional)
Tags               Categories for organizing tests
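
If you keep test cases as data (for example, to import them later), a single case can be sketched as JSON. The keys below simply mirror the fields above and are illustrative, not a guaranteed schema:

{
  "input": "How do I reset my password?",
  "expected_output": "A step-by-step reset guide that links to the self-service portal",
  "tags": ["password-reset"]
}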

From Scratch

  1. Click Create Test Case
  2. Enter the user input
  3. Describe the expected output
  4. Add any specific criteria
  5. Save

From Playground

When you have a good conversation in the Playground:
  1. Note the user message and agent response
  2. Go to Evaluate and create a test case
  3. Use the conversation as your baseline

Bulk Import

To add many test cases at once (an example import file follows these steps):
  1. Click Import
  2. Upload a JSON or CSV file with your test cases
  3. Map the columns to fields
  4. Import
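
As a rough illustration, a JSON upload might look like the snippet below. The exact keys or columns you map in step 3 depend on your own file, so treat these names as assumptions:

[
  {
    "input": "How do I reset my password?",
    "expected_output": "Step-by-step reset guide with a link to the self-service portal",
    "tags": ["password-reset"]
  },
  {
    "input": "Cancel my subscription",
    "expected_output": "Explains the cancellation steps and points to the billing portal",
    "tags": ["billing"]
  }
]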

Writing Good Test Cases

Be specific about expectations

Instead of “should be helpful”, specify:
Expected: Agent should search the knowledge base and provide 
a step-by-step guide for resetting the password. Should include 
the link to the self-service portal.

Cover different scenarios

Build a diverse test suite:
  • Happy paths (common, expected use)
  • Edge cases (unusual inputs)
  • Error handling (what happens when things go wrong)
  • Out of scope (things the agent shouldn’t do)

Use criteria for precision

Criteria let you check specific aspects:
Criterion          What It Checks
contains_link      Response includes a URL
mentions_product   Specific product name appears
polite_tone        Response is professional
under_100_words    Response is concise
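
Attached to a test case, criteria might look like this (the shape is shown for illustration only; the table above lists the kinds of checks you can express):

{
  "input": "What plans do you offer?",
  "expected_output": "Summarizes the available plans and links to the pricing page",
  "criteria": ["contains_link", "mentions_product", "under_100_words"]
}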

Tag for organization

Use tags to group related tests:
  • password-reset - All password-related tests
  • billing - Payment and subscription tests
  • edge-case - Unusual scenarios
  • regression - Tests for fixed bugs

Running Evaluations

Run All Tests

  1. Click Run Evaluation
  2. Wait for all tests to complete
  3. Review results

Run Single Test

To test one case:
  1. Find the test case
  2. Click the play button
  3. See the result inline

Run with Different Models

Compare how different models perform:
  1. Click Run Evaluation
  2. Select a different model from the dropdown
  3. Compare results to previous runs

Understanding Results

After a run completes, you’ll see:

Summary

  • Score - Overall pass rate (0-100%)
  • Passed/Failed - Count of each
  • Duration - How long the run took

Per-Test Results

For each test case:
Field           Description
Status          Pass or Fail
Score           How well it matched expectations (0-100)
Actual Output   What the agent actually said
Feedback        Explanation of the scoring
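
If you read results programmatically (for example, from an exported run), a single per-test record could look roughly like this. The keys are illustrative, not the platform's exact schema:

{
  "test_case": "password-reset-basic",
  "status": "fail",
  "score": 45,
  "actual_output": "You can reset your password from the login page.",
  "feedback": "Missing the link to the self-service portal required by the criteria."
}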

Regression Detection

When you run multiple evaluations, the system detects:
  • Improvements - Tests that now pass
  • Regressions - Tests that used to pass but now fail
  • Score changes - How individual test scores changed

Working with History

Compare Runs

Select two runs to see a side-by-side comparison:
  • Which tests changed status
  • Score differences
  • What’s improved vs regressed

Delete Runs

To remove old evaluation data:
  1. Find the run in history
  2. Click the delete button
  3. Confirm

Export Results

Export evaluation data for reporting:
  1. Select a run
  2. Click Export
  3. Download as JSON or CSV
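
The exact export format depends on the run, but as a rough sketch, a CSV export could carry one row per test case for spreadsheets or BI tools (the column names here are assumptions):

test_case,status,score,duration_ms
password-reset-basic,pass,92,1840
billing-refund-edge-case,fail,41,2310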

Best Practices

  • Always run your full test suite before publishing changes. Catch regressions before users do.
  • When you find and fix a bug, add a test case to prevent it from returning.
  • A failing test might mean the agent is wrong, or it might mean the test expectation needs updating.
  • Overly specific expectations break easily. Focus on what matters, not exact wording.
  • Include tests that verify the agent uses the right tools for the right tasks.

Advanced: Tool Expectations

For agents with tools, you can specify expected tool usage:
{
  "input": "What's the weather in Paris?",
  "expected_tools": [
    { "name": "weather_lookup", "arguments": { "city": "Paris" } }
  ],
  "forbidden_tools": ["calendar"]
}

This checks that the agent:
  • Calls the expected tools with correct arguments
  • Doesn’t call forbidden tools

Next Steps

Configure settings

Set up retention, sharing, and safety guardrails

View analytics

See how users interact with your published agent