# Evaluations
Evaluations are the core of Truesight. They define how your AI outputs are assessed: what criteria to check, how judgments are made, and how individual judgments combine into a final result.
## How evaluations work
Truesight evaluations assess each output against your custom criteria using AI-powered judges. The evaluation process includes quality checks and combines multiple assessments to produce reliable, consistent results.
## Supported model providers
Truesight supports judges from four built-in model providers:
- OpenAI
- Anthropic
- Google Gemini
- Perplexity
Truesight also supports any additional provider compatible with LiteLLM, giving you access to a wide range of models.
When using Truesight's Managed API keys, evaluations draw from your credit balance. You can bring your own API keys (BYOK) for any provider from the Settings page. When configured, your custom keys take priority and credits are not consumed. See Teams & Organizations for organization-wide key management.
## Building evaluations
### Guided Setup
The recommended way to create evaluations is through Guided Setup, accessible from the sidebar. The wizard walks you through each step:
1. **Data**: Upload or select the dataset you want to evaluate. The dataset determines which columns are available to your evaluation.
2. **Judgments**: Configure the judgment columns for your expert annotations or labels. These are the columns with human assessments that Truesight uses to calibrate the evaluation.
3. **Review**: Review examples from your dataset and provide expert labels. You can toggle which data columns are visible while reviewing to focus on the content that matters.
4. **Finish**: Create the evaluation configuration. Truesight generates your evaluation from the dataset and judgment mappings you've defined.
5. **Test**: Try your evaluation on sample data to verify that it works as expected. Review the results to confirm the evaluation aligns with your expert judgment.
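The information the wizard collects can be pictured as a single configuration object. The field names below are illustrative assumptions, not Truesight's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationConfig:
    """Hypothetical shape of what Guided Setup produces."""
    dataset: str                     # step 1: the dataset to evaluate
    judgment_columns: list[str]      # step 2: expert-label columns for calibration
    reviewed_examples: int = 0       # step 3: expert labels provided during review
    criteria: list[str] = field(default_factory=list)  # step 4: generated criteria

# Example: a support-chat evaluation (all values illustrative)
config = EvaluationConfig(
    dataset="support_chats.csv",
    judgment_columns=["expert_verdict"],
    reviewed_examples=25,
    criteria=["Factual Accuracy", "Professional Tone"],
)
```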
### Manual configuration
Advanced users can create evaluations directly from the Evaluations page by clicking Create Evaluation. This gives you full control over the configuration without the step-by-step wizard.
## Running evaluations
You can test evaluations in two ways: through the Test step at the end of the Guided Setup wizard, or from the Test Evaluation page in the sidebar (for live evaluations). In both cases, the evaluation processes your inputs and produces results.
Each result includes:
- The final judgment for each criterion
- Reasoning from each judge
- Appeal outcomes (if applicable)
- The aggregated final result
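One way the per-judge pieces listed above could combine into a final result is a simple majority vote. The vote rule and record shape here are assumptions for illustration, not Truesight's documented aggregation method:

```python
from collections import Counter

def aggregate(judgments: list[dict]) -> dict:
    """Combine per-judge verdicts into one result.

    Each judgment is assumed to look like:
    {"judge": str, "verdict": "pass" | "fail", "reasoning": str}
    """
    tally = Counter(j["verdict"] for j in judgments)
    final, _ = tally.most_common(1)[0]  # majority verdict wins
    return {
        "final_judgment": final,
        "reasoning": [j["reasoning"] for j in judgments],  # kept per judge
        "votes": dict(tally),
    }

result = aggregate([
    {"judge": "a", "verdict": "pass", "reasoning": "Facts check out."},
    {"judge": "b", "verdict": "pass", "reasoning": "Pricing is accurate."},
    {"judge": "c", "verdict": "fail", "reasoning": "One policy is wrong."},
])
```

Keeping each judge's reasoning alongside the aggregated verdict is what makes the per-criterion results reviewable.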
## Testing and iteration
Evaluations rarely work perfectly on the first try. The key to building good evaluations is iterating:
- Run on a sample: Start with a small dataset to see how the evaluation performs
- Review edge cases: Look at examples where the evaluation disagrees with your judgment
- Adjust criteria: Tighten or loosen descriptions based on what you see
- Refine instructions: Give judges more specific guidance for ambiguous cases
- Re-run and compare: Track improvement across iterations
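For the re-run-and-compare step, one concrete number to track across iterations is how often the evaluation's verdicts agree with your expert labels. This helper is purely a sketch, not a Truesight API:

```python
def agreement_rate(eval_verdicts: list[str], expert_labels: list[str]) -> float:
    """Fraction of examples where the evaluation matches the expert label."""
    if len(eval_verdicts) != len(expert_labels):
        raise ValueError("Verdicts and labels must align one-to-one")
    matches = sum(v == e for v, e in zip(eval_verdicts, expert_labels))
    return matches / len(expert_labels)

# Illustrative iteration: agreement improving between runs
run_1 = agreement_rate(["pass", "fail", "pass", "pass"],
                       ["pass", "pass", "pass", "fail"])  # 0.5
run_2 = agreement_rate(["pass", "pass", "pass", "fail"],
                       ["pass", "pass", "pass", "fail"])  # 1.0
```

A rising agreement rate across runs is a simple signal that your criteria and judge instructions are converging on your expert judgment.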
## Evaluation variants and the Leaderboard
Truesight supports creating variants of an evaluation: modified versions that you can compare against the original. The Leaderboard shows how variants perform relative to each other, helping you identify the best configuration.
This is useful for:
- A/B testing different judge instructions
- Comparing models across providers
- Experimenting with different criteria formulations
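Conceptually, a leaderboard ordering is just variants ranked by a quality score, such as agreement with expert labels. The variant names and scores below are made up for illustration:

```python
# Hypothetical variant scores (not real Truesight data)
variants = [
    {"name": "baseline", "agreement": 0.78},
    {"name": "stricter-instructions", "agreement": 0.86},
    {"name": "gemini-judge", "agreement": 0.81},
]

# Rank variants from best to worst agreement
leaderboard = sorted(variants, key=lambda v: v["agreement"], reverse=True)
best = leaderboard[0]["name"]
```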
## Example criteria
Good criteria descriptions tell the judge exactly what to look for and where the pass/fail boundary is. Here are examples for a customer support chatbot:
| Criterion | Description |
|---|---|
| Factual Accuracy | The response contains only correct information about company policies, products, and procedures. Pass if all stated facts are accurate. Fail if the response includes fabricated policies, wrong prices, or incorrect procedures. |
| Professional Tone | The response is warm, empathetic, and uses language appropriate for the audience. Pass if the tone is helpful and constructive. Fail if the response is dismissive, overly casual, or robotic. |
| Response Completeness | The response addresses the customer's question fully and provides next steps when applicable. Pass if the customer has all the information they need to proceed. Fail if key details or next steps are missing. |
Each description defines what "good" looks like and draws a clear line between pass and fail. This precision helps judges produce consistent, reliable results.
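To see why precise descriptions matter, consider how a criterion might be turned into a judge prompt. The template below is an assumption for illustration, not Truesight's actual prompt format; the criterion text is taken from the table above:

```python
# Criterion descriptions as data (from the Example criteria table)
CRITERIA = {
    "Factual Accuracy": (
        "The response contains only correct information about company "
        "policies, products, and procedures. Pass if all stated facts are "
        "accurate. Fail if the response includes fabricated policies, "
        "wrong prices, or incorrect procedures."
    ),
}

def judge_prompt(criterion: str, output: str) -> str:
    """Build a judge prompt from a criterion description (illustrative)."""
    return (
        f"Criterion: {criterion}\n"
        f"Definition: {CRITERIA[criterion]}\n\n"
        f"Output to assess:\n{output}\n\n"
        "Answer 'pass' or 'fail' and explain your reasoning."
    )

prompt = judge_prompt("Factual Accuracy", "Refunds are processed in 5 days.")
```

Because the description states the pass/fail boundary explicitly, the judge receives the decision rule verbatim rather than having to infer it.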
## Best practices
- Start simple: Begin with 2 to 3 criteria and add more as needed
- Write clear descriptions: The quality of your criteria descriptions directly affects judge accuracy
- Test before deploying: Always run evaluations on a test dataset before going live
- Use Error Analysis first: Criteria grounded in real data patterns outperform criteria written in the abstract
- Iterate frequently: Small, incremental improvements compound over time