A guide on implementing effective AI evaluations


In the rapidly evolving landscape of AI product development, one skill has emerged as critically important yet often overlooked: the ability to write and implement effective AI evaluations, or "evals" as they're commonly called.

As Kevin Weil, CPO of OpenAI, put it succinctly: "Writing evals is going to become a core skill for product managers. It is such a critical part of making a good product with AI." 

This guide demystifies AI evals for product managers, providing practical frameworks, real-world examples, and actionable strategies to help you build more reliable AI products. 

AI evals are structured measurement frameworks that assess how well your AI system performs against defined criteria. Unlike traditional software testing with clear pass/fail scenarios, AI evals measure quality and effectiveness across multiple dimensions. 

To understand why evals matter, consider the case of Microsoft's Tay chatbot from 2016. Released on X (formerly Twitter) as an experiment in conversational understanding, Tay was designed to learn from interactions with users. Within 24 hours, Microsoft had to take Tay offline after it began publishing increasingly offensive and inflammatory posts. The chatbot had been manipulated by users who exploited its learning algorithm, teaching it to parrot harmful content.

There is also a more recent example from Meta. In 2022, Meta released Galactica, an advanced language model for scientific research. Despite initial excitement, Meta pulled the demo within three days because the model generated wildly inaccurate and sometimes offensive scientific content. Both cases demonstrate how proper evals could have identified vulnerabilities before public release, preventing both harmful outputs and significant brand damage. 

"If there is one thing we can teach people, it's that writing evals is probably the most important thing," — Mike Krieger, CPO of Anthropic

The anatomy of an effective AI eval

Effective AI evals typically include four key components: setting the role, providing the context, defining the goal, and defining the scoring criteria.

Let's examine each through a customer support AI assistant example: 

1. Setting the role: Establishing the context for the evaluating system 

“You are a customer support quality evaluator with expertise in assessing response effectiveness.” 

This establishes who the evaluator is supposed to be and what expertise they should apply when reviewing the interaction. 

2. Providing the context: Supplying the data to be evaluated 

“CONTEXT: 

[BEGIN DATA] 

User: "I ordered headphones with express shipping, but my order still shows 'Processing'. I need them for a presentation on Friday." 

Assistant: "I understand your concern. I see your order #38291 is still 'Processing' despite express shipping. There was a warehouse delay, but your order has been

prioritized for overnight delivery. You'll receive them by Thursday, in time for Friday. I've added a $10 credit for the inconvenience. Would you like a shipping confirmation email?" 

[END DATA]” 

This supplies the complete interaction that needs evaluation, giving the evaluator all the necessary information to make an informed assessment. 

3. Defining the goal: Clearly stating what success looks like

“EVALUATION TASK: 

Evaluate whether the assistant's response effectively addresses the customer's issue. Consider: 

1. Did the assistant acknowledge the customer's specific concern? 

2. Did the assistant provide a clear explanation of the issue? 

3. Did the assistant offer an appropriate solution that meets the customer's need (getting headphones before Friday)? 

4. Did the assistant take appropriate action to remedy the service failure?” 

This clearly defines what aspects should be evaluated and breaks down the specific criteria for success, guiding the evaluator to focus on the most important elements. 

4. Defining scoring criteria: Ensuring consistency in evaluation 

“Rate the response on a 5-point scale: 

1 = Failed to address the issue 

2 = Partially addressed the issue with significant gaps 

3 = Adequately addressed the issue but could improve 

4 = Effectively addressed the issue with minor improvements possible 

5 = Perfectly addressed the issue with appropriate solution and compensation 

Your response must be a single number between 1 and 5.” 

This establishes a consistent rating system with clear definitions for each point on the scale, ensuring evaluations are standardized and comparable.

By structuring evaluations using these four components, you can create a comprehensive framework that ensures evaluators understand their role, have complete context, know exactly what to assess, and use consistent criteria for providing feedback. 
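To make this concrete, here is a minimal Python sketch of how the four components could be assembled into a single judge prompt. The build_eval_prompt helper and its arguments are illustrative, not part of any particular library:

# Sketch only: compose the four components (role, context, goal, scoring)
# into one judge prompt. The helper and argument names are illustrative.
def build_eval_prompt(role: str, transcript: str, task: str, scoring: str) -> str:
    """Combine the four eval components into a single prompt for a judge."""
    return (
        f"{role}\n\n"
        f"CONTEXT:\n[BEGIN DATA]\n{transcript}\n[END DATA]\n\n"
        f"EVALUATION TASK:\n{task}\n\n"
        f"{scoring}"
    )

prompt = build_eval_prompt(
    role="You are a customer support quality evaluator with expertise in assessing response effectiveness.",
    transcript='User: "I ordered headphones with express shipping..."\nAssistant: "I understand your concern..."',
    task="Evaluate whether the assistant's response effectively addresses the customer's issue.",
    scoring="Rate the response on a 5-point scale. Your response must be a single number between 1 and 5.",
)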

Different evaluation approaches serve different purposes in your AI product development cycle:

1. Human evals 

Description: Actual users or expert evaluators provide direct feedback on AI outputs. 

Implementation: Add feedback mechanisms like thumbs up/down buttons or rating scales to your product interface. 

Best for: High-stakes decisions, creative content, establishing "golden" examples. 

Example: Spotify's Podcast AI Summary feature initially used human evaluators to rate summary quality across dimensions like accuracy, comprehensiveness, and readability before building automated evaluation systems. 

2. LLM-as-judge evals 

Description: Using another LLM to evaluate your primary LLM's outputs. 

Implementation: Create prompts that instruct a "judge" LLM to evaluate specific aspects of your primary LLM's responses. 

Best for: Scaling evaluations cost-effectively, consistent application of evaluation criteria. 

Example prompt structure: 

You are evaluating the factual accuracy of an AI assistant's response. 

User query: "What year was the first iPhone released?" 

Assistant response: "The first iPhone was released in 2007." 

Assess whether the assistant's response contains any factual errors. 

Your response should be only "accurate" or "inaccurate". 
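In practice, this prompt can be sent to a judge model with a few lines of code. Below is a minimal sketch assuming the OpenAI Python client and an OPENAI_API_KEY in the environment; the model name is a placeholder and any judge model or provider could be swapped in:

# Minimal LLM-as-judge call. Assumes the OpenAI Python client is installed
# and OPENAI_API_KEY is set; any judge model or provider could be swapped in.
from openai import OpenAI

client = OpenAI()

def judge_factual_accuracy(query: str, response: str) -> str:
    prompt = (
        "You are evaluating the factual accuracy of an AI assistant's response.\n\n"
        f'User query: "{query}"\n'
        f'Assistant response: "{response}"\n\n'
        "Assess whether the assistant's response contains any factual errors.\n"
        'Your response should be only "accurate" or "inaccurate".'
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    verdict = result.choices[0].message.content.strip().lower()
    return verdict if verdict in {"accurate", "inaccurate"} else "invalid"

print(judge_factual_accuracy("What year was the first iPhone released?",
                             "The first iPhone was released in 2007."))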

3. Code-based evals 

Description: Programmatic checks that verify specific aspects of AI outputs.

Implementation: Write functions that test for particular patterns, values, or behaviors in AI responses. 

Best for: Objective criteria like format compliance, runtime errors in generated code, or data validation. 

Example: A function that validates whether AI-generated SQL queries run without syntax errors. 
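As a sketch of what such a check might look like, the function below uses Python's built-in sqlite3 module to compile a generated query against an in-memory copy of the schema; the table definition here is invented for illustration:

# Code-based eval sketch: does an AI-generated SQL query even compile?
# Uses Python's built-in sqlite3; the schema below is a made-up placeholder.
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER, status TEXT, shipped_at TEXT);"

def sql_query_is_valid(query: str) -> bool:
    """Return True if the query parses and plans against the test schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        # EXPLAIN makes SQLite parse and plan the query without executing it,
        # so syntax errors and unknown tables/columns surface as exceptions.
        conn.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

assert sql_query_is_valid("SELECT id FROM orders WHERE status = 'Processing'")
assert not sql_query_is_valid("SELEC id FROM orders")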

It's also worth understanding the most common eval criteria for AI outputs. While frameworks and methodologies are crucial, knowing what to measure is equally important. The most effective AI teams consistently evaluate these key dimensions: 

Accuracy 

Is the information factually correct? Track error rates and verify outputs against reliable sources. Example metric: Percentage of factual errors per response. 

Relevance 

Does the response actually address what was asked? Use a simple 1-5 scale to measure how well the AI tackles the specific query. Example metric: Relevance score on a 1-5 scale where 5 means the response directly addresses all aspects of the query. 

Coherence 

Can users follow the logic from start to finish? Evaluate the flow of ideas and logical connections between concepts. Example metric: 3-point scale (low/medium/high) rating of response coherence. 

Completeness 

Does the response cover all parts of the question? Check whether all aspects of the query were addressed without requiring follow-ups. Example metric: Percentage of query aspects covered in the response. 

Helpfulness 

Can users actually achieve their goals with this response? This might be the most important criterion—if the response doesn't enable action, users won't return. Example metric: Helpfulness rating from 1-5 based on how well the response enables user action. 

Safety 

Does the output avoid harmful content? Track policy violations and implement robust safety evaluations to prevent misuse. Example metric: Binary flag for presence of unsafe content or numeric risk score. 

Tone and style

Does the AI match your brand voice? Measure outputs against your communication guidelines to ensure consistency. Example metric: Style compliance score rating how well the output matches target tone. 

Contextual awareness 

Does the AI remember previous interactions? Evaluate how well the system maintains conversation history in multi-turn exchanges. Example metric: Percentage of contextual errors in multi-turn conversations. 

Creativity (when appropriate) 

For creative applications, does the AI bring fresh ideas? Balance originality with usefulness. Example metric: Creativity rating on applications where innovation is desired. 

Different applications will naturally prioritize different criteria. A medical assistant might emphasize accuracy and safety, while a creative writing tool might value coherence and creativity. The key is determining which dimensions matter most for your specific use case and weighting your evaluation framework accordingly. 
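As a hedged sketch, that weighting can be as simple as a dictionary of per-dimension weights applied to normalized scores; the dimensions and numbers below are illustrative, not a recommendation:

# Sketch: combine per-dimension eval scores (normalized to 0-1) into one
# weighted quality score. The dimensions and weights are illustrative only.
WEIGHTS = {
    "accuracy": 0.35,
    "safety": 0.25,
    "helpfulness": 0.20,
    "relevance": 0.15,
    "tone": 0.05,
}

def weighted_quality(scores: dict) -> float:
    """scores maps dimension name -> normalized score in [0, 1]."""
    used = {d: s for d, s in scores.items() if d in WEIGHTS}
    total_weight = sum(WEIGHTS[d] for d in used)
    return sum(WEIGHTS[d] * s for d, s in used.items()) / total_weight

print(weighted_quality({"accuracy": 0.9, "safety": 1.0, "helpfulness": 0.6,
                        "relevance": 0.8, "tone": 0.7}))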

When implementing an evaluation system, follow these steps: 

1. Create golden examples: Collect 50-100 diverse user queries and manually label their underlying intent or expected responses 

2. Develop evaluation prompts: Create structured prompts that clearly define what success looks like

3. Generate synthetic test data: Use an LLM to create variations of your golden examples to increase test coverage 

4. Compare model performance: For side-by-side model evaluations, tools like Google's LLM Comparator can help visualize where and why outputs differ between models or prompt versions. This open-source tool lets you slice response data, identify performance patterns, and understand rationales for why one output might be superior to another. 

5. Build an auto-rater: Automate the evaluation process using evaluation prompts with a judge LLM. 

6. Maintain a dashboard: Track performance metrics over time, with breakdowns by relevant dimensions.
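To illustrate how steps 1, 2, and 5 connect, here is a compact sketch: golden examples stored as structured records plus an auto-rater loop that reports a pass rate. The generate_response and judge_with_llm parameters are stand-ins for your own system and for a judge call like the one sketched earlier:

# Sketch: golden examples as structured records plus an auto-rater loop.
# generate_response and judge_with_llm are stand-ins for your own system
# and for a judge call like the one sketched earlier.
from dataclasses import dataclass

@dataclass
class GoldenExample:
    query: str
    expected_intent: str  # manually labeled intent or reference answer

GOLDEN_SET = [
    GoldenExample("Where are my headphones?", "order_status"),
    GoldenExample("I want a refund for order 38291", "refund_request"),
    # ...in practice, 50-100 diverse, manually labeled examples
]

def run_eval(generate_response, judge_with_llm) -> float:
    """Return the fraction of golden examples the system handles correctly."""
    passed = 0
    for example in GOLDEN_SET:
        response = generate_response(example.query)
        verdict = judge_with_llm(example.query, response, example.expected_intent)
        passed += verdict == "correct"
    return passed / len(GOLDEN_SET)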

The implementation process follows similar steps across all dimensions, with specific adaptations for each. Let's look at how to implement evals for each dimension: 

1. Create golden examples

For all dimensions, you'll need to collect 50-100 diverse examples that represent the range of user interactions your system will encounter. 

2. Develop evaluation prompts 

Tailor your evaluation prompts to each dimension: 

For understanding user intent: 

You are evaluating whether a shopping assistant correctly understands a user's intent. 

User query: "{user_query}" 

Assistant's understanding: "{assistant_response}" 

Actual intent: "{labeled_intent}" 

Determine if the assistant correctly understood the user's primary intent. Answer ONLY with "correct" or "incorrect". 

For relevance of recommendations: 

You are evaluating the relevance of product recommendations. 

User request: "{user_request}" 

User preferences: "{budget}, {occasion}, {stated_preferences}" 

Assistant's recommendations: "{product_list}" 

Rate how well the recommendations match the user's stated preferences on a scale of 1-5. Answer ONLY with a single number. 

For product knowledge accuracy: This is typically implemented as code rather than a prompt.
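For example, a code-based accuracy check might compare the assistant's claimed product attributes against a catalog lookup. In the sketch below, the catalog, SKUs, and field names are invented for illustration:

# Sketch: code-based accuracy check comparing the assistant's claimed product
# attributes against a catalog lookup. Catalog, SKUs, and fields are invented.
CATALOG = {
    "SKU-1001": {"name": "Aero Buds", "price": 79.99, "battery_hours": 24},
}

def product_claims_are_accurate(sku: str, claimed: dict) -> bool:
    """Return True only if every claimed attribute matches the catalog record."""
    record = CATALOG.get(sku)
    if record is None:
        return False
    return all(record.get(field) == value for field, value in claimed.items())

assert product_claims_are_accurate("SKU-1001", {"price": 79.99})
assert not product_claims_are_accurate("SKU-1001", {"battery_hours": 40})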

3. Generate synthetic test data 

Use LLMs to create variations of your golden examples for each dimension. For example: 

● For User Intent: Generate different ways users might ask for gifts 

● For Relevance: Create diverse preference combinations 

● For Accuracy: Include edge cases with unusual product attributes 
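A hedged sketch of generating those variations with an LLM, reusing the same client pattern as the judge example (the model name and prompt wording are assumptions):

# Sketch: ask an LLM for paraphrased variations of a golden query to widen
# test coverage. Reuses the OpenAI client pattern; the model is a placeholder.
from openai import OpenAI

client = OpenAI()

def generate_variations(golden_query: str, n: int = 5) -> list:
    prompt = (
        f"Rewrite the following user request in {n} different ways, varying "
        "tone, length, and vocabulary, but keeping the same intent. "
        "Return one variation per line.\n\n"
        f'Request: "{golden_query}"'
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = result.choices[0].message.content.splitlines()
    return [line.strip("-• ").strip() for line in lines if line.strip()]

variations = generate_variations("I need a gift for my dad under $50")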

4. Build dimension-specific auto-raters 

Implement automation tailored to each dimension's needs: 

● Intent and Relevance: Use judge LLMs with the appropriate evaluation prompts 

● Product Accuracy: Use database validation scripts 

● Conversation Flow: Combine human evaluation with automated context tracking 

● Safety: Implement both content policy checkers and LLM judges 
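One hedged way to wire these together is a dispatch table that routes each dimension to the right kind of evaluator. The stub functions below stand in for the LLM-judge calls and code-based checks sketched earlier:

# Sketch: a dispatch table that routes each dimension to the right evaluator.
# Each evaluator takes an example dict and returns a score in [0, 1]; the
# stubs stand in for the LLM-judge and code-based checks sketched above.
from typing import Callable, Dict

def judge_intent(example: dict) -> float:
    return 1.0  # stand-in for an LLM judge returning correct/incorrect

def judge_relevance(example: dict) -> float:
    return 0.8  # stand-in for a 1-5 LLM rating rescaled to [0, 1]

def check_product_accuracy(example: dict) -> float:
    return 1.0  # stand-in for the catalog lookup sketched earlier

EVALUATORS: Dict[str, Callable[[dict], float]] = {
    "intent": judge_intent,
    "relevance": judge_relevance,
    "product_accuracy": check_product_accuracy,
}

def evaluate(dimension: str, example: dict) -> float:
    return EVALUATORS[dimension](example)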

5. Create a unified dashboard 

Track performance metrics over time, with breakdowns by relevant dimensions. This helps you identify trends and areas for improvement. 
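A minimal sketch of the logging side, assuming a simple CSV that a dashboard or notebook can chart; the file name and metric fields are illustrative:

# Sketch: append each eval run to a CSV so a dashboard or notebook can chart
# per-dimension scores over time. The file name and fields are illustrative.
import csv
import os
from datetime import date

def log_eval_run(path: str, scores: dict) -> None:
    """Append one row per eval run: a date plus one column per dimension."""
    fieldnames = ["run_date"] + sorted(scores)
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow({"run_date": date.today().isoformat(), **scores})

log_eval_run("eval_history.csv",
             {"intent": 0.92, "relevance": 0.81, "product_accuracy": 0.97})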

Watch out for these common pitfalls, each paired with a way to avoid it: 

Vague evaluation criteria - Define specific, measurable success criteria for each dimension 

Inconsistent human evaluations - Create detailed rubrics with examples for each rating level 

Over-reliance on LLM judges - Periodically validate LLM evaluations against human judgments 

Focusing only on correctness - Evaluate multiple dimensions including tone, creativity, and usefulness 

Neglecting edge cases - Intentionally include challenging test cases like ambiguous queries 

Case study: Harvey's BigLaw Bench 

The challenge 

Harvey, a legal AI assistant company, needed a comprehensive evaluation framework to measure how effectively AI models could perform real-world legal tasks. Existing benchmarks relied on multiple-choice questions or one-size-fits-all criteria, which failed to capture the complexity of actual legal work performed by lawyers. 

The eval approach 

The Harvey team developed "BigLaw Bench" - a sophisticated evaluation framework tailored to real legal work: 

1. Task-specific rubrics: They created detailed, bespoke evaluation criteria for each legal task, mapped to actual billable work lawyers perform daily. 

2. Multi-dimensional scoring: Their evaluation methodology included: 

○ Answer scores: Positive points for meeting task requirements 

○ Source scores: Measuring how well models cited sources for their claims 

○ Penalty system: Negative points for hallucinations, irrelevant content, and other LLM failure modes 

3. Comprehensive coverage: The framework covered both transactional and litigation tasks across multiple practice areas, including document drafting, legal research, due diligence, and case management. 

4. Real-world applications: All tasks were derived from actual time entries billed by lawyers, ensuring the benchmark reflected genuine legal work. 

The results 

This evaluation framework revealed significant insights about AI model performance: 

● Leading foundation models (GPT-4o, Claude 3.5, Gemini 1.5) produced reasonably strong answers, solving the "blank page problem" and getting users more than halfway to a final work product 

● However, these models often struggled with providing traceable sources for their claims (source scores of 8-24%) 

● Harvey's specialized legal models outperformed general-purpose models, achieving 74% of lawyer-quality work while maintaining higher source scores (68%) 

● Transactional tasks generally saw better performance than litigation tasks, which required more complex reasoning and argumentation 

The evaluation framework became central to Harvey's product development, providing clear metrics on which to improve their models while identifying specific shortcomings in general-purpose AI when applied to specialized legal tasks. It also established an objective standard for comparing different AI solutions in the legal domain. Read more in Harvey's "Introducing BigLaw Bench" post, linked in the resources below.

While business metrics like user engagement and conversion rates remain important, AI products require additional monitoring dimensions. For each dimension you evaluate, define the key metrics to track and the warning signs that should trigger investigation. 

As AI becomes integrated into more products, the ability to rigorously evaluate AI systems will increasingly distinguish exceptional product managers from the rest. By mastering the art and science of AI evals, you'll not only build better products but also position yourself as a leader in this rapidly evolving field. 

The product managers who thrive in the AI era will be those who can translate subjective human preferences into structured evaluation frameworks, continuously measuring and improving their AI systems across multiple dimensions. Start building your eval toolkit today, and you'll be well-prepared for the AI-driven product landscape of tomorrow.

● "Beyond vibe checks: A PM's complete guide to evals" by Aman Khan 

The AI Skill That Will Define Your PM Career in 2025 podcast interview with Aman Khan

A Guide to Evaluating LLM-Powered for Product Managers by Product Faculty 

Introducing BigLaw Bench by Harvey 

Google LLM Comparator