Evals for AI Agents

  • Free Course
  • 23 Lessons

Walk away with a working eval suite for your AI agents: not a generic benchmark, but evals built from real failure modes and human judgment. Learn the three eval types, when to use each, and how to tie them into a regression loop that protects your product every time you ship. Starting next week!

Content

1. What are Agent Evaluations?

Learn the fundamentals of AI evals, the three core types, and the data you need to get started. A short code sketch of the regression loop follows this module's lesson list.

1. Introduction to Evals
2. Generic vs. Targeted Evals
3. Regression Testing: Online vs. Offline Evals
4. The 3 Data Pillars of Evaluation
Wrap-Up Quiz
Practical 1: Intro to Course Project
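
To make the regression-loop idea concrete before the course starts, here is a minimal offline sketch. Everything in it is illustrative: `run_agent`, the one-row golden dataset, and the exact-match check are stand-ins for your own agent, data, and evals.

```python
# Minimal sketch of an offline regression loop (illustrative, not the
# course's reference implementation). `run_agent` and the dataset are
# placeholders for your own agent and golden data.

def run_agent(prompt: str) -> str:
    """Placeholder for the agent under test."""
    return "PARIS"

golden_dataset = [
    {"input": "Capital of France?", "expected": "paris"},
]

def exact_match(output: str, expected: str) -> bool:
    # One of many possible checks; real suites mix several eval types.
    return output.strip().lower() == expected

def run_regression(dataset) -> float:
    passed = sum(
        exact_match(run_agent(case["input"]), case["expected"])
        for case in dataset
    )
    return passed / len(dataset)

if __name__ == "__main__":
    score = run_regression(golden_dataset)
    print(f"pass rate: {score:.0%}")  # gate a release on this, e.g. in CI
```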

2. Human-in-the-Loop Evals

Understand why human judgment is the ground truth that all other evals are measured against, and how to use it without letting it become a bottleneck. A sketch of turning annotations into failure-mode patterns follows the lesson list.

6. Human-in-the-Loop Evaluation
7. Designing Human Evaluations
8. From Annotations to Patterns
Wrap-Up Quiz
Practical 2: Observing and Annotating Your Traces
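
As a preview of how annotations become patterns, here is an illustrative sketch; the `Annotation` fields and the failure-mode labels are assumptions, not a prescribed schema.

```python
# Illustrative sketch: a human annotation record plus simple pattern
# counting. Field names and labels are assumptions for this example.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Annotation:
    trace_id: str
    verdict: str        # e.g. "pass" or "fail"
    failure_mode: str   # free-text label the annotator assigns
    note: str = ""

annotations = [
    Annotation("t1", "fail", "ignored_user_constraint"),
    Annotation("t2", "fail", "hallucinated_tool_output"),
    Annotation("t3", "fail", "ignored_user_constraint"),
    Annotation("t4", "pass", ""),
]

# Count recurring failure modes: the most common ones are the natural
# candidates for the automated evals built in later modules.
failure_counts = Counter(
    a.failure_mode for a in annotations if a.verdict == "fail"
)
print(failure_counts.most_common())
```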

3. LLM-as-Judge

Learn to automate human-like judgment at scale by using a model to score your agent's outputs against criteria you define. A minimal judge sketch follows the lesson list.

10. LLM-as-Judge
11. When to Use LLM-as-Judge
12. Building Effective Judge Prompts
Wrap-Up Quiz
Practical 3: Creating Evaluators from Issues
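
Here is a minimal sketch of the LLM-as-judge pattern; `call_model` is a stand-in for whatever LLM client you use, and the rubric and JSON output format are assumptions, not the course's prescribed prompt.

```python
# Minimal LLM-as-judge sketch. `call_model` is a placeholder for a real
# LLM API call; the rubric and output format below are assumptions.
import json

JUDGE_PROMPT = """You are grading an AI agent's answer.
Criteria: the answer must be factually correct and must directly
address the user's question.

Question: {question}
Agent answer: {answer}

Reply with JSON: {{"score": 0 or 1, "reason": "<one sentence>"}}"""

def call_model(prompt: str) -> str:
    """Placeholder: swap in your actual LLM client here."""
    return '{"score": 1, "reason": "Correct and on-topic."}'

def judge(question: str, answer: str) -> dict:
    raw = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # real code should handle malformed JSON

print(judge("Capital of France?", "Paris"))
```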

4. Programmatic Rules

Build your first line of defense with deterministic checks that catch structural and compliance failures before they reach users. Example rules are sketched after the lesson list.

14. Programmatic Rule Evaluations
15. When to Use Programmatic Rules
16. Designing Effective Programmatic Rules
17. Integrating the 3 Types of Evals
Wrap-Up Quiz
Practical 4: Creating a Golden Dataset
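
To show what deterministic checks can look like in practice, here is a small sketch; the specific rules (valid JSON, forbidden phrases, a length cap) are examples, and real rules should come from your own product's structural and compliance requirements.

```python
# Sketch of deterministic, first-line-of-defense checks. The rules here
# are examples only; derive real ones from your own requirements.
import json

def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def contains_forbidden_phrase(output: str) -> bool:
    forbidden = ["as an ai language model", "system prompt"]
    lowered = output.lower()
    return any(phrase in lowered for phrase in forbidden)

def within_length(output: str, max_chars: int = 2000) -> bool:
    return len(output) <= max_chars

def check(output: str) -> list[str]:
    """Return names of failed rules; an empty list means all pass."""
    failures = []
    if not is_valid_json(output):
        failures.append("invalid_json")
    if contains_forbidden_phrase(output):
        failures.append("forbidden_phrase")
    if not within_length(output):
        failures.append("too_long")
    return failures

print(check('{"answer": "Paris"}'))  # -> []
```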

You made it!

Get Your Certificate