Evals for AI Agents

  • Free Course
  • 23 Lessons

Walk away with a working eval suite for your AI agents: not a generic benchmark, but evals built from real failure modes and human judgment. Learn the three eval types, when to use each, and how to tie them into a regression loop that protects your product every time you ship. Starting next week!

Content

1. What are Agent Evaluations?

Learn the fundamentals of AI evals, the three core types, and the data you need to get started. A short code sketch of the regression loop follows this module's lesson list.

1. Introduction to Evals
2. Generic vs. Targeted Evals
3. Regression Testing: Online vs. Offline Evals
4. The 3 Data Pillars of Evaluation
Wrap-Up Quiz
Practical 1: Intro to Course Project
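
To make the regression-loop idea concrete before the course starts, here is a minimal offline sketch. Everything in it is illustrative: `run_agent`, the one-row golden dataset, and the exact-match check are stand-ins for your own agent, data, and evals.

```python
# Minimal sketch of an offline regression loop (illustrative, not the
# course's reference implementation). `run_agent` and the dataset are
# placeholders for your own agent and golden data.

def run_agent(prompt: str) -> str:
    """Placeholder for the agent under test."""
    return "PARIS"

golden_dataset = [
    {"input": "Capital of France?", "expected": "paris"},
]

def exact_match(output: str, expected: str) -> bool:
    # One of many possible checks; real suites mix several eval types.
    return output.strip().lower() == expected

def run_regression(dataset) -> float:
    passed = sum(
        exact_match(run_agent(case["input"]), case["expected"])
        for case in dataset
    )
    return passed / len(dataset)

if __name__ == "__main__":
    score = run_regression(golden_dataset)
    print(f"pass rate: {score:.0%}")  # gate a release on this, e.g. in CI
```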

2. Human-in-the-Loop Evals

Understand why human judgment is the ground truth that all other evals are measured against, and how to use it without letting it become a bottleneck. A sketch of turning annotations into failure-mode patterns follows the lesson list.

6. Human-in-the-Loop Evaluation
7. Designing Human Evaluations
8. From Annotations to Patterns
Wrap-Up Quiz
Practical 2: Observing and Annotating Your Traces
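
As a preview of how annotations become patterns, here is an illustrative sketch; the `Annotation` fields and the failure-mode labels are assumptions, not a prescribed schema.

```python
# Illustrative sketch: a human annotation record plus simple pattern
# counting. Field names and labels are assumptions for this example.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Annotation:
    trace_id: str
    verdict: str        # e.g. "pass" or "fail"
    failure_mode: str   # free-text label the annotator assigns
    note: str = ""

annotations = [
    Annotation("t1", "fail", "ignored_user_constraint"),
    Annotation("t2", "fail", "hallucinated_tool_output"),
    Annotation("t3", "fail", "ignored_user_constraint"),
    Annotation("t4", "pass", ""),
]

# Count recurring failure modes: the most common ones are the natural
# candidates for the automated evals built in later modules.
failure_counts = Counter(
    a.failure_mode for a in annotations if a.verdict == "fail"
)
print(failure_counts.most_common())
```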

3. LLM-as-Judge

Learn to automate human-like judgment at scale by using a model to score your agent's outputs against criteria you define. A minimal judge sketch follows the lesson list.

10. LLM-as-Judge
11. When to Use LLM-as-Judge
12. Building Effective Judge Prompts
Wrap-Up Quiz
Practical 3: Creating Evaluators from Issues
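
Here is a minimal sketch of the LLM-as-judge pattern; `call_model` is a stand-in for whatever LLM client you use, and the rubric and JSON output format are assumptions, not the course's prescribed prompt.

```python
# Minimal LLM-as-judge sketch. `call_model` is a placeholder for a real
# LLM API call; the rubric and output format below are assumptions.
import json

JUDGE_PROMPT = """You are grading an AI agent's answer.
Criteria: the answer must be factually correct and must directly
address the user's question.

Question: {question}
Agent answer: {answer}

Reply with JSON: {{"score": 0 or 1, "reason": "<one sentence>"}}"""

def call_model(prompt: str) -> str:
    """Placeholder: swap in your actual LLM client here."""
    return '{"score": 1, "reason": "Correct and on-topic."}'

def judge(question: str, answer: str) -> dict:
    raw = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # real code should handle malformed JSON

print(judge("Capital of France?", "Paris"))
```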

4. Programmatic Rules

Build your first line of defense with deterministic checks that catch structural and compliance failures before they reach users. Example rules are sketched after the lesson list.

14. Programmatic Rule Evaluations
15. When to Use Programmatic Rules
16. Designing Effective Programmatic Rules
17. Integrating the 3 Types of Evals
Wrap-Up Quiz
Practical 4: Creating a Golden Dataset
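
To show what deterministic checks can look like in practice, here is a small sketch; the specific rules (valid JSON, forbidden phrases, a length cap) are examples, and real rules should come from your own product's structural and compliance requirements.

```python
# Sketch of deterministic, first-line-of-defense checks. The rules here
# are examples only; derive real ones from your own requirements.
import json

def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def contains_forbidden_phrase(output: str) -> bool:
    forbidden = ["as an ai language model", "system prompt"]
    lowered = output.lower()
    return any(phrase in lowered for phrase in forbidden)

def within_length(output: str, max_chars: int = 2000) -> bool:
    return len(output) <= max_chars

def check(output: str) -> list[str]:
    """Return names of failed rules; an empty list means all pass."""
    failures = []
    if not is_valid_json(output):
        failures.append("invalid_json")
    if contains_forbidden_phrase(output):
        failures.append("forbidden_phrase")
    if not within_length(output):
        failures.append("too_long")
    return failures

print(check('{"answer": "Paris"}'))  # -> []
```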

You made it!

Get Your Certificate