
Hamel Husain & Shreya Shankar

Why AI evals are the hottest new skill for product builders, covering AI product work, product design, and measurement and analysis.

September 25, 2025·19,553 words
AI & Machine Learning, Growth & Metrics, Leadership & Management, Product Strategy, Startup Building, Design & UX, Engineering, Sales & GTM, Career & Personal Growth, User Psychology, Data & Analytics

Episode

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Summary

Hamel Husain and Shreya Shankar walk through a complete live demonstration of building AI evals using a real property management AI assistant as the example — covering open coding error analysis, axial coding with LLMs, and building an LLM-as-judge. They address major misconceptions and explain why this social-science-inspired process is the highest-ROI activity for AI product builders.

Key Takeaways

1

Start evals with manual "open coding": sample ~100 traces, write one note per trace on the most upstream error, stop when you stop learning new things (theoretical saturation).

2

Designate a "benevolent dictator" — one domain expert who owns error analysis — rather than a committee.

3

Don't automate the open coding step with an LLM: models lack product context to know whether behavior is actually wrong.

4

After open coding, use an LLM to cluster notes into "axial codes" (failure mode categories), then auto-classify future traces into those buckets.

5

An LLM-as-judge should use binary pass/fail scoring and be seeded with labeled examples; you typically need only 4-7 judge prompts per product.
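To make takeaway 5 concrete, here is a minimal sketch of assembling a binary pass/fail judge prompt seeded with labeled examples. The `build_judge_prompt` helper, the prompt wording, and the example traces are all hypothetical illustrations, not from the episode:

```python
# Minimal sketch of an LLM-as-judge prompt builder.
# The failure mode, labeled examples, and prompt wording below are
# invented for illustration; swap in your own traces and model client.

def build_judge_prompt(failure_mode, labeled_examples, trace):
    """Build a binary pass/fail judge prompt for one failure mode."""
    lines = [
        f"You are judging an AI assistant's response for: {failure_mode}.",
        "Answer with exactly one word: pass or fail.",
        "",
        "Labeled examples:",
    ]
    for ex in labeled_examples:  # seed the judge with human labels
        lines.append(f"Trace: {ex['trace']}\nLabel: {ex['label']}")
    lines += ["", f"Trace to judge: {trace}", "Label:"]
    return "\n".join(lines)

examples = [
    {"trace": "Quoted rent without checking the listing.", "label": "fail"},
    {"trace": "Confirmed tour time from the calendar tool.", "label": "pass"},
]
prompt = build_judge_prompt(
    "hallucinated property details", examples,
    "Told the tenant parking is free.",
)
# `prompt` would then be sent to the model of your choice.
```

In practice you would maintain one such prompt per failure mode surfaced by error analysis, which is why a handful (4-7) usually suffices.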

Notable Quotes

And vibe checks are good, and you should do vibe checks initially, but they can become very unmanageable very fast, because as your application grows it's really hard to rely on vibe checks. You just feel lost. And so evals help you create metrics that you can use to measure how your application is doing, and give you a way to improve your application with confidence, because you have a feedback signal to iterate against.

AI & Machine Learning, Data & Analytics
00:07:11

So maybe it's worth talking about what axial codes are, or what the point is here. You have a mess of open codes, but you don't have 100 distinct problems. Actually, many of them are repeats that you phrased differently, which is why you shouldn't try to create your taxonomy of failures while you're open coding. You just want to get down what's wrong and then organize: "Okay, what's the most common failure mode?"

AI & Machine Learning, Engineering
00:33:54
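The bookkeeping this quote describes (consolidating duplicate open codes into failure-mode categories, then counting them) can be sketched in a few lines. The open codes and the category mapping below are invented for illustration; in practice an LLM proposes the clusters and classifies future traces into them:

```python
from collections import Counter

# Hypothetical open codes: one free-form note per trace.
open_codes = [
    "quoted wrong rent",
    "made up the rent price",
    "ignored the tenant's actual question",
    "invented a move-in special",
]

# Hypothetical axial coding: open code -> failure-mode category
# (this clustering is the step an LLM would do for you).
axial_assignment = {
    "quoted wrong rent": "hallucinated details",
    "made up the rent price": "hallucinated details",
    "invented a move-in special": "hallucinated details",
    "ignored the tenant's actual question": "ignored user intent",
}

# Counting categories reveals the most common failure mode.
counts = Counter(axial_assignment[code] for code in open_codes)
most_common, n = counts.most_common(1)[0]
```

Here three of the four notes collapse into one "hallucinated details" bucket, which is exactly the signal axial coding is meant to surface.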

And so this is a technique that's been used to analyze stochastic systems for ages; we're just taking the same machine learning ideas and principles and bringing them in here, because again, these are stochastic systems.

AI & Machine Learning, Career & Personal Growth
00:38:45