Chapter 16: Manual evaluation for models and apps

Overview

We recommend always starting with manual evaluation, in which human graders score generated outputs.

When mitigating a specific risk, keep manually checking progress against a small dataset until you no longer observe evidence of that risk, and only then move on to automated evaluation.
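The loop described above can be tracked with a few lines of code. The following is a minimal sketch, not part of Azure AI Foundry: the `ManualGrade` record, the `risk_rate` and `ready_to_automate` helpers, and the `clean_rounds_required` threshold are all hypothetical names chosen for illustration. It assumes each mitigation round yields one human verdict per example in the small dataset.

```python
from dataclasses import dataclass


@dataclass
class ManualGrade:
    """One human verdict on one generated output (hypothetical record type)."""
    example_id: str
    risk_observed: bool  # True if the grader saw the risk in this output


def risk_rate(grades):
    """Return the fraction of graded outputs in which the risk was observed."""
    if not grades:
        raise ValueError("no grades recorded for this round")
    return sum(g.risk_observed for g in grades) / len(grades)


def ready_to_automate(rounds, clean_rounds_required=1):
    """Return True when the most recent rounds show no evidence of the risk.

    `clean_rounds_required` is an assumption of this sketch: requiring more
    than one clean round is a stricter reading of "no longer observed".
    """
    if len(rounds) < clean_rounds_required:
        return False
    recent = rounds[-clean_rounds_required:]
    return all(risk_rate(r) == 0.0 for r in recent)
```

Each element of `rounds` is the list of grades collected after one mitigation attempt, oldest first. Asking for two consecutive clean rounds instead of one guards against a single lucky pass on a small dataset.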

Azure AI Foundry provides an easy no-code interface for developers or domain experts to grade model outputs.

The Value of Manual Evaluation

Human judgment remains critical for nuanced quality assessment:

  • Build Intuition: Understand failure modes before automating
  • Catch Edge Cases: Identify issues automated metrics might miss
  • Domain Expertise: Leverage subject matter expert knowledge
  • Iterative Refinement: Quickly test mitigations on small datasets
  • Benchmark Creation: Generate ground truth for automated metrics

Azure AI Foundry Manual Evaluation

Azure AI Foundry provides the following tools for manual evaluation:

  • No-Code Interface: Developers and domain experts can grade outputs without coding
  • Annotation Workflows: Structured evaluation with custom criteria
  • Team Collaboration: Multiple graders can assess the same outputs
  • Export and Analysis: Results integrate with automated evaluation pipelines
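When multiple graders assess the same outputs, it can help to check how well they agree before treating their grades as ground truth. The sketch below computes Cohen's kappa for two graders on exported labels. It is an illustration only: it assumes the grades have already been exported as two parallel lists of labels, and `cohen_kappa` is a hypothetical helper, not an Azure AI Foundry function.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two graders who labeled the same outputs in order.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of outputs given the same label by both graders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both graders pick the same label if each
    # labeled independently at their own observed label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both graders used a single identical label for every output.
        return 1.0
    return (observed - expected) / (1.0 - expected)


grader_1 = ["pass", "pass", "fail", "fail"]
grader_2 = ["pass", "fail", "fail", "fail"]
kappa = cohen_kappa(grader_1, grader_2)  # 0.5 for these example labels
```

A low kappa usually suggests the grading criteria are ambiguous and worth tightening before the grades are used to calibrate automated metrics.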

Start manual, scale automated. Manual evaluation builds the understanding you need to design effective automated evaluation strategies.
