
Evaluation Overview

Evaluation is a critical part of developing and deploying LLM applications. Teams typically combine several evaluation methods to score the performance of their AI application, depending on the use case and the stage of the development process.

Watch this walkthrough of Langfuse Evaluation and how to use it to improve your LLM application.

Why use LLM Evaluation?

LLM evaluation is crucial for improving the accuracy and robustness of language models, ultimately enhancing the user experience and trust in your AI application. Here are the key benefits:

  • Quality Assurance: Detect hallucinations, factual inaccuracies, and inconsistent outputs to ensure your AI app delivers reliable results
  • Performance Monitoring: Measure response quality, relevance, and user satisfaction across different scenarios and edge cases
  • Continuous Improvement: Identify areas for enhancement and track improvements over time through structured evaluation metrics
  • User Trust: Build confidence in your AI application by demonstrating consistent, high-quality outputs through systematic evaluation
  • Risk Mitigation: Catch potential issues before they reach production users, reducing the likelihood of poor user experiences or reputational damage

Online & Offline Evaluation

Offline Evaluation involves

  • Evaluating the application in a controlled setting
  • Typically using curated test Datasets instead of live user queries
  • Running regularly during development (often as part of CI/CD pipelines) to measure improvements and catch regressions
  • Producing repeatable results and clear accuracy metrics, since ground truth is available (see the sketch after this list)
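To make this concrete, here is a minimal Python sketch of an offline evaluation loop over a small, hand-curated dataset with ground truth. The `run_app` callable, the example dataset, and the exact-match metric are all placeholders for your own application and scoring logic, not a prescribed setup.

```python
from typing import Callable

# Small curated test dataset with ground-truth answers.
test_dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2 + 2?", "expected": "4"},
]

def exact_match(output: str, expected: str) -> bool:
    """Simple scoring function; swap in a similarity metric or LLM-as-a-judge as needed."""
    return output.strip().lower() == expected.strip().lower()

def evaluate_offline(run_app: Callable[[str], str]) -> float:
    """Run the application on every dataset item and return accuracy against ground truth."""
    scores = [exact_match(run_app(item["input"]), item["expected"]) for item in test_dataset]
    return sum(scores) / len(scores)

# In a CI/CD pipeline, you could fail the build if accuracy regresses below a threshold:
# assert evaluate_offline(my_app) >= 0.9
```

Because the dataset and metric are fixed, the same run can be repeated on every change, which is what makes offline scores comparable over time.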

Online Evaluation involves

  • Evaluating the application in a live, real-world environment, i.e. during actual usage in production
  • Using Evaluation Methods that track success rates, user satisfaction scores, or other metrics on live traffic
  • Capturing issues and behaviors you might not anticipate in a controlled lab setting
  • Collecting implicit and explicit user feedback, and possibly running shadow tests or A/B tests (see the sketch after this list)
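As one concrete way to capture explicit user feedback on live traffic, the sketch below records a thumbs-up/thumbs-down rating as a score on the production trace that handled the request. It assumes the Langfuse Python SDK with a v2-style `langfuse.score()` method (newer SDK versions expose an equivalent call); the `record_user_feedback` helper and the `user-feedback` score name are illustrative choices, not part of a fixed API.

```python
# Sketch: attaching explicit user feedback to a production trace as a numeric score.
# Assumes a v2-style Langfuse Python SDK `score()` method; adapt to your SDK version.
from typing import Optional

from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment.
langfuse = Langfuse()

def record_user_feedback(trace_id: str, thumbs_up: bool, comment: Optional[str] = None) -> None:
    """Store a user's thumbs-up / thumbs-down rating as a score on the given trace."""
    langfuse.score(
        trace_id=trace_id,
        name="user-feedback",          # illustrative score name
        value=1 if thumbs_up else 0,   # 1 = positive, 0 = negative
        comment=comment,
    )

# Example: a user clicked "thumbs down" on a response served in production.
# record_user_feedback(trace_id="<trace-id-of-that-request>", thumbs_up=False, comment="Answer was off-topic")
```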

In practice, successful evaluation blends online and offline methods: many teams adopt a loop-like approach in which findings from production feed back into their offline test datasets. This way, evaluation is continuous and ever-improving.

Continuous evaluation loop

Adapted from: “How to continuously improve LLM products?”, Evidently
