Model-based Evaluations in Langfuse

Model-based evaluations automate the evaluation of LLM applications integrated with Langfuse. With model-based evaluations, an LLM scores a specific session, trace, or LLM call in Langfuse on criteria such as correctness, toxicity, or hallucinations.

Via Python SDK

You can run model-based evals on data in Langfuse via the Python SDK. This gives you full flexibility to run different eval libraries on your production data and find out which ones work well for your use case. Popular libraries are:

  • OpenAI Evals
  • Langchain Evaluators (Cookbook)
  • RAGAS for RAG applications (Cookbook)
  • UpTrain evals (Cookbook)
  • Whylabs Langkit
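
Whichever library you pick, the workflow has the same shape: fetch traces, let a judge score each one, and write the result back as a score. The sketch below illustrates that loop only. It does not use the Langfuse SDK or any of the libraries above. Trace, Score, toy_toxicity_judge, and evaluate_traces are hypothetical names introduced for illustration, and the keyword-matching judge is a stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Trace:
    """Hypothetical, simplified view of a trace pulled from Langfuse."""
    id: str
    input: str
    output: str


@dataclass
class Score:
    """Hypothetical score record attached to a trace."""
    trace_id: str
    name: str
    value: float
    comment: str = ""


def toy_toxicity_judge(trace: Trace) -> Tuple[float, str]:
    """Stand-in for an LLM judge: flags outputs containing blocklisted words."""
    blocklist = {"idiot", "stupid"}
    hits = sorted(word for word in blocklist if word in trace.output.lower())
    if hits:
        return 1.0, "matched: " + ", ".join(hits)
    return 0.0, "no match"


def evaluate_traces(
    traces: Iterable[Trace],
    judge: Callable[[Trace], Tuple[float, str]],
    name: str,
    record: Callable[[Score], None],
) -> List[Score]:
    """Score every trace with `judge` and pass each result to `record`."""
    scores = []
    for trace in traces:
        value, comment = judge(trace)
        score = Score(trace_id=trace.id, name=name, value=value, comment=comment)
        record(score)  # assumption: in practice this would write the score back to Langfuse
        scores.append(score)
    return scores


if __name__ == "__main__":
    sample = [
        Trace(id="t1", input="hi", output="You are an idiot"),
        Trace(id="t2", input="hi", output="Hello there"),
    ]
    for score in evaluate_traces(sample, toy_toxicity_judge, "toxicity", print):
        pass  # each score was already printed by the `record` callback
```

In a real setup, the judge would wrap one of the eval libraries listed above, and the record callback would call the Langfuse Python SDK's scoring method. Check the SDK reference for the exact call and its parameters.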

Via Langfuse UI

Coming soon: a Langfuse evaluation service to run model-based evals directly from the Langfuse UI/Server. Ping us if you are interested in joining the beta.
