
Custom Scores via API/SDKs

Where is this feature available?
  • Hobby
  • Pro
  • Team
  • Self Hosted

Langfuse gives you full flexibility to ingest custom scores via the Langfuse SDKs or API. The scoring workflow allows you to run custom quality checks on the output of your workflows at runtime, or to run custom human evaluation workflows.

Common use cases:

  • Collecting user feedback (example): collect feedback from your users on application quality or performance. Feedback can be captured in the frontend via our Browser SDK.
  • Custom evaluation data pipeline (example): continuously monitor the quality by fetching traces from Langfuse, running custom evaluations, and ingesting scores back into Langfuse.
  • Guardrails and security checks (example): check if output contains a certain keyword, adheres to a specified structure/format or if the output is longer than a certain length.
  • Custom internal workflow tooling: build custom internal tooling that helps you manage human-in-the-loop workflows. Ingest scores back into Langfuse, optionally following your custom schema by referencing a config.
  • Custom run-time evaluations: e.g. track whether the generated SQL code actually worked, or if the structured output was valid JSON.
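As a rough illustration of the guardrail and run-time evaluation use cases above, the following minimal sketch turns three simple checks into boolean scores. The function names, score names, and dictionary layout are hypothetical and chosen for illustration only; they are not part of the Langfuse SDK.

```python
import json


def is_valid_json(output: str) -> bool:
    """Return True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def guardrail_scores(output: str, banned_keyword: str, max_length: int) -> list[dict]:
    """Run three simple checks and return one boolean score (1 or 0) per check."""
    checks = {
        "valid_json": is_valid_json(output),
        "no_banned_keyword": banned_keyword.lower() not in output.lower(),
        "within_max_length": len(output) <= max_length,
    }
    # Hypothetical score layout: the keys mirror the keyword arguments of the
    # Python SDK call shown later on this page (name, value, data_type).
    return [
        {"name": name, "value": 1 if passed else 0, "data_type": "BOOLEAN"}
        for name, passed in checks.items()
    ]


scores = guardrail_scores('{"answer": 42}', banned_keyword="password", max_length=200)
for score in scores:
    print(score)
```

Because the keys match the SDK's keyword arguments, each dictionary could plausibly be passed on together with a trace ID, for example `langfuse.score(trace_id=..., **score)`.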

How to add scores

You can add scores via the Langfuse SDKs or API. Scores can take one of three data types:

  • Numeric: used to record scores that fall into a numerical range
  • Categorical: used to record string score values
  • Boolean: used to record binary score values

SDK ingestion examples by data type

Numeric score values must be provided as float.

Python SDK example

langfuse.score(
    id="unique_id", # optional, can be used as an idempotency key to update the score subsequently
    trace_id=message.trace_id,
    observation_id=message.generation_id, # optional
    name="correctness",
    value=0.9,
    data_type="NUMERIC", # optional, inferred if not provided
    comment="Factually correct", # optional
)

JavaScript/TypeScript SDK example

await langfuse.score({
  id: "unique_id", // optional, can be used as an idempotency key to update the score subsequently
  traceId: message.traceId,
  observationId: message.generationId, // optional
  name: "correctness",
  value: 0.9,
  dataType: "NUMERIC", // optional, inferred if not provided
  comment: "Factually correct", // optional
});

→ More details in Python SDK docs and JS/TS SDK docs. See API reference for more details on POST/GET score configs endpoints.

How to ensure your scores comply with a certain schema

If your scores must follow a specific schema, such as a value range, name, or data type, you can define a score configuration (config) and reference it on your scores. Configs are helpful when you want to standardize your scores for future analysis. They can be defined in the Langfuse UI or via our API.

Whenever you provide a config, the score data will be validated against the config. The following rules apply:

  • Score Name: Must equal the config’s name
  • Score Data Type: When provided, must match the config’s data type
  • Score Value: Must match the config’s data type and be within the config’s value range:
    • Numeric: Value must be within the min and max values defined in the config. Both are optional; if omitted, they are assumed to be -∞ and +∞ respectively
    • Categorical: Value must map to one of the categories defined in the config
    • Boolean: Value must equal 0 or 1
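The rules above can be sketched as a local pre-check before ingestion. This is only an illustration of the listed rules, not Langfuse's actual validation code; `ScoreConfig` and `validate_score` are hypothetical names, and categories are simplified to a plain list of labels.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScoreConfig:
    # Hypothetical local mirror of a score config; not a Langfuse SDK class.
    name: str
    data_type: str  # "NUMERIC", "CATEGORICAL" or "BOOLEAN"
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    categories: list = field(default_factory=list)  # simplified: category labels


def validate_score(name, value, config, data_type=None) -> list:
    """Return a list of rule violations; an empty list means the score is valid."""
    errors = []
    if name != config.name:
        errors.append("name must equal the config's name")
    if data_type is not None and data_type != config.data_type:
        errors.append("data type must match the config's data type")

    if config.data_type == "NUMERIC":
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append("value must be numeric")
        else:
            # Missing bounds are treated as -inf and +inf respectively.
            low = config.min_value if config.min_value is not None else float("-inf")
            high = config.max_value if config.max_value is not None else float("inf")
            if not low <= value <= high:
                errors.append("value must be within the config's min and max")
    elif config.data_type == "CATEGORICAL":
        if value not in config.categories:
            errors.append("value must map to one of the config's categories")
    elif config.data_type == "BOOLEAN":
        if value not in (0, 1):
            errors.append("value must equal 0 or 1")
    return errors


accuracy = ScoreConfig(name="accuracy", data_type="NUMERIC", min_value=0.0, max_value=1.0)
print(validate_score("accuracy", 0.9, accuracy))
print(validate_score("accuracy", 1.5, accuracy))
```

Returning a list of violations rather than raising on the first one makes it easier to report every problem with a score at once.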

Score ingestion referencing configs via SDK

When ingesting numeric scores, you can provide the value as a float. If you provide a configId, the score value will be validated against the config’s numeric range, which might be defined by a minimum and/or maximum value.

Python SDK example

langfuse.score(
    trace_id=message.trace_id,
    observation_id=message.generation_id, # optional
    name="accuracy",
    value=0.9,
    comment="Factually correct", # optional
    id="unique_id", # optional, can be used as an idempotency key to update the score subsequently
    config_id="78545-6565-3453654-43543", # optional, to ensure that the score follows a specific min/max value range
    data_type="NUMERIC", # optional, possibly inferred
)

JavaScript/TypeScript SDK example

await langfuse.score({
  traceId: message.traceId,
  observationId: message.generationId, // optional
  name: "accuracy",
  value: 0.9,
  comment: "Factually correct", // optional
  id: "unique_id", // optional, can be used as an idempotency key to update the score subsequently
  configId: "78545-6565-3453654-43543", // optional, to ensure that the score follows a specific min/max value range
  dataType: "NUMERIC", // optional, possibly inferred
});

→ More details in Python SDK docs and JS/TS SDK docs. See API reference for more details on POST/GET score configs endpoints.

Creating Score Config object in Langfuse

A score config includes the desired score name, data type, and constraints on the score value range, such as min and max values for numerical data types and custom categories for categorical data types. See API reference for more details on POST/GET score configs endpoints. Configs ensure that scores comply with a specific schema, which standardizes them for future analysis.

| Attribute | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier of the score config. |
| name | string | Name of the score config, e.g. user_feedback, hallucination_eval |
| dataType | string | Can be either NUMERIC, CATEGORICAL or BOOLEAN |
| isArchived | boolean | Whether the score config is archived. Defaults to false |
| minValue | number | Optional: Sets minimum value for numerical scores. If not set, the minimum value defaults to -∞ |
| maxValue | number | Optional: Sets maximum value for numerical scores. If not set, the maximum value defaults to +∞ |
| categories | list | Optional: Defines categories for categorical scores. List of objects with label value pairs |
| description | string | Optional: Provides further description of the score configuration |
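To make the attributes above concrete, here is a minimal sketch that assembles a config payload as a plain dictionary. The helper `build_score_config` is hypothetical, and the exact shape of each category object (`label` and `value` keys) is an assumption based on the "label value pairs" description; check the API reference before sending such a payload.

```python
from typing import Optional

ALLOWED_DATA_TYPES = {"NUMERIC", "CATEGORICAL", "BOOLEAN"}


def build_score_config(
    name: str,
    data_type: str,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
    categories: Optional[dict] = None,
    description: Optional[str] = None,
) -> dict:
    """Assemble a score config payload using the attribute names from the table."""
    if data_type not in ALLOWED_DATA_TYPES:
        raise ValueError(f"dataType must be one of {sorted(ALLOWED_DATA_TYPES)}")

    payload = {"name": name, "dataType": data_type}
    if data_type == "NUMERIC":
        # Omitted bounds are left out entirely, so they default to -inf / +inf.
        if min_value is not None:
            payload["minValue"] = min_value
        if max_value is not None:
            payload["maxValue"] = max_value
    if data_type == "CATEGORICAL" and categories:
        # Assumed shape: one object per category with a label and a numeric value.
        payload["categories"] = [
            {"label": label, "value": value} for label, value in categories.items()
        ]
    if description:
        payload["description"] = description
    return payload


print(build_score_config("hallucination_eval", "NUMERIC", min_value=0, max_value=1))
print(build_score_config("user_feedback", "CATEGORICAL", categories={"bad": 0, "good": 1}))
```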

Detailed Score Ingestion Examples

Certain score properties may be inferred from your input. If you don’t provide a score data type, it is always inferred. See tables below for details.

For boolean and categorical scores, we will provide the score value in both numerical and string format where possible. The format that was not provided as input, i.e. the translated value, is referred to as the inferred value in the tables below. On read, boolean scores return both the numerical and the string representation of the score value, e.g. both 1 and True. For categorical scores, the string representation is always provided, and a numerical mapping of the category is produced only if a score config was provided.

For example, let’s assume you’d like to ingest a numeric score to measure accuracy. We have included a table of possible score ingestion scenarios below.

| Value | Data Type | Config Id | Description | Inferred Data Type | Valid |
| --- | --- | --- | --- | --- | --- |
| 0.9 | Null | Null | Data type is inferred | NUMERIC | Yes |
| 0.9 | NUMERIC | Null | No properties inferred | | Yes |
| depth | NUMERIC | Null | Error: data type of value does not match provided data type | | No |
| 0.9 | NUMERIC | 78545 | No properties inferred | | Conditional on config validation |
| 0.9 | Null | 78545 | Data type inferred | NUMERIC | Conditional on config validation |
| depth | NUMERIC | 78545 | Error: data type of value does not match provided data type | | No |
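The rows without a config in the table above can be approximated with a small helper. This is a sketch of the described behavior, not Langfuse's implementation: `resolve_data_type` is a hypothetical name, and mapping string values to CATEGORICAL is an assumption based on the data type descriptions earlier on this page. Rows that reference a config additionally depend on config validation, which is not modeled here.

```python
def resolve_data_type(value, data_type=None):
    """Return (data_type, was_inferred), or raise ValueError on a mismatch."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)

    if data_type is None:
        if is_number:
            return "NUMERIC", True
        if isinstance(value, str):
            # Assumption: string values are treated as categorical scores.
            return "CATEGORICAL", True
        raise ValueError("cannot infer a data type for this value")

    if data_type == "NUMERIC" and not is_number:
        raise ValueError("data type of value does not match provided data type")
    return data_type, False


print(resolve_data_type(0.9))             # data type is inferred
print(resolve_data_type(0.9, "NUMERIC"))  # no properties inferred
try:
    resolve_data_type("depth", "NUMERIC")
except ValueError as error:
    print(f"Error: {error}")
```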

Data pipeline example

You can run custom evaluations on data in Langfuse by fetching traces from Langfuse (e.g. via the Python SDK) and then adding evaluation results as scores back to the traces in Langfuse.

The example notebook is a good template to get started with building your own evaluation pipeline.
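The fetch, evaluate, and score loop described above might look roughly like the sketch below. To keep it self-contained, traces are plain dictionaries and the scoring call is injected as a callable; the dictionary keys (`id`, `output`), the function names, and the toy evaluator are all assumptions for illustration. In a real pipeline the traces would come from Langfuse and `add_score` would typically be `langfuse.score`.

```python
from typing import Callable, Iterable


def evaluate_traces(
    traces: Iterable[dict],
    evaluate: Callable[[str], float],
    add_score: Callable[..., None],
    score_name: str = "custom_eval",
) -> int:
    """Score every trace that has an output and return how many were scored."""
    scored = 0
    for trace in traces:
        output = trace.get("output")
        if output is None:
            continue  # nothing to evaluate for this trace
        add_score(
            trace_id=trace["id"],
            name=score_name,
            value=float(evaluate(output)),  # numeric scores are provided as float
        )
        scored += 1
    return scored


def length_ratio(output: str) -> float:
    """Toy evaluator: output length relative to 100 characters, capped at 1.0."""
    return min(len(output) / 100, 1.0)


# Stand-in for the SDK call: record the keyword arguments instead of sending them.
recorded = []
traces = [{"id": "trace-1", "output": "Hello"}, {"id": "trace-2", "output": None}]
count = evaluate_traces(traces, length_ratio, lambda **kwargs: recorded.append(kwargs))
print(count, recorded)
```

Injecting the scoring call keeps the evaluation logic testable without network access or credentials.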
