Custom Scores via API/SDKs

Custom Scores are the most flexible way to implement evaluation workflows using Langfuse. As with any other evaluation method, the purpose of custom scores is to assign evaluation metrics to Traces, Observations, Sessions, or DatasetRuns via the Score object (see Scores Data Model).

This is achieved by ingesting scores via the Langfuse SDKs or API.

Common Use Cases

  • Collecting user feedback: collect in-app feedback from your users on application quality or performance. Feedback can be captured in the frontend via our Browser SDK. -> Example Notebook

  • Custom evaluation data pipeline: continuously monitor the quality by fetching traces from Langfuse, running custom evaluations, and ingesting scores back into Langfuse. -> Example Notebook

  • Guardrails and security checks: check whether the output contains a certain keyword, adheres to a specified structure/format, or exceeds a certain length. -> Example Notebook

  • Custom internal workflow tooling: build custom internal tooling that helps you manage human-in-the-loop workflows. Ingest scores back into Langfuse, optionally following your custom schema by referencing a config.

  • Custom run-time evaluations: e.g. track whether the generated SQL code actually worked, or if the structured output was valid JSON.
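As a minimal sketch of the guardrail and run-time evaluation use cases, the helpers below turn two simple checks (valid JSON, maximum length) into score fields. The function names, score names, and the `BOOLEAN` data type string are assumptions chosen for illustration; the resulting fields are meant to be passed to a scoring call such as `create_score`, described in the next section.

```python
import json


def json_validity_score(output: str) -> dict:
    """Return score fields recording whether `output` parses as JSON."""
    try:
        json.loads(output)
        valid = True
    except json.JSONDecodeError:
        valid = False
    return {
        "name": "valid_json",  # hypothetical score name
        "value": 1 if valid else 0,  # boolean scores use 0 or 1
        "data_type": "BOOLEAN",  # assumed identifier, by analogy with "NUMERIC"
        "comment": "Output parsed as JSON" if valid else "Output is not valid JSON",
    }


def max_length_score(output: str, max_chars: int) -> dict:
    """Return score fields recording whether `output` stays within `max_chars`."""
    within = len(output) <= max_chars
    return {
        "name": "within_length_limit",  # hypothetical score name
        "value": 1 if within else 0,
        "data_type": "BOOLEAN",
        "comment": f"{len(output)} characters, limit {max_chars}",
    }


# With a configured client, the fields could be ingested like this:
# langfuse.create_score(trace_id="trace_id_here", **json_validity_score(output))
```

Keeping the check separate from the ingestion call makes the evaluation logic easy to unit test without a Langfuse connection.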

Ingesting Scores

You can add scores via the Langfuse SDKs or API. Scores can take one of three data types: Numeric, Categorical or Boolean.

The example below covers the Numeric data type. Numeric score values must be provided as float.

from langfuse import get_client
langfuse = get_client()
 
# Method 1: Score via low-level method
langfuse.create_score(
    name="correctness",
    value=0.9,
    trace_id="trace_id_here",
    observation_id="observation_id_here", # optional
    data_type="NUMERIC", # optional, inferred if not provided
    comment="Factually correct", # optional
)
 
# Method 2: Score current span/generation (within context)
with langfuse.start_as_current_span(name="my-operation") as span:
    # Score the current span
    span.score(
        name="correctness",
        value=0.9,
        data_type="NUMERIC",
        comment="Factually correct"
    )
 
    # Score the trace
    span.score_trace(
        name="overall_quality",
        value=0.95,
        data_type="NUMERIC"
    )
 
 
# Method 3: Score via the current context
with langfuse.start_as_current_span(name="my-operation"):
    # Score the current span
    langfuse.score_current_span(
        name="correctness",
        value=0.9,
        data_type="NUMERIC",
        comment="Factually correct"
    )
 
    # Score the trace
    langfuse.score_current_trace(
        name="overall_quality",
        value=0.95,
        data_type="NUMERIC"
    )

→ More details in Python SDK docs and JS/TS SDK docs. See API reference for more details on POST/GET score configs endpoints.
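The example above covers numeric scores only. The sketch below illustrates how values for all three data types might be shaped before ingestion: a float for numeric, a string for categorical, and 0 or 1 for boolean. The helper `score_kwargs` is hypothetical and not part of the SDK, and the `CATEGORICAL` and `BOOLEAN` identifiers are assumed to follow the same uppercase convention as `NUMERIC`.

```python
def score_kwargs(name, value, data_type, trace_id, comment=None):
    """Build keyword arguments for a score, checking the value shape per data type.

    Hypothetical helper for illustration; it only prepares arguments locally.
    """
    if data_type == "NUMERIC":
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("NUMERIC scores need a number")
        value = float(value)
    elif data_type == "CATEGORICAL":
        if not isinstance(value, str):
            raise TypeError("CATEGORICAL scores need a string category")
    elif data_type == "BOOLEAN":
        if value not in (0, 1):
            raise ValueError("BOOLEAN scores need 0 or 1")
        value = int(value)  # normalizes True/False to 1/0
    else:
        raise ValueError(f"Unknown data type: {data_type}")

    kwargs = {"name": name, "value": value, "data_type": data_type, "trace_id": trace_id}
    if comment is not None:
        kwargs["comment"] = comment
    return kwargs


numeric = score_kwargs("correctness", 0.9, "NUMERIC", "trace_id_here")
categorical = score_kwargs("tone", "formal", "CATEGORICAL", "trace_id_here")
boolean = score_kwargs("helpful", True, "BOOLEAN", "trace_id_here")

# With a configured client: langfuse.create_score(**categorical)
```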

Preventing Duplicate Scores

By default, Langfuse allows for multiple scores of the same name on the same trace. This is useful if you’d like to track the evolution of a score over time or if e.g. you’ve received multiple user feedback scores on the same trace.

In some cases, you want to prevent this behavior or update an existing score. To do so, create an idempotency key for the score and pass it as the id when creating the score, e.g. <trace_id>-<score_name>.
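A deterministic key of the form `<trace_id>-<score_name>` can be derived with a small helper. This is only a sketch: `make_score_id` is a hypothetical name, and the key would be passed as `score_id` in the Python SDK call.

```python
def make_score_id(trace_id: str, score_name: str) -> str:
    """Derive a stable idempotency key so repeated ingestion targets the same score."""
    return f"{trace_id}-{score_name}"


score_id = make_score_id("trace_id_here", "correctness")

# With a configured client, re-running this call would reuse the same id:
# langfuse.create_score(
#     score_id=score_id,
#     trace_id="trace_id_here",
#     name="correctness",
#     value=0.9,
# )
```

Because the key depends only on the trace and the score name, this scheme permits one score per name and trace. If you need one score per observation, the key would have to include the observation id as well.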

Enforcing a Score Config

Score configs are helpful when you want to standardize your scores for future analysis.

To enforce a score config, you can provide a configId when creating a score to reference a ScoreConfig that was previously created. Score Configs can be defined in the Langfuse UI or via our API. See our guide on how to create and manage score configs.

Whenever you provide a ScoreConfig, the score data will be validated against the config. The following rules apply:

  • Score Name: Must equal the config’s name
  • Score Data Type: When provided, must match the config’s data type
  • Score Value when Type is numeric: Value must be within the min and max values defined in the config. Both bounds are optional; a missing min is treated as -∞ and a missing max as +∞.
  • Score Value when Type is categorical: Value must map to one of the categories defined in the config
  • Score Value when Type is boolean: Value must equal 0 or 1
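The rules above can be mirrored in a local pre-check, which may be useful for catching invalid scores before they are sent. The sketch below is not the Langfuse implementation: `LocalScoreConfig` and `validation_errors` are hypothetical names, and the numeric bounds are assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalScoreConfig:
    """Hypothetical local stand-in for a score config (not an SDK class)."""
    name: str
    data_type: str  # "NUMERIC", "CATEGORICAL" or "BOOLEAN"
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    categories: tuple = ()


def validation_errors(name, value, config, data_type=None) -> list:
    """Return a list of rule violations; an empty list means the score passes."""
    errors = []
    if name != config.name:
        errors.append("score name must equal the config's name")
    if data_type is not None and data_type != config.data_type:
        errors.append("data type must match the config's data type")

    if config.data_type == "NUMERIC":
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append("numeric value required")
        else:
            # Missing bounds default to -inf / +inf
            low = float("-inf") if config.min_value is None else config.min_value
            high = float("inf") if config.max_value is None else config.max_value
            if not low <= value <= high:  # inclusive bounds are an assumption
                errors.append("value outside the config's min/max range")
    elif config.data_type == "CATEGORICAL":
        if value not in config.categories:
            errors.append("value must map to one of the config's categories")
    elif config.data_type == "BOOLEAN":
        if value not in (0, 1):
            errors.append("boolean value must equal 0 or 1")
    return errors


accuracy = LocalScoreConfig(name="accuracy", data_type="NUMERIC", min_value=0.0, max_value=1.0)
print(validation_errors("accuracy", 0.9, accuracy))
print(validation_errors("accuracy", 1.5, accuracy))
```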

When ingesting numeric scores, you can provide the value as a float. If you provide a configId, the score value will be validated against the config’s numeric range, which might be defined by a minimum and/or maximum value.

from langfuse import get_client
langfuse = get_client()
 
# Method 1: Score via low-level method
langfuse.create_score(
    trace_id="trace_id_here",
    observation_id="observation_id_here", # optional
    name="accuracy",
    value=0.9,
    comment="Factually correct", # optional
    score_id="unique_id", # optional, can be used as an idempotency key to update the score subsequently
    config_id="78545-6565-3453654-43543", # optional, to ensure that the score follows a specific min/max value range
    data_type="NUMERIC" # optional, possibly inferred
)
 
# Method 2: Score within context
with langfuse.start_as_current_span(name="my-operation") as span:
    span.score(
        name="accuracy",
        value=0.9,
        comment="Factually correct",
        config_id="78545-6565-3453654-43543",
        data_type="NUMERIC"
    )

→ More details in Python SDK docs and JS/TS SDK docs. See API reference for more details on POST/GET score configs endpoints.

Inferred Score Properties

Certain score properties might be inferred based on your input:

  • If you don’t provide a score data type, it will always be inferred. See tables below for details.
  • For boolean and categorical scores, we will provide the score value in both numerical and string format where possible. The format that was not provided as input, i.e. the translated value, is referred to as the inferred value in the tables below.
  • On read, boolean scores return both the numerical and the string representation of the score value, e.g. both 1 and True.
  • For categorical scores, the string representation is always provided, and a numerical mapping of the category is produced only if a ScoreConfig was provided.

Detailed Examples:

For example, let’s assume you’d like to ingest a numeric score to measure accuracy. We have included a table of possible score ingestion scenarios below.

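Value | Data Type | Config Id | Description                                                 | Inferred Data Type | Valid
------|-----------|-----------|-------------------------------------------------------------|--------------------|---------------------------------
0.9   | Null      | Null      | Data type is inferred                                       | NUMERIC            | Yes
0.9   | NUMERIC   | Null      | No properties inferred                                      | -                  | Yes
depth | NUMERIC   | Null      | Error: data type of value does not match provided data type | -                  | No
0.9   | NUMERIC   | 78545     | No properties inferred                                      | -                  | Conditional on config validation
0.9   | Null      | 78545     | Data type inferred                                          | NUMERIC            | Conditional on config validation
depth | NUMERIC   | 78545     | Error: data type of value does not match provided data type | -                  | No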
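The six numeric scenarios can be expressed as a small decision function. This is a simplified sketch rather than Langfuse's actual inference logic: `numeric_scenario` is a hypothetical name, it infers the type from the value alone, and it deliberately rejects inputs that the table does not cover.

```python
def numeric_scenario(value, data_type=None, config_id=None):
    """Return (inferred_data_type, valid) for the numeric ingestion scenarios.

    Only the cases listed in the table are handled; anything else raises.
    """
    if data_type not in (None, "NUMERIC"):
        raise ValueError("Only numeric scenarios are covered")

    # bool is a subclass of int in Python, so exclude it explicitly
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)

    if data_type == "NUMERIC" and not is_number:
        # Value does not match the provided data type
        return None, "No"
    if data_type is None and not is_number:
        raise ValueError("Scenario not covered by the numeric table")

    inferred = "NUMERIC" if data_type is None else None
    valid = "Conditional on config validation" if config_id is not None else "Yes"
    return inferred, valid


print(numeric_scenario(0.9))
print(numeric_scenario("depth", data_type="NUMERIC"))
print(numeric_scenario(0.9, config_id="78545"))
```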

Update Existing Scores

When creating a score, you can provide an optional id parameter. This will update the score if it already exists within your project.

If you want to update a score without needing to fetch the list of existing scores from Langfuse, you can set your own id parameter when initially creating the score.
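To illustrate the update semantics, the sketch below models them with an in-memory store: writing with an existing id overwrites the score, while omitting the id always creates a new one. `InMemoryScoreStore` is a hypothetical stand-in, not part of the SDK, and the generated id for scores without one is an assumption about the behavior.

```python
import uuid


class InMemoryScoreStore:
    """Hypothetical local model of create-or-update by score id."""

    def __init__(self):
        self._scores = {}

    def create_score(self, *, name, value, trace_id, score_id=None):
        # Without an explicit id a fresh one is generated, so nothing is overwritten
        key = score_id if score_id is not None else uuid.uuid4().hex
        self._scores[key] = {"name": name, "value": value, "trace_id": trace_id}
        return key

    def get(self, score_id):
        return self._scores[score_id]

    def __len__(self):
        return len(self._scores)


store = InMemoryScoreStore()

# Same id twice: the second call updates the first score
store.create_score(name="accuracy", value=0.7, trace_id="t1", score_id="t1-accuracy")
store.create_score(name="accuracy", value=0.9, trace_id="t1", score_id="t1-accuracy")

# No id: each call adds a separate score
store.create_score(name="user_feedback", value=1.0, trace_id="t1")
store.create_score(name="user_feedback", value=0.0, trace_id="t1")

print(len(store), store.get("t1-accuracy")["value"])
```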
