Custom Scores via API/SDKs
- Hobby: Full
- Pro: Full
- Team: Full
- Self Hosted: Full
Langfuse gives you full flexibility to ingest custom scores via the Langfuse SDKs or API. The scoring workflow allows you to run custom quality checks on the output of your workflows at runtime, or to run custom human evaluation workflows.
Common use cases:
- Collecting user feedback: collect feedback from your users on application quality or performance. Feedback can be captured in the frontend via our Browser SDK.
- Custom evaluation data pipeline: continuously monitor quality by fetching traces from Langfuse, running custom evaluations, and ingesting scores back into Langfuse.
- Guardrails and security checks: check whether the output contains a certain keyword, adheres to a specified structure/format, or exceeds a certain length.
- Custom internal workflow tooling: build custom internal tooling that helps you manage human-in-the-loop workflows. Ingest scores back into Langfuse, optionally following your custom schema by referencing a config.
- Custom run-time evaluations: e.g. track whether the generated SQL code actually worked, or if the structured output was valid JSON.
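As a minimal sketch of the guardrail and run-time checks listed above, the helpers below compute score values in plain Python. The function names, the banned keyword, and the length limit are hypothetical; the resulting dictionaries reuse the keyword arguments of `langfuse.score(...)` as they appear in the SDK examples on this page.

```python
import json


def is_valid_json(output: str) -> bool:
    """Return True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except (TypeError, ValueError):
        return False


def guardrail_scores(trace_id: str, output: str, banned_keyword: str, max_length: int) -> list[dict]:
    """Build one boolean score payload (value 0 or 1) per check."""
    checks = {
        "valid_json": is_valid_json(output),
        "no_banned_keyword": banned_keyword.lower() not in output.lower(),
        "within_max_length": len(output) <= max_length,
    }
    return [
        {"trace_id": trace_id, "name": name, "value": int(passed), "data_type": "BOOLEAN"}
        for name, passed in checks.items()
    ]


scores = guardrail_scores("trace-123", '{"answer": 42}', banned_keyword="password", max_length=200)
# Each payload could then be sent with langfuse.score(**payload), assuming an initialized client.
```

Keeping the checks separate from ingestion makes them easy to unit test before any score reaches Langfuse.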
How to add scores
You can add scores via the Langfuse SDKs or API. Scores can take one of three data types:
- Numeric: used to record scores that fall into a numerical range
- Categorical: used to record string score values
- Boolean: used to record binary score values
SDK ingestion examples by data type
Numeric score values must be provided as float.
Python SDK example
langfuse.score(
    id="unique_id",  # optional, can be used as an idempotency key to update the score subsequently
    trace_id=message.trace_id,
    observation_id=message.generation_id,  # optional
    name="correctness",
    value=0.9,
    data_type="NUMERIC",  # optional, inferred if not provided
    comment="Factually correct",  # optional
)
JavaScript/TypeScript SDK example
await langfuse.score({
  id: "unique_id", // optional, can be used as an idempotency key to update the score subsequently
  traceId: message.traceId,
  observationId: message.generationId, // optional
  name: "correctness",
  value: 0.9,
  dataType: "NUMERIC", // optional, inferred if not provided
  comment: "Factually correct", // optional
});
→ More details in Python SDK docs and JS/TS SDK docs. See API reference for more details on POST/GET score configs endpoints.
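The examples above show a numeric score. For the other two data types, the arguments might look roughly as follows. This is a sketch: the score names, values, and trace ID are made up, and the `CATEGORICAL` and `BOOLEAN` strings are taken from the `dataType` values listed for score configs on this page.

```python
# Hypothetical keyword arguments for langfuse.score(...), one per remaining data type.
categorical_score = {
    "trace_id": "trace-123",  # placeholder trace ID
    "name": "tone",
    "value": "friendly",  # categorical scores take string values
    "data_type": "CATEGORICAL",
}

boolean_score = {
    "trace_id": "trace-123",
    "name": "contains_pii",
    "value": 0,  # boolean scores take 0 or 1
    "data_type": "BOOLEAN",
}

for payload in (categorical_score, boolean_score):
    # With an initialized client this would be: langfuse.score(**payload)
    print(payload["name"], payload["data_type"], payload["value"])
```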
How to ensure your scores comply with a certain schema
If your scores must follow a specific schema (for example a value range, name, or data type), you can define a score configuration (config) and reference it on your scores. Configs help you standardize scores for later analysis. They can be defined in the Langfuse UI or via our API.
Whenever you provide a config, the score data will be validated against the config. The following rules apply:
- Score Name: Must equal the config’s name
- Score Data Type: When provided, must match the config’s data type
- Score Value: Must match the config’s data type and be within the config’s value range:
  - Numeric: Value must be within the min and max values defined in the config. Both bounds are optional; if omitted, they are assumed to be -∞ and +∞ respectively.
  - Categorical: Value must map to one of the categories defined in the config
  - Boolean: Value must equal 0 or 1
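To make the rules above concrete, here is a minimal client-side sketch of the same checks. It is not Langfuse's own validation code: the function name is hypothetical, the config is a plain dictionary using the attribute names from the score config table on this page, and matching categorical values against category labels is an assumption.

```python
from typing import Optional, Union


def validate_against_config(
    name: str,
    value: Union[float, str],
    config: dict,
    data_type: Optional[str] = None,
) -> list[str]:
    """Return a list of rule violations; an empty list means the score passes."""
    errors = []
    if name != config["name"]:
        errors.append("score name must equal the config's name")
    if data_type is not None and data_type != config["dataType"]:
        errors.append("score data type must match the config's data type")

    kind = config["dataType"]
    if kind == "NUMERIC":
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append("numeric score value must be a number")
        else:
            low = config.get("minValue")
            high = config.get("maxValue")
            # Missing bounds are treated as -inf and +inf.
            low = float("-inf") if low is None else low
            high = float("inf") if high is None else high
            if not low <= value <= high:
                errors.append("numeric score value is outside the config's range")
    elif kind == "CATEGORICAL":
        # Assumption: a categorical value maps to a category via its label.
        labels = {category["label"] for category in (config.get("categories") or [])}
        if value not in labels:
            errors.append("categorical score value is not a defined category")
    elif kind == "BOOLEAN":
        if value not in (0, 1):
            errors.append("boolean score value must equal 0 or 1")
    return errors


config = {"name": "accuracy", "dataType": "NUMERIC", "minValue": 0, "maxValue": 1}
ok = validate_against_config("accuracy", 0.9, config)
too_high = validate_against_config("accuracy", 1.5, config)
```

Running a check like this before ingestion can surface schema problems early, but the server-side validation remains authoritative.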
Score ingestion referencing configs via SDK
When ingesting numeric scores, you can provide the value as a float. If you provide a configId, the score value will be validated against the config’s numeric range, which might be defined by a minimum and/or maximum value.
Python SDK example
langfuse.score(
    trace_id=message.trace_id,
    observation_id=message.generation_id,  # optional
    name="accuracy",
    value=0.9,
    comment="Factually correct",  # optional
    id="unique_id",  # optional, can be used as an idempotency key to update the score subsequently
    config_id="78545-6565-3453654-43543",  # optional, to ensure that the score follows a specific min/max value range
    data_type="NUMERIC",  # optional, possibly inferred
)
JavaScript/TypeScript SDK example
await langfuse.score({
  traceId: message.traceId,
  observationId: message.generationId, // optional
  name: "accuracy",
  value: 0.9,
  comment: "Factually correct", // optional
  id: "unique_id", // optional, can be used as an idempotency key to update the score subsequently
  configId: "78545-6565-3453654-43543", // optional, to ensure that the score follows a specific min/max value range
  dataType: "NUMERIC", // optional, possibly inferred
});
→ More details in Python SDK docs and JS/TS SDK docs. See API reference for more details on POST/GET score configs endpoints.
Creating Score Config object in Langfuse
A score config includes the desired score name, data type, and constraints on the score value range, such as min and max values for numerical data types and custom categories for categorical data types. See API reference for more details on POST/GET score configs endpoints. Configs ensure that scores comply with a specific schema, which standardizes them for future analysis.
Attribute | Type | Description |
---|---|---|
id | string | Unique identifier of the score config. |
name | string | Name of the score config, e.g. user_feedback, hallucination_eval |
dataType | string | Can be either NUMERIC, CATEGORICAL or BOOLEAN |
isArchived | boolean | Whether the score config is archived. Defaults to false |
minValue | number | Optional: Sets minimum value for numerical scores. If not set, the minimum value defaults to -∞ |
maxValue | number | Optional: Sets maximum value for numerical scores. If not set, the maximum value defaults to +∞ |
categories | list | Optional: Defines categories for categorical scores. List of objects with label/value pairs |
description | string | Optional: Provides further description of the score configuration |
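As an illustration of the attributes in the table above, the following sketch builds two config payloads as plain dictionaries. The concrete names, categories, and bounds are invented. `id` and `isArchived` are left out on the assumption that Langfuse sets them, and numeric category values are likewise an assumption. The exact request schema and endpoint path are not given in this section, so check the API reference before posting anything.

```python
import json

# Hypothetical categorical config: each category is a label/value pair.
score_config = {
    "name": "user_feedback",
    "dataType": "CATEGORICAL",
    "categories": [
        {"label": "negative", "value": 0},
        {"label": "positive", "value": 1},
    ],
    "description": "Thumbs up/down collected in the chat UI",
}

# Hypothetical numeric config with an explicit value range.
numeric_config = {
    "name": "hallucination_eval",
    "dataType": "NUMERIC",
    "minValue": 0,
    "maxValue": 1,
}

# Serialize to JSON, e.g. as the body of a request to the score configs endpoint.
print(json.dumps(score_config, indent=2))
```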
Detailed Score Ingestion Examples
Certain score properties may be inferred from your input. If you don’t provide a score data type, it is always inferred; see the tables below for details. For boolean and categorical scores, Langfuse provides the score value in both numerical and string format where possible. The format that was not provided as input, i.e. the translated value, is referred to as the inferred value in the tables below. On read, boolean scores return both the numerical and the string representation of the value, e.g. both 1 and True. For categorical scores, the string representation is always provided, and a numerical mapping of the category is produced only if a score config was provided.
For example, let’s assume you’d like to ingest a numeric score to measure accuracy. We have included a table of possible score ingestion scenarios below.
Value | Data Type | Config Id | Description | Inferred Data Type | Valid |
---|---|---|---|---|---|
0.9 | Null | Null | Data type is inferred | NUMERIC | Yes |
0.9 | NUMERIC | Null | No properties inferred | | Yes |
depth | NUMERIC | Null | Error: data type of value does not match provided data type | | No |
0.9 | NUMERIC | 78545 | No properties inferred | | Conditional on config validation |
0.9 | Null | 78545 | Data type inferred | NUMERIC | Conditional on config validation |
depth | NUMERIC | 78545 | Error: data type of value does not match provided data type | | No |
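The scenarios in the table above can be restated as a small decision function. This is only a sketch of the table's logic, not how Langfuse implements inference. The function name is hypothetical, and it deliberately refuses inputs the table does not cover.

```python
from typing import Optional, Tuple, Union


def check_numeric_ingestion(
    value: Union[float, str],
    data_type: Optional[str] = None,
    config_id: Optional[str] = None,
) -> Tuple[Optional[str], str]:
    """Return (inferred_data_type, validity) for one row of the table above."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)

    if data_type is None:
        if not is_number:
            raise ValueError("non-numeric values without a data type are outside this table")
        inferred = "NUMERIC"  # data type is inferred from the value
    elif data_type == "NUMERIC":
        if not is_number:
            return None, "No"  # value does not match the provided data type
        inferred = None  # nothing to infer, the data type was provided
    else:
        raise ValueError("this sketch only covers the numeric scenarios")

    # With a config ID, validity additionally depends on the config's rules.
    validity = "Yes" if config_id is None else "Conditional on config validation"
    return inferred, validity


rows = [
    check_numeric_ingestion(0.9),
    check_numeric_ingestion(0.9, "NUMERIC"),
    check_numeric_ingestion("depth", "NUMERIC"),
    check_numeric_ingestion(0.9, "NUMERIC", "78545"),
    check_numeric_ingestion(0.9, None, "78545"),
    check_numeric_ingestion("depth", "NUMERIC", "78545"),
]
```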
Data pipeline example
You can run custom evaluations on data in Langfuse by fetching traces from Langfuse (e.g. via the Python SDK) and then adding evaluation results as scores back to the traces in Langfuse.
The example notebook is a good template to get started with building your own evaluation pipeline.
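The fetch, evaluate, and ingest loop described above can be sketched as follows. To keep the example runnable without a Langfuse project, fetching and ingestion are passed in as callables and replaced by in-memory stand-ins; the function names, the trace dictionary shape, and the conciseness check are all hypothetical.

```python
from typing import Callable, Iterable


def run_evaluation_pipeline(
    fetch_traces: Callable[[], Iterable[dict]],
    evaluate: Callable[[dict], float],
    ingest_score: Callable[..., None],
    score_name: str,
) -> int:
    """Fetch traces, evaluate each one, and ingest the result as a score."""
    count = 0
    for trace in fetch_traces():
        ingest_score(trace_id=trace["id"], name=score_name, value=evaluate(trace))
        count += 1
    return count


# In-memory stand-ins so the sketch runs without a Langfuse project.
fake_traces = [
    {"id": "trace-1", "output": "short answer"},
    {"id": "trace-2", "output": "a much longer answer " * 20},
]
ingested = []


def concise_enough(trace: dict) -> float:
    """Hypothetical evaluation: 1.0 if the output has at most 100 characters."""
    return 1.0 if len(trace["output"]) <= 100 else 0.0


processed = run_evaluation_pipeline(
    fetch_traces=lambda: fake_traces,
    evaluate=concise_enough,
    ingest_score=lambda **kwargs: ingested.append(kwargs),
    score_name="conciseness",
)
```

With a real client, `ingest_score` could be `langfuse.score` as used elsewhere on this page, and `fetch_traces` a thin wrapper around whichever trace-fetching method your SDK version provides.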