Migrate Langfuse v2 to v3
This guide covers a developer preview which is not suitable for production use. v3 is under active development and we plan to ship a production-ready version by the end of November 2024. We share this information to gather feedback from our awesome developer community.
For a production-ready setup, follow the self-hosting guide or consider using Langfuse Cloud maintained by the Langfuse team.
If you want to get started on v3, please follow our v3 Self-Host Guide.
To learn more about our reasons for the architectural changes, jump to the Reasoning section.
SDK compatibility and API changes
While we aim to keep our SDKs and APIs fully backwards compatible, we have to introduce backwards incompatible changes with our update to v3. Certain APIs in SDK versions below version 2.0.0 are not compatible with our new backend architecture. Please upgrade and benefit from many performance improvements or features such as prompt caching.
Upgrade options:
- Default SDK upgrade: Follow the 1.x.x to 2.x.x upgrade path (Python, JavaScript). For the JavaScrupt SDK, consider an upgrade from 2.x.x to 3.x.x as well. The upgrade is straightforward and should not take much time.
- Improved integrations: Since the first major version, we built many new ways to integrate your code with Langfuse such as Decorators for Python. We would recommend to check out our quickstart to see whether there is a more convenient integration available for you.
Background: Langfuse v3 relies on an event driven backend architecture. This means, that we acknowledge HTTP requests from the SDKs, queue the HTTP bodies in the backend, and process them asynchronously. This allows us to scale the backend more easily and handle more requests without overloading the database. The SDKs below 2.0.0 send the events to our server and expect a synchronous response containing the database representation of the event. If you rely on this data and access it in the code, your SDK will break as of Nov. 11th, 2024 for the cloud version and post-upgrade for self-hosted versions.
API Changes
POST /api/public/ingestion
The /api/public/ingestion
endpoint is now asynchronous.
It will accept all events as they come in and queue them for processing before returning a 207 status code.
This means that events will not be available immediately after acceptance by the backend and instead will appear
eventually in subsequent read requests.
The individual events accepted a metadata
property within their body of type string | string[] | Record<string, unknown>
.
Only the Record
type is supported within our UI and endpoints to perform queries and filter events.
Therefore, we enforce an object type for metadata
going forward.
All incoming events with { metadata: string | string[] }
will have their metadata mapped to an object with key metadata
,
i.e. { event: { body: { metadata: "test" } } }
will be transformed to { event: { body: { metadata: { metadata: "test" } } } }
.
POST /api/public/scores
The /api/public/scores
endpoint is now asynchronous.
It behaves exactly as the /api/public/ingestion
endpoint, but will return a 200 status code with a body of { id: string }
type.
Before, the endpoint returned the created score object directly.
This change is inline with our API reference and therefore not considered breaking.
Deprecated endpoints
The following endpoints are deprecated since our v2 release. We continue to accept requests to these endpoints and will remove them in a future release. Their API behavior changes to be asynchronous and the endpoints will only return the id of the created object instead of the full updated record.
- POST /api/public/events
- POST /api/public/generations
- PATCH /api/public/generations
- POST /api/public/spans
- PATCH /api/public/spans
- POST /api/public/traces
Behavioral Changes
Trace Deletion
Deleting traces within the Langfuse UI was a synchronous operation and traces got removed immediately. Going forward, all traces will be scheduled for deletion, but may still be visible in the UI for a short period of time.
Project Deletion
Deleting projects within Langfuse was a synchronous operation and projects got removed immediately. Projects will be marked as deleted within Langfuse v3 and will not be accessible using the standard UI navigation options. We immediately revoke access keys for projects, but all remaining data will be removed in the background. Information will not be deleted from the S3/Blob Store. This action needs to be performed manually by an administrator.
Migration Steps
We tried to make the version upgrade as seamless as possible. If you encounter any issues please reach out to our support team or open an issue on our GitHub repository.
Before you start the upgrade
Before starting your upgrade, make sure you are familiar with the contents of our v3 hosting guide. In addition, we recommend that you perform a backup of your Postgres database before you start the upgrade. Also, ensure that you run a recent version of Langfuse, ideally a version later than v2.92.0.
For a zero-downtime upgrade, we recommend that you provision new instances of the Langfuse web and worker containers and move your traffic after validating that the new instances are working as expected.
Upgrade Steps
1. Provision new infrastructure
Ensure that you deploy all required storage components (Clickhouse, Redis, S3/Blob Store) and have the connection information handy. You can reuse your existing Postgres instance for the new deployment. Ensure that you also have your Postgres connection details ready.
The following new environment variables are required for the new Langfuse web and worker containers. If you do not provide them, the deployment will fail.
CLICKHOUSE_URL
CLICKHOUSE_USER
CLICKHOUSE_PASSWORD
CLICKHOUSE_MIGRATION_URL
REDIS_CONNECTION_STRING
(or its alternatives)LANGFUSE_S3_EVENT_UPLOAD_BUCKET
2. Start new Langfuse containers.
Deploy the Langfuse web and worker containers with the settings from our self-hosting guide. Ensure that you set the environment variables for the new storage components and the Postgres connection details.
At this point, you can start to test the new Langfuse instance. The UI should load as expected, but there should not be any traces, observations, or scores. This is expected, as data is being read from Clickhouse while those elements still reside in Postgres.
3. Shift traffic from v2 to v3
Point the traffic to the new Langfuse instance by updating your DNS records or your Loadbalancer configuration. All new events will be stored in Clickhouse and should appear within the UI within a few seconds of being ingested.
4. Wait for historic data migration to complete
We have introduced background migrations as part of the migration to v3. Those allow Langfuse to schedule longer-running migrations without impacting the availability of the service. As part of the v3 release, we have introduced four migrations that will run once you deploy the new stack.
- Cost backfill: We calculate costs for all events and store them in the Postgres database. Before, those were calculated on read which had a negative impact on read performance.
- Traces migration: We migrate all traces in batches from Postgres to Clickhouse. We start with most recent traces, i.e. those should show within your dashboard soon after starting the migration.
- Observations migration: We migrate all observations in batches from Postgres to Clickhouse. We start with most recent observations, i.e. those should show within your dashboard soon after starting the migration.
- Scores migration: We migrate all scores in batches from Postgres to Clickhouse. We start with most recent scores, i.e. those should show within your dashboard soon after starting the migration.
Each migration has to finish, before the next one starts. Depending on the size of your event tables, this process may take multiple hours.
You need to set LANGFUSE_ENABLE_BACKGROUND_MIGRATIONS=true
to enable background migrations and start the migration of data from Postgres to ClickHouse.
5. Stop the old Langfuse containers
After you have verified that new events are being stored in Clickhouse and are shown in the UI, you can stop the old Langfuse containers.
Reasoning
Langfuse has gained significant traction over the last months, both in our Cloud environment and in self-hosted setups. With Langfuse v3 we introduce changes that allow our backend to handle hundreds of events per second with higher reliability. To achieve this scale, we introduce a second Langfuse container and additional storage services like S3/Blob store, Clickhouse, and Redis. We explain why we chose each element of the stack and why we believe that they help us achieve best in class scale.
Why Clickhouse
We made the strategic decision to migrate our traces, observations, and scores table from Postgres to Clickhouse. Both us and our self-hosters observed bottlenecks in Postgres when dealing with millions of rows of tracing data, both on ingestion and retrieval of information. Our core requirement was a database that could handle massive volumes of trace and event data with exceptional query speed and efficiency while also being available for free to self-hosters.
Limitations of Postgres
Initially, Postgres was an excellent choice due to its robustness, flexibility, and the extensive tooling available. As our platform grew, we encountered performance bottlenecks with complex aggregations and time-series data. The row-based storage model of PostgreSQL becomes increasingly inefficient when dealing with billions of rows of tracing data, leading to slow query times and high resource consumption.
Our requirements
- Analytical queries: all queries for our dashboards (e.g. sum of LLM tokens consumed over time)
- Table queries: Finding tracing data based on filtering and ordering selected via tables in our UI.
- Select by ID: Quickly locating a specific trace by its ID.
- High write throughput while allowing for updates. Our tracing data can be updated from the SKDs. Hence, we need an option to update rows in the database.
- Self-hosting: We needed a database that is free to use for self-hosters, avoiding dependencies on specific cloud providers.
- Low operational effort: As a small team, we focus on building features for our users. We try to keep operational efforts as low as possible.
Why Clickhouse is great
- Optimized for Analytical Queries: ClickHouse is a modern OLAP database capable of ingesting data at high rates and querying it with low latency. It handles billions of rows efficiently.
- Rich feature-set: Clickhouse offers different Table Engines, Materialized views, different types of Indices, and many integrations which helps us to build fast and achieve low latency read queries.
- Our self-hosters can use the official Clickhouse Helm Charts and Docker Images for deploying in the cloud infrastructure of their choice.
- Clickhouse Cloud: Clickhouse Cloud is a database as a SaaS service which allows us to reduce operational efforts on our side.
When talking to other companies and looking at their code bases, we learned that Clickhouse is a popular choice these days for analytical workloads. Many modern observability tools, such as Signoz or Posthog, as well as established companies like Cloudflare, use Clickhouse for their analytical workloads.
Clickhouse vs. others
We think there are many great OLAP databases out there and are sure that we could have chosen an alternative and would also succeed with it. However, here are some thoughts on alternatives:
- Druid: Unlike Druid’s modular architecture, ClickHouse provides a more straightforward, unified instance approach. Hence, it is easier for teams to manage Clickhouse in production as there are fewer moving parts. This reduces the operational burden especially for our self-hosters.
- StarRocks: We think StarRocks is great but early. The vast amount of features in Clickhouse help us to remain flexible with our requirements while benefiting from the performance of an OLAP database.
Building an adapter and support multiple databases
We explored building a multi-database adapter to support Postgres for smaller self-hosted deployments. After talking to engineers and reviewing some of PostHog’s Clickhouse implementation, we decided against this path due to its complexity and maintenance overhead. This allows us to focus our resources on building user features instead.
Why Redis
We added a Redis instance to serve cache and queue use-cases within our stack. With its open source license, broad native support my major cloud vendors, and ubiquity in the industry, Redis was a natural choice for us.
Why S3/Blob Store
Observability data for LLM application tends to contain large, semi-structured bodies of data to represent inputs and outputs. We chose S3/Blob Store as a scalable, secure, and cost-effective solution to store these large objects. It allows us to store all incoming events for further processing and acts as a native backup solution, as the full state can be restored based on the events stored there.
Why Worker Container
When processing observability data for LLM applications, there are many CPU-heavy operations which block the main loop in our Node.js backend, e.g. tokenization and other parsing of event bodies. To achieve high availability and low latencies across client applications, we decided to move the heavy processing into an asynchronous worker container. It accepts events from a Redis queue and ensures that they are eventually being upserted into Clickhouse.
Support
If you experience any issues, please join us on Discord or contact the maintainers at [email protected].
For support with production deployments, the Langfuse team provides dedicated enterprise support. To learn more, reach out to [email protected] or schedule a demo.
Alternatively, you may consider using Langfuse Cloud, which is a fully managed version of Langfuse. You can find information about its security and privacy here.
FAQ
- Are prebuilt ARM images available?
- Are older versions of the SDK compatible with newer versions of Langfuse?
- I cannot connect to my docker deployment, what should I do?
- I have forgotten my password
- How can I restrict access on my self-hosted instance to internal users?
- How to manage different environments in Langfuse?
- Can I deploy multiple instances of Langfuse behind a load balancer?
- What kind of telemetry does Langfuse collect?