Troubleshooting
This guide covers common issues that Langfuse self-hosters observe and how to address them. If you encounter an issue that is not covered here, please open an issue or start a discussion.
Missing Events After POST /api/public/ingestion
If you are not seeing events within minutes of posting them to /api/public/ingestion, it is likely that the events are not being ingested correctly.
Events do not appear immediately in the UI, as they are being processed asynchronously.
If your events are not shown after a few minutes, you can check the following:
- Check the Langfuse Web logs: Look for any errors in the Langfuse Web container around the time that you ingested the events. Errors there indicate that the event is malformed or that Redis or S3 is not available. In this case, you should also see non-207 status codes within your application.
- Check the S3/Blob Storage bucket: Validate that the event was uploaded correctly into your blob storage. It should be available in a path like /<projectId>/<type>/<eventId>/<randomId>.json. If the event was accepted in the Langfuse Web container, but is not available in S3, it indicates an issue with your S3 configuration.
- Check the Langfuse Worker logs: Look for any errors in the Langfuse Worker container within 0-60 seconds of ingesting your event. If no events at all are being processed, it usually indicates a configuration issue around Redis or S3.
- Check ClickHouse tables: If the previous processing looks correct, validate whether you can find the event in ClickHouse in the traces, observations, or scores table. Search for the respective projectId and eventId. If you cannot find the event in ClickHouse, but the worker indicates it was processed, or if you can find it in ClickHouse and it is not returned via the API, please open an issue.
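As a rough aid for the blob storage check above, the following Python sketch builds the key prefix under which an event would be expected, following the path pattern /<projectId>/<type>/<eventId>/<randomId>.json. The function names and the bucket_prefix parameter are hypothetical, and the exact key layout in your deployment may differ, so treat this as a starting point rather than a definitive check.

```python
import re


def expected_event_prefix(project_id: str, event_type: str, event_id: str,
                          bucket_prefix: str = "") -> str:
    """Build the key prefix an event is expected under.

    Assumes the layout <projectId>/<type>/<eventId>/<randomId>.json described
    above; bucket_prefix is a hypothetical, deployment-specific prefix.
    """
    return f"{bucket_prefix}{project_id}/{event_type}/{event_id}/"


def matching_keys(keys, project_id, event_type, event_id, bucket_prefix=""):
    """Filter a listing of object keys down to the JSON files for one event."""
    prefix = expected_event_prefix(project_id, event_type, event_id, bucket_prefix)
    pattern = re.compile(re.escape(prefix) + r"[^/]+\.json$")
    # Tolerate a leading slash, since some tools print keys with one.
    return [k for k in keys if pattern.match(k.lstrip("/"))]
```

Pass in the key listing produced by whichever tool you use to browse your bucket. An empty result for an event that the Langfuse Web container accepted points towards the S3 configuration.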
Intermittent 502 and 504 network errors
If you are experiencing intermittent 502 and 504 network errors, this is likely related to your load balancer and keep-alive configuration. The keep-alive timeout of a server should be set to a higher value than the idle timeout of the load balancer in front of it.
As an example, the AWS Application Load Balancer has a default idle timeout of 60 seconds. If your service closes a connection after 45 seconds, the load balancer may attempt to reuse a connection that it still believes to be alive, which results in a 502 error.
Hence, we recommend that you configure KEEP_ALIVE_TIMEOUT on the Langfuse Web container to be at least 5 seconds higher than your load balancer idle timeout.
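The rule above can be sketched as a small Python check. The function names are hypothetical, and the sketch assumes both values are expressed in seconds.

```python
def recommended_keep_alive(lb_idle_timeout_s: int, margin_s: int = 5) -> int:
    """Smallest keep-alive value that satisfies the 'idle timeout + 5 seconds' rule."""
    return lb_idle_timeout_s + margin_s


def keep_alive_is_safe(keep_alive_s: int, lb_idle_timeout_s: int,
                       margin_s: int = 5) -> bool:
    """True if the server keeps connections open longer than the load balancer does."""
    return keep_alive_s >= lb_idle_timeout_s + margin_s


if __name__ == "__main__":
    # Values from the example: 60 s load balancer idle timeout, 45 s server timeout.
    print(recommended_keep_alive(60))   # 65
    print(keep_alive_is_safe(45, 60))   # False
```

With the 60 second default mentioned above, this yields a KEEP_ALIVE_TIMEOUT of at least 65.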
JavaScript heap out of memory
If you are experiencing JavaScript heap out of memory errors within your applications, it indicates that the application thinks it has less memory available than it does. An example case is that your container has 2 GiB of memory, whereas the Node.js application uses the default max-old-space-size of 1.7 GiB. This would surface as an error message like
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
To address this issue, we recommend that you configure NODE_OPTIONS=--max-old-space-size=${var.memory} on the Langfuse Web and the Langfuse Worker containers.
Use the available memory in MiB as the value for var.memory, e.g. 4096 for 4 GiB of memory.
The value should be equal to or above the memory limit of the container.
This ensures that your container orchestrator kills the pod gracefully if the memory limit is exceeded, instead of the application terminating abruptly.
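To illustrate the GiB to MiB conversion, here is a minimal Python sketch that formats the NODE_OPTIONS value for a given container memory limit. The helper name node_options_for is hypothetical; only the --max-old-space-size flag itself comes from the guidance above.

```python
def node_options_for(memory_gib: float) -> str:
    """Format the NODE_OPTIONS value for a container with memory_gib GiB of memory."""
    memory_mib = int(memory_gib * 1024)  # 1 GiB = 1024 MiB
    return f"--max-old-space-size={memory_mib}"


if __name__ == "__main__":
    # Prints: NODE_OPTIONS=--max-old-space-size=4096
    print("NODE_OPTIONS=" + node_options_for(4))
```

Set the resulting value on both the Langfuse Web and the Langfuse Worker containers.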