We Used Autoresearch on Our AI Skill, It Taught Us to Write Better Tests
We applied Karpathy's autoresearch to optimize our Langfuse prompt migration skill — and got a lesson in why the target function matters more than the optimizer.
Autoresearch has generated a lot of excitement since Karpathy released the package; he himself found 20 improvements in code he'd hand-tuned for months. It's designed for optimizing code, but the underlying principle applies more broadly. After seeing the post, we wondered: how well would it do at optimizing the Langfuse skill?
What is Autoresearch?
Autoresearch is a minimal ~630-line Python script that automates the experimentation loop. An outer "optimizer" agent reads a target file, generates a hypothesis for how to improve it, and makes changes. An inner loop then runs the modified file against an evaluator that scores the result. If the score improves, the change is kept; if not, it's discarded. This repeats indefinitely; you just kill the loop when you're done.
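In pseudocode, the loop looks something like this. This is a minimal sketch in our own naming, not autoresearch's actual API; the `max_iters` cap stands in for "kill it when you're done":

```python
def run_autoresearch(target, propose_change, evaluate, max_iters=100):
    """Minimal sketch of the autoresearch loop: propose, score, keep or discard.

    propose_change stands in for the optimizer agent generating a hypothesis
    and applying it; evaluate stands in for the scoring harness.
    """
    best = target
    best_score = evaluate(best)
    for _ in range(max_iters):             # the real loop runs until you kill it
        candidate = propose_change(best)   # optimizer agent edits the target
        score = evaluate(candidate)        # evaluator scores the result
        if score > best_score:             # improvement: keep the change
            best, best_score = candidate, score
        # otherwise the change is discarded and the next hypothesis is tried
    return best, best_score
```

The only state carried between experiments is the current best version and its score, which is what makes the loop so easy to leave running unattended.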
The result is hands-off iteration at machine speed. Karpathy reported ~12 experiments per hour, meaning ~100 experiments overnight.
The Setup for Autoresearching the Langfuse Skill
The Langfuse skill is quite broad. It has instructions for general best practices, and detailed guidance for specific use cases. To contain the scope, we decided to focus specifically on the prompt migration use case.
The specific prompt migration instructions are in references/prompt-migration.md.
To make this approach work for an AI skill instead of code, we made a couple of adaptations to the autoresearch setup:
| Autoresearch | Our Equivalent | Description |
|---|---|---|
| train.py | skill repo | The skill repo, including the prompt-migration-specific file |
| prepare.py | evaluate.py | Evaluation harness: runs claude -p "Migrate my prompts to Langfuse" against test codebases and scores the result with the target function |
| program.md | program.md | Instructions for the optimizer agent |
| | test_cases repo | A repository with example codebases that contain hardcoded prompts |
We used the following target function:
score = (correctness * 0.5) + (completeness * 0.3) + (efficiency * 0.2)

- Correctness is determined using static code checks: original prompts removed, `get_prompt()` and `.compile()` present, `label="production"` used, prompts created in Langfuse with correct content, `{{var}}` syntax, correct type
- Completeness is measured as the fraction of expected prompts actually migrated
- Efficiency is a bonus/penalty based on the number of agent turns
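The weights translate directly into code. A minimal sketch, where the check names are illustrative and not the evaluator's actual identifiers:

```python
def correctness(checks: dict[str, bool]) -> float:
    """Correctness as the fraction of static checks that pass (simplified)."""
    return sum(checks.values()) / len(checks)

def target_score(correctness_score: float, completeness: float, efficiency: float) -> float:
    """The weighted target function: correctness dominates, efficiency is a small nudge."""
    return correctness_score * 0.5 + completeness * 0.3 + efficiency * 0.2
```

Keeping correctness at half the weight means an agent that migrates everything sloppily still scores worse than one that migrates most prompts cleanly.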
We also added a stopping criterion, which was not present in the original project: the loop stops after 5 consecutive experiments with no improvement.
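One way to implement that criterion (our sketch, not code from the project):

```python
def should_stop(score_history: list, patience: int = 5) -> bool:
    """Stop once the last `patience` scores all failed to beat the best before them."""
    if len(score_history) <= patience:
        return False
    best_before = max(score_history[:-patience])
    # If even the best of the recent scores doesn't beat the earlier best,
    # we've had `patience` consecutive experiments with no improvement.
    return max(score_history[-patience:]) <= best_before
```

Because the best-so-far score is monotone, comparing the max of the recent window against the max of everything before it is equivalent to checking that none of the recent experiments improved.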
Test Case Repositories
We built 6 test codebases of increasing difficulty. In each experiment, the agent runs the skill against all 6 and the evaluator scores the results.
| Case | Description | Prompts | Difficulty |
|---|---|---|---|
| 01 | Single-file OpenAI chatbot | 1 | Easy |
| 02 | Multi-file OpenAI + Anthropic | 3 | Medium |
| 03 | Jinja2 templates with {% if %}, {% for %} | 2 | Hard |
| 04 | 12 prompts across 9 files in 5 modules | 12 | Hard (scale) |
| 05 | Edge cases: f-strings, .format(), concatenation, dicts | 6 | Hard (variety) |
| 06 | Scattered: prompts in .txt, .yaml, .md, and inline code | 6 | Hard (discovery) |
For each case we have an expected.json defining exactly what the evaluator checks: which prompts should exist in Langfuse, which variables they should have, which patterns should be gone from the code.
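A simplified checker in the spirit of our evaluator might look like this; the `expected.json` schema shown here is illustrative, not the exact one we used:

```python
import re

def check_case(expected: dict, created_prompts: dict, code: str) -> dict:
    """Run one test case's checks (expected.json schema simplified for illustration)."""
    results = {}
    for spec in expected["prompts"]:
        body = created_prompts.get(spec["name"])
        results[f"exists:{spec['name']}"] = body is not None
        if body is not None:
            # Every expected variable must appear in double-brace form.
            results[f"vars:{spec['name']}"] = all(
                "{{" + v + "}}" in body for v in spec["variables"]
            )
    for pattern in expected["gone_patterns"]:
        # Hardcoded prompt text must be gone from the refactored code.
        results[f"gone:{pattern}"] = re.search(pattern, code) is None
    return results
```

Every check is a named boolean, which makes it easy to see exactly which requirement an experiment regressed on.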
Before and After
Here's what the skill looked like before and after 14 experiments. As you can see, the tone and the degree of freedom given to the agent are completely different.
Original skill (before)
# Langfuse Prompt Migration
Migrate hardcoded prompts to Langfuse for version control, A/B testing, and deployment-free iteration.
## Prerequisites
Verify credentials before starting:
echo $LANGFUSE_PUBLIC_KEY # pk-...
echo $LANGFUSE_SECRET_KEY # sk-...
echo $LANGFUSE_HOST # https://cloud.langfuse.com or self-hosted
If not set, ask user to configure them first.
## Migration Flow
1. Scan codebase for prompts
2. Analyze templating compatibility
3. Propose structure (names, subprompts, variables)
4. User approves
5. Create prompts in Langfuse
6. Refactor code to use get_prompt()
7. Link prompts to traces (if tracing enabled)
8. Verify application works
## Step 1: Find Prompts
Search for these patterns:
| Framework | Look for |
|-----------|----------|
| OpenAI | messages=[{"role": "system", "content": "..."}] |
| Anthropic | system="..." |
| LangChain | ChatPromptTemplate, SystemMessage |
| Vercel AI | system: "...", prompt: "..." |
| Raw | Multi-line strings near LLM calls |
## Step 2: Check Templating Compatibility
CRITICAL: Langfuse only supports simple {{variable}} substitution. No conditionals, loops, or filters.
### Decision Tree
Contains {% if %}, {% for %}, or filters?
├─ No → Direct migration
└─ Yes → Choose:
├─ Option A (RECOMMENDED): Move logic to code, pass pre-computed values
└─ Option B: Store raw template, compile client-side with Jinja2
└─ ⚠️ Loses: Playground preview, UI experiments
## Step 3: Propose Structure
### Naming Conventions
| Rule | Example | Bad |
|------|---------|-----|
| Lowercase, hyphenated | chat-assistant | ChatAssistant_v2 |
| Feature-based | document-summarizer | prompt1 |
| Hierarchical for related | support/triage | supportTriage |
| Prefix subprompts with _ | _base-personality | shared-personality |
### Identify Subprompts
Extract when:
- Same text in 2+ prompts
- Represents distinct component (personality, safety rules, format)
- Would need to change together
### Variable Extraction
| Make Variable | Keep Hardcoded |
|---------------|----------------|
| User-specific ({{user_name}}) | Output format instructions |
| Dynamic content ({{context}}) | Safety guardrails |
| Per-request ({{query}}) | Persona/personality |
| Environment-specific ({{company_name}}) | Static examples |
## Step 4: Present Plan to User
Found N prompts across M files:
...
Proceed?
## Step 5: Create Prompts in Langfuse
Use langfuse.create_prompt() with:
- name, prompt, type ("text" or "chat"), labels: ["production"], config
Labeling strategy:
- production → All migrated prompts
- staging → Add later for testing
- latest → Auto-applied by Langfuse
## Step 6: Refactor Code
prompt = langfuse.get_prompt("name", label="production")
messages = prompt.compile(var1=value1, var2=value2)
## Step 7: Link Prompts to Traces
If codebase uses Langfuse tracing, link prompts so you can see which version produced each response.
### Detect Existing Tracing
Look for: @observe() decorators, langfuse.trace() calls, instrumented OpenAI client
### Link Methods
| Setup | How to Link |
|-------|-------------|
| @observe() decorator | langfuse_context.update_current_observation(prompt=prompt) |
| Manual tracing | trace.generation(prompt=prompt, ...) |
| OpenAI integration | openai.chat.completions.create(..., langfuse_prompt=prompt) |
## Step 8: Verify Migration
Checklist:
- All prompts created with production label
- Code fetches with label="production"
- Variables compile without errors
- Subprompts resolve correctly
- Application behavior unchanged
- Generations show linked prompt in UI (if tracing)

Skill after autoresearch
# Langfuse Prompt Migration
Migrate ALL hardcoded prompts in a codebase to Langfuse prompt management.
Do not ask for confirmation — just execute the full migration.
## Prerequisites
Load credentials from .env:
set -a; source .env; set +a
Verify they are set: echo $LANGFUSE_PUBLIC_KEY
## Step 1: Find ALL Prompts
Search the ENTIRE codebase. Prompts hide in many places.
Search code files:
grep -rn "system.*content|role.*system|system=\"|system='" --include="*.py" --include="*.ts" --include="*.js" .
grep -rn "messages.create|completions.create|chat.completions" --include="*.py" --include="*.ts" .
grep -rn "ChatPromptTemplate|SystemMessage|HumanMessage|Template(" --include="*.py" .
Search non-code asset files — these are prompts too:
find . -name "*.txt" -o -name "*.yaml" -o -name "*.yml" -o -name "*.md" | grep -v node_modules
grep -rn "prompt|system|instruction" --include="*.yaml" --include="*.yml" .
grep -rn "open(" --include="*.py" . | grep -v __pycache__
Read ALL .txt, .yaml, .yml, and .md files you find — they may contain prompts loaded by Python code.
Do not stop until you have checked every file in the project.
## Step 2: Build Prompt Inventory
Before writing ANY code, make a complete list of every prompt you found. For each one, note:
1. Name: descriptive, lowercase, hyphenated
2. Source file: where the prompt text lives
3. Code file to refactor: the Python/JS file that USES the prompt
4. Type: chat or text
5. Variables: converted to {{var}} syntax
6. Prompt content: the actual text to upload
Jinja2 templates ({% if %}, {% for %}, filters): These CANNOT go into Langfuse. You must flatten them.
Example — flattening a Jinja2 conditional:
BEFORE: "You are an assistant. {% if user.is_premium %}You have premium access.{% else %}You have basic access.{% endif %}"
AFTER: "You are an assistant. {{capability_instructions}}"
Code: capability_instructions = "You have premium access." if user.is_premium else "You have basic access."
Example — flattening a Jinja2 loop:
BEFORE: "Available tools: {% for tool in tools %}- {{tool.name}}: {{tool.description}}\n{% endfor %}"
AFTER: "Available tools: {{tool_descriptions}}"
Code: tool_descriptions = "\n".join(f"- {t.name}: {t.description}" for t in tools)
## Step 3: Create ALL Prompts in Langfuse
CRITICAL — Variable syntax: Langfuse uses DOUBLE curly braces: {{var}}.
Never upload {var} — it must be {{var}}.
Create every prompt using curl. No pip install needed.
curl -s -X POST "$LANGFUSE_HOST/api/public/v2/prompts" \
-u "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" \
-H "Content-Type: application/json" \
-d '{"name": "prompt-name", "prompt": "...", "type": "text", "labels": ["production"]}'
IMPORTANT: Create ALL prompts BEFORE refactoring any code.
## Step 4: Refactor ALL Code
from langfuse import Langfuse
langfuse = Langfuse()
prompt = langfuse.get_prompt("prompt-name", label="production")
compiled = prompt.compile(variable_name=value)
Key rules:
- ALWAYS use label="production"
- ALWAYS call .compile() with all variables
- Remove the original hardcoded prompt string entirely
- For asset files: remove file-reading code, replace with get_prompt().compile()
- For Jinja2 templates: remove Template() and .render() calls
## Step 5: Verify
1. Code no longer contains hardcoded prompt text
2. All code uses get_prompt() with label="production" and .compile()
3. Asset files are no longer read by the code

In summary, a couple of things changed:
| Aspect | Original | After Autoresearch |
|---|---|---|
| Tone | More of a documentation guide: explains concepts, offers options | Action script: tells the agent exactly what to do |
| Approval gate | "4. User approves" step with Proceed? prompt | Removed: "Do not ask for confirmation — just execute" |
| Prompt discovery | Framework table to "look for" | Literal grep and find commands |
| Asset file discovery | Not mentioned | Explicit search for .txt, .yaml, .md files + open() calls |
| Jinja2 handling | Decision tree with Option A vs B | Two concrete before/after flattening examples |
| Prompt creation | Python SDK (langfuse.create_prompt()) | curl against REST API (no pip dependency) |
| Variable syntax | Mentioned in a table row | **CRITICAL** bold warning, repeated emphasis |
| Subprompts | Full section on identifying and extracting | Removed entirely |
| Trace linking | Full step with detection + methods table | Removed entirely |
| Inventory step | Implicit | Explicit: build a complete list with 6 fields per prompt before touching anything |
We didn't accept all the updates
The target score went from 0.35 to 0.824, but as we'll see below, not all of these changes are improvements.
You can see the changes we cherry-picked and committed to the real skill in this PR. The rest was discarded.
Digging Into the Results
The agent ran 14 experiments in total. There were instances where autoresearch found something genuinely useful, and instances where it optimized for the test harness instead of actual usage.
Where It Did Well
1. The double-brace CRITICAL warning
Across multiple experiments, the agent kept uploading prompts to Langfuse with {var} instead of {{var}}. Langfuse uses double curly braces for variable substitution — single braces get treated as literal text.
The original skill mentioned this too, but it was tucked away in a table row. Autoresearch escalated it twice: first to inline instructions, then to a **CRITICAL** bold warning at the top of the prompt creation step. Each escalation improved the score.
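The failure mode is easy to guard against with a static check. The detector below is our illustration of the idea, not the evaluator's actual code:

```python
import re

def single_brace_vars(prompt: str) -> list:
    """Find {var} occurrences that aren't part of a {{var}} pair.

    Langfuse substitutes only double-brace {{var}} placeholders; a stray
    {var} would be uploaded and served as literal text.
    """
    # Match {word} where the braces are not doubled on either side.
    return re.findall(r"(?<!\{)\{(\w+)\}(?!\})", prompt)
```

Running a check like this before upload turns a silent production bug into an immediate, named failure.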
2. The inventory step
On complex cases (12 prompts across 9 files), the agent would often start modifying code before it had a full picture of what needed to change, and then miss prompts or lose track.
Autoresearch added an explicit step: "Before writing ANY code, make a complete list of every prompt you found" with 6 required fields (name, source file, code file to refactor, type, variables, prompt content). This forced the agent to plan before acting. Case 04 went from erratic results to consistently scoring above 0.88. Planning before acting is obvious advice for humans; turns out agents need it spelled out.
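The six required fields map naturally onto a small record type. This sketch is our illustration; the skill itself just asks the agent to write the list out in prose:

```python
from dataclasses import dataclass

@dataclass
class InventoryEntry:
    """One row of the pre-migration prompt inventory (six required fields)."""
    name: str          # descriptive, lowercase, hyphenated
    source_file: str   # where the prompt text lives
    code_file: str     # the Python/JS file that uses the prompt
    prompt_type: str   # "chat" or "text"
    variables: list    # variable names, already converted to {{var}} syntax
    content: str       # the actual text to upload
```

Forcing the agent to fill in every field for every prompt before touching code is what stopped it from losing track on the 12-prompt case.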
Where It Did Something Irrelevant or Harmful
1. Skipping doc fetching
The main skill file says "Documentation First: always fetch current docs before writing code" because Langfuse updates frequently. Autoresearch changed this to "For tasks covered by a reference file, follow that file directly." That saves turns and improves the efficiency score. It also means the skill will silently use outdated instructions when the API changes: the edit improves scores in the test harness today, but it will break in production tomorrow.
2. Removing the approval gate and switching to curl
This was autoresearch's single biggest score improvement — cases 01-03 went from 0.00 correctness to 1.00 in one experiment. The original skill had a step where the agent presents its migration plan and waits for user approval before modifying code. In the autoresearch harness, there's no human, so that approval never comes and the agent just plans forever. Removing it makes the skill worse for its actual use case, where you probably want to review what it's about to do.
Similarly, autoresearch switched from langfuse.create_prompt() (Python SDK) to raw curl commands against the REST API, because pip install langfuse would fail in the sandboxed test environments. In real user repositories where langfuse is already installed, the SDK is cleaner and less error-prone.
This was not entirely the agent's fault. It's a good example of how not spending enough time making the agent harness representative of the real world leads to suboptimal results.
3. Removing subprompts and trace linking
The original skill had full sections on identifying shared text across prompts (subprompts) and linking prompts to traces for observability. Autoresearch removed both entirely. Why? Because none of the 6 test cases covered these features. If it's not measured, it gets cut. From a product perspective, these are features users actually need — the skill now can't help with a common post-migration step.
Your Target Function and Agent Harness are Everything
Autoresearch optimizes for exactly what you measure given the context you execute in. If your target function has gaps, it will find and exploit them. The community around autoresearch has been raising this same concern: it's Goodhart's Law at machine speed. Whatever metric you expose, the agent will exploit it relentlessly.
Beyond that, a common criticism of autoresearch in the broader community is validation-set overfitting. Run hundreds of experiments against a fixed set of test cases, and you end up optimizing for quirks of that specific data.
The problem is: most people will not build a complete enough target function. We didn't.
For a skill that does one narrow thing, it's feasible to build a target function that covers the full surface area. And autoresearch will probably give you great results. For example, Shopify's Tobi Lutke applied autoresearch to their Liquid templating engine — a narrow, well-defined optimization target — and got 53% faster rendering and 61% fewer memory allocations from 93 automated commits. He still noted the overfitting though.
For broad skills like this one, the surface area is too large to get everything into the target function and harness. So treat the output as inspiration. And spend enough time on the preparation. The workflow is not "run it and commit the result." It's:
- Spend enough time on the setup to get the harness and target function right. As the community around autoresearch has noted, the human job moves from "can you implement this?" to "can you write a good program.md that produces useful research?"
- Let it run
- Review critically: read every change, understand why it was made, ask whether it's a real improvement or a harness/target function artifact
- Cherry-pick the relevant improvements, discard the overfitting
The bottom line: review it like a junior engineer's PR. They'll have good ideas mixed with bad ones. Some changes will be insightful. Others will be shortcuts that happen to pass the tests, like removing the approval gate or switching to curl. In ambiguous settings like this, it's your job to separate them.
Would We Do It Again?
Yes. Despite everything above, autoresearch was definitely useful.
It tested the skill far more than we ever would have manually: 14 experiments across 6 codebases, dozens of full agent runs. It surfaced failure modes we hadn't considered (the double-brace issue, the planning-before-acting problem). It's valuable to have a process that stress-tests your skill from angles you wouldn't have thought of.
Using autoresearch for prompt optimization
Beyond skills, the same pattern could apply to prompt optimization. We've explored prompt improvement workflows before, and automating that loop with autoresearch is a natural next step. Though the same caveats apply: if your evaluation dataset is too narrow or your target function too simple, you'll overfit just as efficiently.