We Used Autoresearch on Our AI Skill, It Taught Us to Write Better Tests
We applied Karpathy's autoresearch to optimize our Langfuse prompt migration skill — and got a lesson in why the target function matters more than the optimizer.
Autoresearch has generated a lot of excitement since Karpathy released the package; he himself found 20 improvements in code he'd hand-tuned for months. It's designed for optimizing code, but the underlying principle applies more broadly. After seeing the post, we wondered: how well would it do at optimizing the Langfuse skill?
What is Autoresearch?
Autoresearch is a minimal ~630-line Python script that automates the experimentation loop. An outer "optimizer" agent reads a target file, generates a hypothesis for how to improve it, and makes changes. An inner loop then runs the modified file against an evaluator that scores the result. If the score improves, the change is kept; if not, it's discarded. This repeats indefinitely; you just kill the loop when you're done.
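In pseudocode, the loop looks something like this. This is a minimal sketch in our own naming, not autoresearch's actual API; the `max_iters` cap stands in for "kill it when you're done":

```python
def run_autoresearch(target, propose_change, evaluate, max_iters=100):
    """Minimal sketch of the autoresearch loop: propose, score, keep or discard.

    propose_change stands in for the optimizer agent generating a hypothesis
    and applying it; evaluate stands in for the scoring harness.
    """
    best = target
    best_score = evaluate(best)
    for _ in range(max_iters):             # the real loop runs until you kill it
        candidate = propose_change(best)   # optimizer agent edits the target
        score = evaluate(candidate)        # evaluator scores the result
        if score > best_score:             # improvement: keep the change
            best, best_score = candidate, score
        # otherwise the change is discarded and the next hypothesis is tried
    return best, best_score
```

The only state carried between experiments is the current best version and its score, which is what makes the loop so easy to leave running unattended.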
The result is hands-off iteration at machine speed. Karpathy reported ~12 experiments per hour, meaning ~100 experiments overnight.
The Setup for Autoresearching the Langfuse Skill
The Langfuse skill is quite broad. It has instructions for general best practices, and detailed guidance for specific use cases. To contain the scope, we decided to focus specifically on the prompt migration use case.
The specific prompt migration instructions are in references/prompt-migration.md.
To make this approach work for an AI skill instead of code, we made a couple of adaptations to the autoresearch setup:
| Autoresearch | Our Equivalent | Description |
|---|---|---|
| train.py | skill repo | The skill repo, including the prompt-migration-specific file |
| prepare.py | evaluate.py | Evaluation harness: runs claude -p "Migrate my prompts to Langfuse" against test codebases and scores the result with the target function |
| program.md | program.md | Instructions for the optimizer agent |
| | test_cases repo | A repository with example codebases that contain hardcoded prompts |
We used the following target function:
score = (correctness * 0.5) + (completeness * 0.3) + (efficiency * 0.2)

- Correctness is determined using static code checks: original prompts removed, `get_prompt()` and `.compile()` present, `label="production"` used, prompts created in Langfuse with correct content, `{{var}}` syntax, correct type
- Completeness is measured as the fraction of expected prompts actually migrated
- Efficiency is a bonus/penalty based on the number of agent turns
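The weights translate directly into code. A minimal sketch, where the check names are illustrative and not the evaluator's actual identifiers:

```python
def correctness(checks: dict[str, bool]) -> float:
    """Correctness as the fraction of static checks that pass (simplified)."""
    return sum(checks.values()) / len(checks)

def target_score(correctness_score: float, completeness: float, efficiency: float) -> float:
    """The weighted target function: correctness dominates, efficiency is a small nudge."""
    return correctness_score * 0.5 + completeness * 0.3 + efficiency * 0.2
```

Keeping correctness at half the weight means an agent that migrates everything sloppily still scores worse than one that migrates most prompts cleanly.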
We also added a stopping criterion, which was not present in the original project: the loop stops after 5 consecutive experiments with no improvement.
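One way to implement that criterion (our sketch, not code from the project):

```python
def should_stop(score_history: list, patience: int = 5) -> bool:
    """Stop once the last `patience` scores all failed to beat the best before them."""
    if len(score_history) <= patience:
        return False
    best_before = max(score_history[:-patience])
    # If even the best of the recent scores doesn't beat the earlier best,
    # we've had `patience` consecutive experiments with no improvement.
    return max(score_history[-patience:]) <= best_before
```

Because the best-so-far score is monotone, comparing the max of the recent window against the max of everything before it is equivalent to checking that none of the recent experiments improved.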
Test Case Repositories
We built 6 test codebases of increasing difficulty. In each experiment, the agent runs the skill against all 6 and the evaluator scores the results.
| Case | Description | Prompts | Difficulty |
|---|---|---|---|
| 01 | Single-file OpenAI chatbot | 1 | Easy |
| 02 | Multi-file OpenAI + Anthropic | 3 | Medium |
| 03 | Jinja2 templates with {% if %}, {% for %} | 2 | Hard |
| 04 | 12 prompts across 9 files in 5 modules | 12 | Hard (scale) |
| 05 | Edge cases: f-strings, .format(), concatenation, dicts | 6 | Hard (variety) |
| 06 | Scattered: prompts in .txt, .yaml, .md, and inline code | 6 | Hard (discovery) |
For each case we have an expected.json defining exactly what the evaluator checks: which prompts should exist in Langfuse, which variables they should have, which patterns should be gone from the code.
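A simplified checker in the spirit of our evaluator might look like this; the `expected.json` schema shown here is illustrative, not the exact one we used:

```python
import re

def check_case(expected: dict, created_prompts: dict, code: str) -> dict:
    """Run one test case's checks (expected.json schema simplified for illustration)."""
    results = {}
    for spec in expected["prompts"]:
        body = created_prompts.get(spec["name"])
        results[f"exists:{spec['name']}"] = body is not None
        if body is not None:
            # Every expected variable must appear in double-brace form.
            results[f"vars:{spec['name']}"] = all(
                "{{" + v + "}}" in body for v in spec["variables"]
            )
    for pattern in expected["gone_patterns"]:
        # Hardcoded prompt text must be gone from the refactored code.
        results[f"gone:{pattern}"] = re.search(pattern, code) is None
    return results
```

Every check is a named boolean, which makes it easy to see exactly which requirement an experiment regressed on.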
Before and After
Here's what the skill looked like before and after 14 experiments. As you can see, the tone and the degree of freedom given to the agent are completely different.
Original skill (before)
# Langfuse Prompt Migration
Migrate hardcoded prompts to Langfuse for version control, A/B testing, and deployment-free iteration.
## Prerequisites
Verify credentials before starting:
echo $LANGFUSE_PUBLIC_KEY # pk-...
echo $LANGFUSE_SECRET_KEY # sk-...
echo $LANGFUSE_HOST # https://cloud.langfuse.com or self-hosted
If not set, ask user to configure them first.
## Migration Flow
1. Scan codebase for prompts
2. Analyze templating compatibility
3. Propose structure (names, subprompts, variables)
4. User approves
5. Create prompts in Langfuse
6. Refactor code to use get_prompt()
7. Link prompts to traces (if tracing enabled)
8. Verify application works
## Step 1: Find Prompts
Search for these patterns:
| Framework | Look for |
|-----------|----------|
| OpenAI | messages=[{"role": "system", "content": "..."}] |
| Anthropic | system="..." |
| LangChain | ChatPromptTemplate, SystemMessage |
| Vercel AI | system: "...", prompt: "..." |
| Raw | Multi-line strings near LLM calls |
## Step 2: Check Templating Compatibility
CRITICAL: Langfuse only supports simple {{variable}} substitution. No conditionals, loops, or filters.
### Decision Tree
Contains {% if %}, {% for %}, or filters?
├─ No → Direct migration
└─ Yes → Choose:
├─ Option A (RECOMMENDED): Move logic to code, pass pre-computed values
└─ Option B: Store raw template, compile client-side with Jinja2
└─ ⚠️ Loses: Playground preview, UI experiments
## Step 3: Propose Structure
### Naming Conventions
| Rule | Example | Bad |
|------|---------|-----|
| Lowercase, hyphenated | chat-assistant | ChatAssistant_v2 |
| Feature-based | document-summarizer | prompt1 |
| Hierarchical for related | support/triage | supportTriage |
| Prefix subprompts with _ | _base-personality | shared-personality |
### Identify Subprompts
Extract when:
- Same text in 2+ prompts
- Represents distinct component (personality, safety rules, format)
- Would need to change together
### Variable Extraction
| Make Variable | Keep Hardcoded |
|---------------|----------------|
| User-specific ({{user_name}}) | Output format instructions |
| Dynamic content ({{context}}) | Safety guardrails |
| Per-request ({{query}}) | Persona/personality |
| Environment-specific ({{company_name}}) | Static examples |
## Step 4: Present Plan to User
Found N prompts across M files:
...
Proceed?
## Step 5: Create Prompts in Langfuse
Use langfuse.create_prompt() with:
- name, prompt, type ("text" or "chat"), labels: ["production"], config
Labeling strategy:
- production → All migrated prompts
- staging → Add later for testing
- latest → Auto-applied by Langfuse
## Step 6: Refactor Code
prompt = langfuse.get_prompt("name", label="production")
messages = prompt.compile(var1=value1, var2=value2)
## Step 7: Link Prompts to Traces
If codebase uses Langfuse tracing, link prompts so you can see which version produced each response.
### Detect Existing Tracing
Look for: @observe() decorators, langfuse.trace() calls, instrumented OpenAI client
### Link Methods
| Setup | How to Link |
|-------|-------------|
| @observe() decorator | langfuse_context.update_current_observation(prompt=prompt) |
| Manual tracing | trace.generation(prompt=prompt, ...) |
| OpenAI integration | openai.chat.completions.create(..., langfuse_prompt=prompt) |
## Step 8: Verify Migration
Checklist:
- All prompts created with production label
- Code fetches with label="production"
- Variables compile without errors
- Subprompts resolve correctly
- Application behavior unchanged
- Generations show linked prompt in UI (if tracing)

Skill after autoresearch
# Langfuse Prompt Migration
Migrate ALL hardcoded prompts in a codebase to Langfuse prompt management.
Do not ask for confirmation — just execute the full migration.
## Prerequisites
Load credentials from .env:
set -a; source .env; set +a
Verify they are set: echo $LANGFUSE_PUBLIC_KEY
## Step 1: Find ALL Prompts
Search the ENTIRE codebase. Prompts hide in many places.
Search code files:
grep -rn "system.*content|role.*system|system=\"|system='" --include="*.py" --include="*.ts" --include="*.js" .
grep -rn "messages.create|completions.create|chat.completions" --include="*.py" --include="*.ts" .
grep -rn "ChatPromptTemplate|SystemMessage|HumanMessage|Template(" --include="*.py" .
Search non-code asset files — these are prompts too:
find . -name "*.txt" -o -name "*.yaml" -o -name "*.yml" -o -name "*.md" | grep -v node_modules
grep -rn "prompt|system|instruction" --include="*.yaml" --include="*.yml" .
grep -rn "open(" --include="*.py" . | grep -v __pycache__
Read ALL .txt, .yaml, .yml, and .md files you find — they may contain prompts loaded by Python code.
Do not stop until you have checked every file in the project.
## Step 2: Build Prompt Inventory
Before writing ANY code, make a complete list of every prompt you found. For each one, note:
1. Name: descriptive, lowercase, hyphenated
2. Source file: where the prompt text lives
3. Code file to refactor: the Python/JS file that USES the prompt
4. Type: chat or text
5. Variables: converted to {{var}} syntax
6. Prompt content: the actual text to upload
Jinja2 templates ({% if %}, {% for %}, filters): These CANNOT go into Langfuse. You must flatten them.
Example — flattening a Jinja2 conditional:
BEFORE: "You are an assistant. {% if user.is_premium %}You have premium access.{% else %}You have basic access.{% endif %}"
AFTER: "You are an assistant. {{capability_instructions}}"
Code: capability_instructions = "You have premium access." if user.is_premium else "You have basic access."
Example — flattening a Jinja2 loop:
BEFORE: "Available tools: {% for tool in tools %}- {{tool.name}}: {{tool.description}}\n{% endfor %}"
AFTER: "Available tools: {{tool_descriptions}}"
Code: tool_descriptions = "\n".join(f"- {t.name}: {t.description}" for t in tools)
## Step 3: Create ALL Prompts in Langfuse
CRITICAL — Variable syntax: Langfuse uses DOUBLE curly braces: {{var}}.
Never upload {var} — it must be {{var}}.
Create every prompt using curl. No pip install needed.
curl -s -X POST "$LANGFUSE_HOST/api/public/v2/prompts" \
-u "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" \
-H "Content-Type: application/json" \
-d '{"name": "prompt-name", "prompt": "...", "type": "text", "labels": ["production"]}'
IMPORTANT: Create ALL prompts BEFORE refactoring any code.
## Step 4: Refactor ALL Code
from langfuse import Langfuse
langfuse = Langfuse()
prompt = langfuse.get_prompt("prompt-name", label="production")
compiled = prompt.compile(variable_name=value)
Key rules:
- ALWAYS use label="production"
- ALWAYS call .compile() with all variables
- Remove the original hardcoded prompt string entirely
- For asset files: remove file-reading code, replace with get_prompt().compile()
- For Jinja2 templates: remove Template() and .render() calls
## Step 5: Verify
1. Code no longer contains hardcoded prompt text
2. All code uses get_prompt() with label="production" and .compile()
3. Asset files are no longer read by the code

In summary, a couple of things changed:
| Aspect | Original | After Autoresearch |
|---|---|---|
| Tone | More of a documentation guide: explains concepts, offers options | Action script: tells the agent exactly what to do |
| Approval gate | "4. User approves" step with Proceed? prompt | Removed: "Do not ask for confirmation — just execute" |
| Prompt discovery | Framework table to "look for" | Literal grep and find commands |
| Asset file discovery | Not mentioned | Explicit search for .txt, .yaml, .md files + open() calls |
| Jinja2 handling | Decision tree with Option A vs B | Two concrete before/after flattening examples |
| Prompt creation | Python SDK (langfuse.create_prompt()) | curl against REST API (no pip dependency) |
| Variable syntax | Mentioned in a table row | **CRITICAL** bold warning, repeated emphasis |
| Subprompts | Full section on identifying and extracting | Removed entirely |
| Trace linking | Full step with detection + methods table | Removed entirely |
| Inventory step | Implicit | Explicit: build a complete list with 6 fields per prompt before touching anything |
We didn't accept all the updates
The target score went from 0.35 to 0.824, but as we'll see below, not all of these changes are improvements.
You can see the changes we cherry-picked and committed to the real skill in this PR. The rest was discarded.
Digging Into the Results
The agent ran 14 experiments in total. There were instances where autoresearch found something genuinely useful, and instances where it optimized for the test harness instead of actual usage.
Where It Did Well
1. The double-brace CRITICAL warning
Across multiple experiments, the agent kept uploading prompts to Langfuse with {var} instead of {{var}}. Langfuse uses double curly braces for variable substitution — single braces get treated as literal text.
The original skill mentioned this too, but it was tucked away in a table row. Autoresearch escalated it twice: first to inline instructions, then to a **CRITICAL** bold warning at the top of the prompt creation step. Each escalation improved the score.
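The failure mode is easy to guard against with a static check. The detector below is our illustration of the idea, not the evaluator's actual code:

```python
import re

def single_brace_vars(prompt: str) -> list:
    """Find {var} occurrences that aren't part of a {{var}} pair.

    Langfuse substitutes only double-brace {{var}} placeholders; a stray
    {var} would be uploaded and served as literal text.
    """
    # Match {word} where the braces are not doubled on either side.
    return re.findall(r"(?<!\{)\{(\w+)\}(?!\})", prompt)
```

Running a check like this before upload turns a silent production bug into an immediate, named failure.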
2. The inventory step
On complex cases (12 prompts across 9 files), the agent would often start modifying code before it had a full picture of what needed to change, and then miss prompts or lose track.
Autoresearch added an explicit step: "Before writing ANY code, make a complete list of every prompt you found" with 6 required fields (name, source file, code file to refactor, type, variables, prompt content). This forced the agent to plan before acting. Case 04 went from erratic results to consistently scoring above 0.88. Planning before acting is obvious advice for humans; turns out agents need it spelled out.
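The six required fields map naturally onto a small record type. This sketch is our illustration; the skill itself just asks the agent to write the list out in prose:

```python
from dataclasses import dataclass

@dataclass
class InventoryEntry:
    """One row of the pre-migration prompt inventory (six required fields)."""
    name: str          # descriptive, lowercase, hyphenated
    source_file: str   # where the prompt text lives
    code_file: str     # the Python/JS file that uses the prompt
    prompt_type: str   # "chat" or "text"
    variables: list    # variable names, already converted to {{var}} syntax
    content: str       # the actual text to upload
```

Forcing the agent to fill in every field for every prompt before touching code is what stopped it from losing track on the 12-prompt case.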
Where It Did Something Irrelevant or Harmful
1. Skipping doc fetching
The main skill file says "Documentation First: always fetch current docs before writing code" because Langfuse updates frequently. Autoresearch changed this to "For tasks covered by a reference file, follow that file directly." That saves turns and improves the efficiency score. It also means the skill will silently use outdated instructions when the API changes: the edit improves scores in the test harness today, but it will break in production tomorrow.
2. Removing the approval gate and switching to curl
This was autoresearch's single biggest score improvement — cases 01-03 went from 0.00 correctness to 1.00 in one experiment. The original skill had a step where the agent presents its migration plan and waits for user approval before modifying code. In the autoresearch harness, there's no human, so that approval never comes and the agent just plans forever. Removing it makes the skill worse for its actual use case, where you probably want to review what it's about to do.
Similarly, autoresearch switched from langfuse.create_prompt() (Python SDK) to raw curl commands against the REST API, because pip install langfuse would fail in the sandboxed test environments. In real user repositories where langfuse is already installed, the SDK is cleaner and less error-prone.
This was not entirely the agent's fault. It's a good example of how not spending enough time making the agent harness representative of the real world leads to suboptimal results.
3. Removing subprompts and trace linking
The original skill had full sections on identifying shared text across prompts (subprompts) and linking prompts to traces for observability. Autoresearch removed both entirely. Why? Because none of the 6 test cases covered these features. If it's not measured, it gets cut. From a product perspective, these are features users actually need — the skill now can't help with a common post-migration step.
Your Target Function and Agent Harness are Everything
Autoresearch optimizes for exactly what you measure given the context you execute in. If your target function has gaps, it will find and exploit them. The community around autoresearch has been raising this same concern: it's Goodhart's Law at machine speed. Whatever metric you expose, the agent will exploit it relentlessly.
Beyond that, a common criticism of autoresearch in the broader community is validation-set overfitting. Run hundreds of experiments against a fixed set of test cases, and you end up optimizing for quirks of that specific data.
The problem is: most people will not build a complete enough target function. We didn't.
For a skill that does one narrow thing, it's feasible to build a target function that covers the full surface area. And autoresearch will probably give you great results. For example, Shopify's Tobi Lutke applied autoresearch to their Liquid templating engine — a narrow, well-defined optimization target — and got 53% faster rendering and 61% fewer memory allocations from 93 automated commits. He still noted the overfitting though.
For broad skills like this one, the surface area is too large to get everything into the target function and harness. So treat the output as inspiration. And spend enough time on the preparation. The workflow is not "run it and commit the result." It's:
- Spend enough time on the setup to get the harness and target function right. As the community around autoresearch has noted, the human job moves from "can you implement this?" to "can you write a good program.md that produces useful research?"
- Let it run
- Review critically: read every change, understand why it was made, ask whether it's a real improvement or a harness/target function artifact
- Cherry-pick the relevant improvements, discard the overfitting
The bottom line: review it like a junior engineer's PR. They'll have good ideas mixed with bad ones. Some changes will be insightful. Others will be shortcuts that happen to pass the tests, like removing the approval gate or switching to curl. In ambiguous settings like this, it's your job to separate them.
Would We Do It Again?
Yes. Despite everything above, autoresearch was definitely useful.
It tested the skill far more than we ever would have manually: 14 experiments across 6 codebases, dozens of full agent runs. It surfaced failure modes we hadn't considered (the double-brace issue, the planning-before-acting problem). It's valuable to have a process that stress-tests your skill from angles you wouldn't have thought of.
Using autoresearch for prompt optimization
Beyond skills, the same pattern could apply to prompt optimization. We've explored prompt improvement workflows before, and automating that loop with autoresearch is a natural next step. Though the same caveats apply: if your evaluation dataset is too narrow or your target function too simple, you'll overfit just as efficiently.