Self-Evolving Autoresearch Workflow Loops

In this article we explain how we ported evo's autoresearch loop onto workflows and then made the loop itself dynamic.

On June 2 Anthropic shipped dynamic workflows in Claude Code: Claude writes a small JavaScript program on the fly that spawns and coordinates subagents. The coordination runs as code; the model does the judgment. The takeaway is that orchestration itself moved out of the model's decisions and can now be described as code. h/t @trq212's writeup

what evo is

evo is an autoresearch orchestrator. You give it a system, a definition of "better," and a budget. It generates hypotheses, runs each one in its own isolated workspace, scores it, and keeps a tree of attempts - extending what works, pruning what doesn't - while an auditor checks every accepted change so the optimizer can't game the metric. Open source; runs on Claude Code, Codex, Cursor, and others.
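To make the "tree of attempts" concrete, here is a minimal sketch of how such a tree could be held in memory. The field names, the definition of the frontier (accepted, unpruned nodes with no accepted child yet), and the "higher score is better" convention are all assumptions for illustration, not evo's actual storage format.

```javascript
// Hypothetical in-memory experiment tree; evo's real schema may differ.
function createTree() {
  return { nodes: new Map(), nextId: 1 };
}

// Record an attempt as a child of parentId (null for the baseline).
// An attempt only counts as accepted if the auditor signed off on it.
function addAttempt(tree, parentId, score, audited) {
  const id = tree.nextId++;
  const status = audited ? "accepted" : "rejected";
  tree.nodes.set(id, { id, parentId, score, status, pruned: false });
  return id;
}

// Assumed definition: the frontier is every accepted, unpruned node
// that has no accepted child yet.
function frontier(tree) {
  const hasAcceptedChild = new Set();
  for (const n of tree.nodes.values()) {
    if (n.status === "accepted" && n.parentId !== null) {
      hasAcceptedChild.add(n.parentId);
    }
  }
  return [...tree.nodes.values()].filter(
    (n) => n.status === "accepted" && !n.pruned && !hasAcceptedChild.has(n.id)
  );
}

// Top `width` frontier nodes by score (higher is better in this sketch).
function pickParents(tree, width) {
  return frontier(tree)
    .sort((a, b) => b.score - a.score)
    .slice(0, width);
}
```

Extending what works then amounts to adding children under the nodes `pickParents` returns, and pruning amounts to setting `pruned` on a lineage so it drops out of the frontier.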

why we moved the loop onto workflows

The loop used to be orchestrated in-context, as one long agent run holding the whole plan: which phase comes next, how many experiments to launch, when to stop. evo does autoresearch in an opinionated way, and at every step the agent has to follow that method and drive the CLI we ship alongside it. Over a long autoresearch run, getting the agent to adhere to all of that was tricky. Prompt and instruction adherence is unreliable on long-horizon tasks: across dozens of rounds the standing rules (run this phase, use this CLI command, dedupe the briefs, keep the gate strict, etc.) quietly stop being followed, and the longer a single context runs, the less it holds.

Moving the loop onto a dynamic workflow fixes that at the root. The method is the code now: the phases, the fan-out width, the stopping rule, the gates, and the CLI calls are part of the script, deterministic and the same on round 1 and round 1000. Adherence stops being something the model has to remember. Every step is a fresh, scoped subagent with one job and a clean context, so there's nothing to drift. The model does judgment; the code does coordination.
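As one small illustration of a rule living in code instead of in a prompt, a stopping rule can be an ordinary function. The sketch below is an assumption about what such a rule might look like (the `patience` and `minDelta` parameters are hypothetical names), not evo's actual rule.

```javascript
// Hypothetical stopping rule: the run is stalled when the best-so-far score
// has not improved by more than `minDelta` over the last `patience` rounds.
// `bestPerRound` holds the best score seen up to each round (higher is better).
function isStalled(bestPerRound, patience, minDelta = 0) {
  // Not enough history yet to judge.
  if (bestPerRound.length <= patience) return false;
  const recent = bestPerRound[bestPerRound.length - 1];
  const earlier = bestPerRound[bestPerRound.length - 1 - patience];
  return recent - earlier <= minDelta;
}
```

Because the check is a plain function, it behaves identically on every round, regardless of how long the run has been going.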

what the evo autoresearch workflow runs: one round

Each round of the optimize loop walks the same six steps, in code:

Orient: Read the experiment tree: best score, the ceiling, the open frontier. Take the top width frontier nodes as this round's parents.

Scan: Agents comb the evaluated nodes in parallel for what's working and what's failing, while an aggregate agent looks for patterns across the whole tree.

Ideate: On a stall, three research agents fire at once: one extrapolates the best branch, one dissects the failures, one reads the literature and the web.

Brief: A writer folds the scan findings, the patterns, and the ideas into concrete experiment briefs, then dedupes them.

Fan-out: One lane per brief, in parallel. Each lane implements the change, pre-verifies it (and revises if it's gaming the metric), runs it, then post-audits with the verifier.

Collect: Prune dead lineages, record notes, and repeat until the score stops improving.
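The six steps above can be sketched as a single async function. This is a simplified illustration, not evo's implementation: `agent(role, input)` is a hypothetical stand-in for spawning a scoped subagent, the `state` fields are assumed names, and deduping the briefs in code (rather than by the writer) is a simplification.

```javascript
// Hypothetical sketch of one round. `agent(role, input)` stands in for
// spawning a fresh subagent with one job; it is not a real evo or Claude API.
async function runRound(state, agent) {
  // Orient: take the top `width` frontier nodes as this round's parents.
  const parents = [...state.frontier]
    .sort((a, b) => b.score - a.score)
    .slice(0, state.width);

  // Scan: per-node scanners and one aggregate agent, all in parallel.
  const [findings, patterns] = await Promise.all([
    Promise.all(state.evaluated.map((node) => agent("scan", node))),
    agent("aggregate", state.evaluated),
  ]);

  // Ideate: only on a stall, three research agents fire at once.
  const ideas = state.stalled
    ? await Promise.all(
        ["extrapolate", "dissect", "literature"].map((role) => agent(role, state))
      )
    : [];

  // Brief: the writer returns brief strings; exact duplicates are dropped here.
  const drafts = await agent("brief", { parents, findings, patterns, ideas });
  const briefs = [...new Set(drafts)];

  // Fan-out: one lane per brief, in parallel.
  const results = await Promise.all(briefs.map((brief) => runLane(brief, agent)));

  // Collect: lanes that failed the post-audit are marked for pruning.
  const pruned = results.filter((r) => !r.accepted).map((r) => r.brief);
  return { parents, briefs, results, pruned };
}

async function runLane(brief, agent) {
  let change = await agent("implement", brief);
  const pre = await agent("pre-verify", change);
  if (pre.gaming) change = await agent("revise", change); // revise if gaming the metric
  const score = await agent("run", change);
  const audit = await agent("post-audit", { change, score });
  return { brief, score, accepted: audit.ok };
}
```

The point of the sketch is that the order of phases, the parallelism, and the gates are fixed by control flow; the subagents only supply judgment inside each step.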

It worked, but the workflow still ran the same shape every round: the same phases (orient, scan, ideate, brief, fan-out, collect), the same steps, the same prompts, no matter what the run had learned about itself. A long run turns up things a fixed shape can't handle: one experiment class needs a verifier step the loop doesn't have, another needs a specific method injected, a phase stops earning its value and should come out.

now: the loop evolves itself

evo 0.5 makes the optimize loop self-evolving. A second workflow runs alongside the first. Two async loops on one event loop, joined with `Promise.all`:

- the optimize loop is the driver: the workflow defined above, unchanged

- the meta loop is a concurrent observer: a fresh agent that wakes every few minutes, reads the run from the outside, and rewrites the optimize loop while it runs

They share one plain object, the harness: the steps the loop runs, the phases and the prompts they use, the gates and verifiers in play (alongside knobs like width and stall that were always adjustable). The optimizer reads it every round; the meta thread writes it. Same event loop, so writes land between the optimizer's awaits, with no locks and no second process.
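A minimal sketch of that arrangement, under stated assumptions: the harness fields (`phases`, `width`, `done`), the timings, and the `edit` callback are illustrative, and `sleep` stands in for awaiting real subagents.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Driver: re-reads the shared harness at the top of every round.
async function optimizeLoop(harness, rounds, log) {
  for (let round = 1; round <= rounds; round++) {
    // Snapshot the phases so a single round runs one consistent shape.
    const phases = [...harness.phases];
    for (const phase of phases) {
      await sleep(2); // stands in for awaiting the subagent that runs `phase`
    }
    log.push({ round, phases });
  }
  harness.done = true; // tells the observer to stop ticking
}

// Observer: wakes on a timer and rewrites the harness. Because both loops
// share one event loop, each write lands between the driver's awaits.
async function metaLoop(harness, tickMs, edit) {
  while (!harness.done) {
    await sleep(tickMs);
    if (!harness.done) edit(harness);
  }
}

async function main() {
  const harness = {
    phases: ["orient", "scan", "brief", "fan-out", "collect"],
    width: 2,
    done: false,
  };
  const log = [];
  await Promise.all([
    optimizeLoop(harness, 5, log),
    metaLoop(harness, 5, (h) => {
      // Hypothetical edit: add a verify phase once.
      if (!h.phases.includes("verify")) h.phases = [...h.phases, "verify"];
    }),
  ]);
  return log;
}
```

Calling `main()` returns a log in which the first round ran the original phases and later rounds picked up the edited list. The per-round snapshot is a design choice in this sketch: it keeps an edit from changing a round that is already in flight.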


what the meta can do

On each tick it observes the tree, the scores, the live logs, and GPU and host state (strictly read-only), then emits four kinds of output:

harness edits: the real lever. A run surfaces needs specific to it: this experiment class wants its own verifier step, that one needs a particular method injected, another step turns out to be dead weight and should be cut. The meta adapts the workflow to fit, injecting steps, removing them, and rewriting the phases that run. Edits take effect on the next round. The loop's shape becomes data the system changes to match what the run actually needs.
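If the loop's shape is data, a harness edit can be a small, validated transformation of that data. The sketch below assumes a harness with a `steps` array and three hypothetical edit operations (`inject`, `remove`, `rewrite`); evo's real edit format may look different.

```javascript
// Hypothetical edit shapes:
//   { op: "inject", after: "run", step: { name: "extra-verify", prompt: "..." } }
//   { op: "remove", name: "ideate" }
//   { op: "rewrite", name: "brief", prompt: "..." }
// Returns a new harness and leaves the input untouched, so the driver can
// swap it in at a round boundary.
function applyEdit(harness, edit) {
  const steps = [...harness.steps];
  const indexOf = (name) => {
    const i = steps.findIndex((s) => s.name === name);
    if (i === -1) throw new Error(`unknown step: ${name}`);
    return i;
  };
  switch (edit.op) {
    case "inject":
      steps.splice(indexOf(edit.after) + 1, 0, edit.step);
      break;
    case "remove":
      steps.splice(indexOf(edit.name), 1);
      break;
    case "rewrite": {
      const i = indexOf(edit.name);
      steps[i] = { ...steps[i], prompt: edit.prompt };
      break;
    }
    default:
      throw new Error(`unknown edit op: ${edit.op}`);
  }
  return { ...harness, steps };
}
```

Rejecting unknown ops and unknown step names keeps a malformed edit from silently reshaping the loop.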

brief hints: softer; queued into the next round's brief to nudge what it tries next.

stops: when an experiment is going nowhere the meta doesn't kill it. It hands a recommendation to a separate gated enforcer that verifies, aborts, diagnoses, and discards. Detect and act stay separate; never a silent kill.
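The split between detecting and acting can be sketched as two functions: one that only builds a recommendation, and an enforcer that re-checks it before doing anything. The record shapes, the `verify` predicate, and the `roundsWithoutGain` field are hypothetical.

```javascript
// The meta side: produces a recommendation and nothing else.
function recommendStop(experimentId, reason) {
  return { kind: "stop", experimentId, reason };
}

// The enforcer side: verify, abort, diagnose, discard, and log every outcome.
// `verify` is an independent check supplied by the caller (an assumption here).
function enforceStop(rec, experiments, verify, auditLog) {
  const exp = experiments.get(rec.experimentId);
  // Gate: decline unless the experiment exists, is running, and the
  // independent check agrees with the recommendation.
  if (!exp || exp.status !== "running" || !verify(exp)) {
    auditLog.push({ experimentId: rec.experimentId, action: "declined", reason: rec.reason });
    return false;
  }
  exp.status = "aborted"; // abort
  const diagnosis =
    `stopped after ${exp.roundsWithoutGain} rounds without gain: ${rec.reason}`; // diagnose
  experiments.delete(exp.id); // discard
  auditLog.push({ experimentId: exp.id, action: "aborted", diagnosis });
  return true;
}
```

Every path writes to `auditLog`, including declined recommendations, which is one way to guarantee a stop is never silent.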

alerts: runtime problems it can't fix itself (e.g. a dying GPU) go to a human.
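One way to picture the four output kinds is as tagged records routed to four separate destinations. The `kind` strings and sink names below are hypothetical, chosen only to mirror the list above.

```javascript
// Hypothetical router for the outputs of one meta tick.
function routeTick(outputs, sinks) {
  for (const out of outputs) {
    switch (out.kind) {
      case "harness_edit":
        sinks.pendingEdits.push(out); // applied at the next round boundary
        break;
      case "brief_hint":
        sinks.hintQueue.push(out.text); // folded into the next round's brief
        break;
      case "stop":
        sinks.enforcerInbox.push(out); // recommendation only; the enforcer decides
        break;
      case "alert":
        sinks.humanAlerts.push(out); // surfaced to a person
        break;
      default:
        throw new Error(`unknown meta output kind: ${out.kind}`);
    }
  }
  return sinks;
}
```

Note that no branch acts on an experiment directly: the meta's outputs are all queued for something else to apply.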

We have found that having an external observer, a meta agent that watches the experiments and nudges the loop, is very effective at course-correcting and catching issues.

takeaways

Dynamic workflows make coordination code instead of context. What that buys you: the loop becomes a first-class object, something you can read, edit, and reason about while it runs, instead of a harness you write once and hope fits every round. The loop's own shape is one more parameter space that can be evolved.

it's all open

evo is open source. You can go through our dynamic workflow implementation here
