EsotericPapers

CONCEPT

Self-Improving Eval Loops

AI agents report their own failures in structured form. Claude Code reads the reports and applies fixes automatically. The human sets the goal; the system figures out how to get there.

THE TRADITIONAL WAY

Most AI development is manual:

  1. Run tests
  2. Read failures
  3. Fix code manually
  4. Repeat

This works but doesn't scale. For Food Science AI, I needed to run hundreds of eval iterations to get TACE scores from 30-60 down to 0-25. Manual iteration would take months.

THE INNOVATION

Invert the loop. Make the AI debug itself.

THE AUTONOMOUS LOOP

  01. RUN EVAL: Execute test harness on golden datasets
  02. COLLECT: Agents report why they failed
  03. ANALYZE: Claude reads failure patterns
  04. FIX: Apply targeted improvements
  05. REPEAT: Loop until metrics converge

Powered by Claude Code + Agent Focus · ∞ iterations
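A minimal sketch of the loop in TypeScript. The helper names (runEval, analyze, applyFixes) and the report shapes are illustrative assumptions, not the actual harness API:

```typescript
// Hypothetical shapes for the data flowing through the loop.
interface FailureReport {
  agent: string;   // which agent failed, e.g. the reverse-engineering agent
  reason: string;  // why it failed, e.g. "mathBaseline not wired to prompt"
}

interface EvalResult {
  tace: number;              // aggregate TACE score for the run
  failures: FailureReport[]; // structured reports collected from the agents
}

interface Fix {
  target: "tool" | "prompt" | "context";
  description: string;
}

// Hypothetical harness entry points; the real project's API is not shown here.
declare function runEval(dataset: string): Promise<EvalResult>;  // 01 RUN EVAL
declare function analyze(failures: FailureReport[]): Fix[];      // 02 COLLECT + 03 ANALYZE
declare function applyFixes(fixes: Fix[]): Promise<void>;        // 04 FIX

// 05 REPEAT: loop until the metric converges on the goal.
async function optimizeUntilConverged(dataset: string, goal: number): Promise<EvalResult> {
  let result = await runEval(dataset);
  while (result.tace > goal) {
    const fixes = analyze(result.failures);
    await applyFixes(fixes);
    result = await runEval(dataset); // rerun on the same golden dataset
  }
  return result;
}
```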

HOW IT WORKS

Agents Report Failures

The reverse-engineering agent doesn't just fail. It reports WHY it failed, in structured form: "mathBaseline not wired to prompt." "Missing ingredient nutrient data for quinoa." Specific, actionable feedback.
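A sketch of what such a report might look like, using the two reasons quoted above. The field names and case ids are assumptions, not the agents' actual output schema:

```typescript
// Hypothetical report shape; only the reason strings come from the text.
const failureReports = [
  {
    agent: "reverse-engineering",
    caseId: "golden-dataset-case-17",   // illustrative id
    reason: "mathBaseline not wired to prompt",
  },
  {
    agent: "reverse-engineering",
    caseId: "golden-dataset-case-23",   // illustrative id
    reason: "Missing ingredient nutrient data for quinoa",
  },
];
```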

Claude Code Drives the Harness

Claude Code runs the eval harness, collects agent feedback, analyzes patterns, applies fixes to tools/prompts/context, reruns evals. All autonomous. The human sets the goal (TACE < 10), the system iterates until it achieves it.
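One way to express that goal is as a small config the loop checks after every run. Only the TACE < 10 target comes from the text; the shape is an assumption:

```typescript
// Illustrative goal config; only the TACE < 10 target is from the text.
const goal = {
  metric: "TACE",
  threshold: 10,                                       // stop once TACE drops below this
  fixTargets: ["tool", "prompt", "context"] as const,  // where fixes may be applied
};

// Checked after every eval run.
function converged(tace: number): boolean {
  return tace < goal.threshold;
}
```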

Multi-Day Optimization

Context windows fill up. Use Agent Focus to hand off at 150k tokens. The next session resumes the optimization loop with refined context. The loop can run for days across dozens of handoffs.
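A sketch of the handoff trigger. Only the 150k-token threshold comes from the text; the handoff-note shape and the Agent Focus mechanics are assumptions:

```typescript
// Context-budget check for multi-day runs.
const HANDOFF_TOKEN_BUDGET = 150_000; // threshold from the text

// Hypothetical note passed to the next session so it can resume the loop.
interface HandoffNote {
  bestTace: number;       // best score reached so far
  pendingFixes: string[]; // fixes identified but not yet applied
  resumeStep: string;     // where the next session should pick up
}

function shouldHandOff(tokensUsed: number): boolean {
  return tokensUsed >= HANDOFF_TOKEN_BUDGET;
}
```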

Real Example

Session 04: Claude runs evals:golden3 → TACE 35pp. Analyzes feedback: "mathBaseline not wired to prompt." Implements fix. Reruns evals → TACE 12pp. One iteration, 66% improvement. Repeated across dozens of sessions until production-ready.
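The 66% figure is the relative drop between the two scores:

```typescript
// Relative improvement implied by the Session 04 numbers above.
const before = 35; // TACE before the fix (percentage points)
const after = 12;  // TACE after the fix
const improvement = (before - after) / before;                 // ≈ 0.657
console.log(`${Math.round(improvement * 100)}% improvement`);  // "66% improvement"
```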

WHY IT MATTERS

This is meta-AI development: AI that debugs and improves AI systems. The human provides the reward function (the TACE score); the system optimizes autonomously.

Most AI development requires constant human intervention. Self-improving eval loops enable AI systems to optimize themselves across days or weeks, with humans setting goals rather than implementing fixes.