CONCEPT
AI agents report their own failures in structured form. Claude Code reads the reports and applies fixes automatically. The human sets the goal; the system figures out how to get there.
Most AI development is manual: run an eval, read the output, guess a fix, rerun, repeat.
This works but doesn't scale. For Food Science AI, I needed to run hundreds of eval iterations to get TACE scores from 30-60 down to 0-25. Manual iteration would take months.
Invert the loop. Make the AI debug itself.
THE AUTONOMOUS LOOP
1. RUN EVAL: execute the test harness on golden datasets
2. COLLECT: agents report why they failed
3. ANALYZE: Claude reads the failure patterns
4. FIX: apply targeted improvements
5. REPEAT: loop until the metrics converge
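A minimal sketch of that loop in TypeScript. Every name here (runEvals, analyzePatterns, applyFixes, the report fields) is a hypothetical stand-in, not the project's real API:

```typescript
// All names are hypothetical sketches, not the project's real API.
interface FailureReport {
  agent: string;   // which agent failed
  reason: string;  // machine-readable failure class
  detail: string;  // human-readable explanation
}

interface EvalResult {
  taceScore: number;          // lower is better
  failures: FailureReport[];  // structured agent feedback
}

declare function runEvals(dataset: string): Promise<EvalResult>;
declare function analyzePatterns(reports: FailureReport[]): string[];
declare function applyFixes(patterns: string[]): Promise<void>;

async function optimizationLoop(targetTace: number): Promise<void> {
  let result = await runEvals("golden3");               // 1. RUN EVAL
  while (result.taceScore > targetTace) {               // 5. REPEAT until converged
    const patterns = analyzePatterns(result.failures);  // 2. COLLECT + 3. ANALYZE
    await applyFixes(patterns);                         // 4. FIX tools/prompts/context
    result = await runEvals("golden3");                 // rerun to measure the effect
  }
}
```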
The reverse engineering agent doesn't just fail. It reports WHY it failed in structured form. "mathBaseline not wired to prompt." "Missing ingredient nutrient data for quinoa." Specific, actionable feedback.
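A sketch of what one such report might look like; the field names are illustrative assumptions, not the project's actual schema:

```typescript
// Hypothetical report shape; field names are assumptions.
const report = {
  agent: "reverse-engineering",
  reason: "missing_input_data",
  detail: "Missing ingredient nutrient data for quinoa",
  suggestedFix: "Add quinoa to the ingredient nutrient table",
};
```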
Claude Code runs the eval harness, collects agent feedback, analyzes patterns, applies fixes to tools/prompts/context, and reruns the evals. All autonomous. The human sets the goal (TACE < 10); the system iterates until it achieves it.
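Under the sketch above, setting the goal is the human's entire contribution:

```typescript
// One call sets the reward threshold; assumes an ES module context
// where top-level await is available.
await optimizationLoop(10); // iterate until TACE < 10
```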
Context windows fill up. Use Agent Focus to hand off at 150k tokens. Next session resumes the optimization loop with refined context. Could run for days across dozens of handoffs.
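A minimal sketch of such a handoff trigger, assuming a hypothetical token counter and a state file the next session reads on startup; the real Agent Focus mechanics may differ:

```typescript
import { writeFileSync } from "node:fs";

// Hypothetical trigger; the actual Agent Focus flow may differ.
const HANDOFF_THRESHOLD = 150_000; // tokens: hand off before the window fills

function maybeHandOff(tokensUsed: number, loopState: object): boolean {
  if (tokensUsed < HANDOFF_THRESHOLD) return false;
  // Persist the refined context so the next session resumes mid-loop.
  writeFileSync("handoff.json", JSON.stringify(loopState, null, 2));
  return true;
}
```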
Session 04: Claude runs evals:golden3 → TACE 35pp. Analyzes feedback: "mathBaseline not wired to prompt." Implements fix. Reruns evals → TACE 12pp. One iteration, 66% improvement. Repeated across dozens of sessions until production-ready.
This is meta-AI development: AI that debugs and improves AI systems. The human provides the reward function (TACE score); the system optimizes autonomously.
Most AI development requires constant human intervention. Self-improving eval loops enable AI systems to optimize themselves across days or weeks, with humans setting goals rather than implementing fixes.