FOOD SCIENCE × AI
AI agent system for food scientists: formula generation, ingredient modification, reverse engineering from nutrition labels. Built a context engineering observability system that lets AI debug and improve itself.
ROLE
Lead AI Engineer
YEAR
2023-2025
STACK
Next.js · Python · OpenAI Agents SDK
STATUS
In Production
Reverse engineering a food formula from a nutrition label is a systematic process of narrowing an infinite solution space. Each iteration eliminates impossible combinations based on macros, nutrients, ingredient constraints, and category-specific rules.
Food scientists do this through a combination of linear programming, domain knowledge, and intuition built over years. The question: Can an AI agent learn this process well enough to do it reliably?
Even ChatGPT Pro ($200/month) gets confused by this task in predictable ways. This wasn't a prompting problem. It was a context engineering problem.
This project evolved across two phases as AI capabilities expanded.
PHASE 1
Built a system for 30+ page research reports (before deep research products existed). GPT-4 Mini acted as orchestrator and summarizer; Claude did the actual writing. Rolling summaries kept Claude on track across sections without repetition or context loss.
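A minimal sketch of the rolling-summary pattern, with hypothetical call_writer and call_summarizer parameters standing in for the two model APIs:

```python
# Sketch of the rolling-summary orchestration (illustrative only).
# call_writer / call_summarizer are hypothetical wrappers around the
# writing model (Claude) and the cheap orchestrator model.

def write_report(outline: list[str], call_writer, call_summarizer) -> str:
    sections: list[str] = []
    rolling_summary = ""  # compressed memory of everything written so far

    for heading in outline:
        # The writer sees only the next heading plus the rolling summary,
        # never the full text of prior sections, so context stays small.
        section = call_writer(
            f"Write the section '{heading}'.\n"
            f"Summary of the report so far:\n{rolling_summary}\n"
            "Do not repeat material already covered."
        )
        sections.append(section)

        # The orchestrator folds the new section into the summary,
        # keeping the writer on track across 30+ pages.
        rolling_summary = call_summarizer(
            f"Update this running summary:\n{rolling_summary}\n"
            f"with this new section:\n{section}"
        )

    return "\n\n".join(sections)
```

The writer never sees full prior sections, only the compressed summary, which is what kept long reports coherent inside a small context window.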
Worked directly with food scientists to deconstruct their reverse engineering process. Learned the math (linear programming, nutrient calculations), the heuristics, the category-specific rules. This became the foundation for Phase 2.
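The linear programming at the core of the process is worth seeing concretely. A toy sketch with made-up nutrient values, using the standard auxiliary-variable trick to minimize total absolute deviation from the label:

```python
# Toy version of the core LP: find ingredient proportions x that best
# match the label's per-gram nutrient panel, subject to the label-order
# rule (ingredients are listed by descending weight). Nutrient values
# are made up for illustration.
import numpy as np
from scipy.optimize import linprog

N = np.array([
    [0.10, 0.01, 0.75],   # wheat flour: protein, fat, carbs per gram
    [0.00, 1.00, 0.00],   # vegetable oil
    [0.00, 0.00, 1.00],   # sugar
]).T                      # rows = nutrients, cols = ingredients

label = np.array([0.06, 0.12, 0.55])  # target per-gram composition
m, n = N.shape

# Decision variables: [x (proportions), t (per-nutrient absolute error)].
# Minimizing sum(t) with |N @ x - label| <= t linearizes the L1 error.
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[N, -np.eye(m)], [-N, -np.eye(m)]])
b_ub = np.concatenate([label, -label])

# Label order: x1 >= x2 >= x3, i.e. x_{i+1} - x_i <= 0.
order = np.zeros((n - 1, n + m))
for i in range(n - 1):
    order[i, i], order[i, i + 1] = -1.0, 1.0
A_ub = np.vstack([A_ub, order])
b_ub = np.concatenate([b_ub, np.zeros(n - 1)])

A_eq = np.concatenate([np.ones(n), np.zeros(m)])[None, :]  # sum(x) = 1
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0])
print(res.x[:n])  # estimated proportions; residual error hints at
                  # unmodeled mass such as moisture
```

Each constraint that binds eliminates a slice of the solution space, which is exactly the iteration the scientists described.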
PHASE 2
Embedding-based retrieval system for matching ingredients to USDA database entries. Handles fuzzy matching, brand variations, and composite ingredients.
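A sketch of the matching step, with a placeholder in-memory database and an assumed embedding model name:

```python
# Illustrative sketch of the matcher: embed a label ingredient and the
# USDA candidate descriptions, then rank by cosine similarity. The model
# name and the tiny in-memory "database" are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()
usda_entries = [
    "Wheat flour, white, all-purpose, enriched",
    "Oil, canola",
    "Sugars, granulated",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

db = embed(usda_entries)
db /= np.linalg.norm(db, axis=1, keepdims=True)

def match(ingredient: str, top_k: int = 1) -> list[str]:
    q = embed([ingredient])[0]
    q /= np.linalg.norm(q)
    scores = db @ q                          # cosine similarity
    best = np.argsort(scores)[::-1][:top_k]
    return [usda_entries[i] for i in best]

print(match("enriched bleached flour"))     # fuzzy match despite wording
```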
Deconstructed the entire reverse-from-label process into discrete tools and agents. Math operations, nutrient validation, ingredient lookup, substitution checking. Each became a callable tool. The orchestrating agent dynamically adjusts its approach based on label complexity.
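A sketch of the decomposition in the shape of the OpenAI Agents SDK; the tool bodies here are stubs standing in for the real math, validation, and lookup logic:

```python
# Sketch in the shape of the OpenAI Agents SDK; tool bodies are stubs.
from agents import Agent, Runner, function_tool

@function_tool
def lookup_ingredient(name: str) -> str:
    """Return per-gram nutrient data for the closest USDA match."""
    return '{"match": "wheat flour", "protein": 0.10, "fat": 0.01}'  # stub

@function_tool
def validate_nutrients(formula: str, label: str) -> str:
    """Recompute the panel for a candidate formula; report deviations."""
    return "protein +0.4g, fat -0.1g, carbs +1.2g"  # stub

@function_tool
def check_substitution(ingredient: str, category: str) -> str:
    """Apply category-specific rules to a proposed ingredient swap."""
    return "allowed"  # stub

orchestrator = Agent(
    name="reverse_engineer",
    instructions=(
        "Given a nutrition label, propose a candidate formula, validate "
        "it with the tools, and iterate until the deviation is within "
        "tolerance."
    ),
    tools=[lookup_ingredient, validate_nutrients, check_substitution],
)

result = Runner.run_sync(orchestrator, "Label: protein 6g, fat 12g, carbs 55g")
print(result.final_output)
```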
Standard LLM observability platforms (Braintrust, LangSmith, Helicone) only answer "how did we ask?" They're prompt engineering tools. For complex agent systems, you need to answer "what did the LLM know?" What data reached each agent? Was it accurate? What's missing from the pipeline? I built a harness that provides observability over the context itself, not just the prompts.
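A minimal sketch of what context-level observability can look like; ContextTrace is a hypothetical store, not a real library. The point is recording what each agent actually received, with provenance, rather than the prompt template:

```python
# Minimal sketch of context-level observability; ContextTrace is a
# hypothetical store. Log what each agent actually received, with
# provenance, not just how it was prompted.
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class ContextSnapshot:
    agent: str
    step: int
    inputs: dict          # the exact data handed to the agent
    sources: list         # where each piece of context came from
    timestamp: float = field(default_factory=time.time)

class ContextTrace:
    def __init__(self, path: str):
        self.path = path

    def record(self, snap: ContextSnapshot) -> None:
        # Append-only JSONL, so a later pass (human or Claude) can diff
        # what an agent knew against what it should have known.
        with open(self.path, "a") as f:
            f.write(json.dumps(asdict(snap)) + "\n")

trace = ContextTrace("run_042.jsonl")
trace.record(ContextSnapshot(
    agent="nutrient_validator", step=3,
    inputs={"formula": {"flour": 0.6}, "label": {"protein": 6.0}},
    sources=["usda_lookup", "label_parser"],
))
```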
TACE (total absolute composition error) and MAPE (mean absolute percentage error) metrics against golden datasets with known-good USDA formulas. Reward function: TACE under 10. This let me measure whether context improvements actually worked, and iterate systematically instead of guessing.
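A sketch of the two metrics. TACE is implemented here at face value as the sum of absolute differences between predicted and golden composition percentages; the production definition may differ in detail:

```python
# Sketch of the eval metrics; TACE definition assumed from its name.
import numpy as np

def tace(pred: dict, gold: dict) -> float:
    """Total absolute composition error, in percentage points."""
    keys = set(pred) | set(gold)
    return sum(abs(pred.get(k, 0.0) - gold.get(k, 0.0)) for k in keys)

def mape(pred: np.ndarray, gold: np.ndarray) -> float:
    """Mean absolute percentage error over the nutrient panel."""
    gold = np.where(gold == 0, 1e-9, gold)  # avoid division by zero
    return float(np.mean(np.abs((pred - gold) / gold)) * 100)

gold = {"flour": 55.0, "sugar": 25.0, "oil": 20.0}
pred = {"flour": 52.0, "sugar": 28.0, "oil": 20.0}
assert tace(pred, gold) == 6.0              # under the reward bar of 10
```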
The breakthrough: Claude Code drives the eval harness. Agents report back why they failed; Claude reads the feedback, applies fixes, and reruns the tests. The system improves itself (sketched in code after the diagram below). I could run this loop for hours, across multiple sessions, using Agent Focus for context handoffs.
SELF-IMPROVING DEVELOPMENT LOOP
RUN EVAL
Execute test harness on golden datasets
COLLECT FEEDBACK
Agents report why they failed
ANALYZE
Claude reads failure patterns
FIX
Apply targeted improvements
REPEAT
Loop until metrics converge
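The loop above, as a sketch; run_evals and analyze_and_fix are stand-ins for the real harness and for Claude Code's analyze-and-edit step:

```python
# Shape of the self-improving loop. The callables are stand-ins: in the
# real system, run_evals is the harness over the golden datasets and
# analyze_and_fix is Claude Code reading agent failure reports and
# editing the codebase.
from typing import Callable

def improvement_loop(
    run_evals: Callable[[], tuple],        # -> (mean TACE, failure reports)
    analyze_and_fix: Callable[[list], None],
    target_tace: float = 10.0,             # the reward bar: TACE under 10
    max_rounds: int = 50,
) -> float:
    mean_tace = float("inf")
    for _ in range(max_rounds):
        mean_tace, failures = run_evals()  # RUN EVAL + COLLECT FEEDBACK
        if mean_tace < target_tace:
            break                          # metrics converged
        analyze_and_fix(failures)          # ANALYZE + FIX, then REPEAT
    return mean_tace
```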
30→5
TACE SCORE
Average error reduction
20+
AGENT COMPONENTS
Tools, validators, orchestrators
∞
SELF-IMPROVING
Autonomous optimization loop
TACE scores dropped from 30-60 (unreliable) to 0-25 (production-ready, even on adversarial labels). The system is now in production, handling real reverse engineering requests.
The leverage hierarchy: Context Engineering > Prompt Engineering > Math Tweaks. For complex AI applications, you can't fix problems by editing prompts in a UI. You need observability over the entire context pipeline: what data reaches each agent, whether it's accurate, what's missing.
This system doesn't replace food scientists. It amplifies them. Domain experts can now iterate on formulations faster, validate nutrition data automatically, and explore possibilities that would take days to compute manually.
The self-improving development loop is the meta-innovation: AI that can debug and improve itself across multi-day development cycles, with humans setting goals rather than writing fixes.