
Understanding the Challenges in Iterative Generative Optimization with LLMs

Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, Ching-An Cheng

Abstract

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows, or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice it remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make "hidden" design choices: What can the optimizer edit, and what is the "right" learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and the batching of trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard (BBEH), we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.
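To make the three design choices concrete, the following is a minimal toy sketch of a learning loop, not the paper's implementation. It assumes a numeric "artifact" and a random-perturbation stand-in for the LLM optimizer; `run_trial`, `build_evidence`, `propose_edit`, and `learning_loop` are hypothetical names introduced only for illustration. The starting artifact, the credit horizon, and the batch size appear as explicit parameters.

```python
import random
from typing import List, Tuple

Trace = List[Tuple[int, float]]  # (step index, reward) pairs from one trial


def run_trial(artifact: float, steps: int, rng: random.Random) -> Trace:
    """Hypothetical environment: reward peaks when the artifact is near 3.0."""
    return [(t, -(artifact - 3.0) ** 2 + rng.gauss(0, 0.01)) for t in range(steps)]


def build_evidence(traces: List[Trace], credit_horizon: int) -> List[Trace]:
    """Keep only the last `credit_horizon` steps of each trial."""
    return [trace[-credit_horizon:] for trace in traces]


def mean_reward(evidence: List[Trace]) -> float:
    rewards = [r for trace in evidence for _, r in trace]
    return sum(rewards) / len(rewards)


def propose_edit(artifact: float, rng: random.Random) -> float:
    """Stand-in for an LLM optimizer (assumption): a random perturbation."""
    return artifact + rng.uniform(-0.5, 0.5)


def learning_loop(start_artifact: float, credit_horizon: int, batch_size: int,
                  iterations: int = 50, seed: int = 0) -> Tuple[float, float]:
    rng = random.Random(seed)

    def evaluate(candidate: float) -> float:
        # Batch several trials into one piece of learning evidence.
        traces = [run_trial(candidate, steps=10, rng=rng) for _ in range(batch_size)]
        return mean_reward(build_evidence(traces, credit_horizon))

    artifact, best = start_artifact, evaluate(start_artifact)
    for _ in range(iterations):
        candidate = propose_edit(artifact, rng)
        score = evaluate(candidate)
        if score > best:  # accept only edits that the evidence supports
            artifact, best = candidate, score
    return artifact, best


artifact, score = learning_loop(start_artifact=0.0, credit_horizon=1, batch_size=4)
```

In a real setup the artifact would be code or a prompt and `propose_edit` would call an LLM with the evidence as context; the sketch only shows where the three decisions enter the loop.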

Paper Structure (78 sections, 6 equations, 29 figures, 10 tables)

Figures (29)

  • Figure 1: The learning loop of generative optimization.
  • Figure 2: Three Key Decisions for Implementing a Learning Loop. To set up iterative generative optimization, an agent engineer must make three core decisions. For the initial system: What (1) starting artifacts (instructions, files, specs) to provide? Different initializations can lead to different solution spaces. For the learning context: What learning evidence to provide to the LLM optimizer: (2) how many steps to include per trial (credit horizon) and (3) how many trials to batch together (experience batching).
  • Figure 3: Different Starting Artifacts for MLAgentBench. We compare two initialization schemes for the ML pipeline creation task. Left: One-function approach where the LLM optimizer implements and modifies a single train_model function that handles the entire pipeline from data ingestion to prediction. Right: Many-function approach where the pipeline is decomposed into modular components. Both initializations contain equivalent information in their docstrings; the only difference is the level of modularization.
  • Figure 4: Kaggle leaderboard performance, reported as the percentile of the trained ML model submissions (higher is better).
  • Figure 5: Credit Horizon Comparison Across Games. Performance of agents optimized with one-step (immediate reward) vs. multi-step (full rollout) credit horizons across 5 trials. Both setups use agents with the same starting artifacts. Observing full execution traces (multi-step) helps discover better code in only 4 out of 8 games, suggesting that the credit horizon is a design choice that can be tuned to each task.
  • ...and 24 more figures