AutoHarness: improving LLM agents by automatically synthesizing a code harness

Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy

TL;DR

We demonstrate that Gemini-2.5-Flash can automatically synthesize a code harness through a small number of rounds of iterative code refinement, guided by feedback from the (game) environment. This enables the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro and GPT-5.2-High.

Abstract

Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. People often manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision-making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost-effective.
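
To make the idea of a harness concrete, here is a minimal Python sketch of a legality filter wrapped around an unreliable move proposer. It is not the paper's implementation: the toy take-away game, the names `is_legal_action`, `noisy_proposer`, and `harnessed_action`, and the retry-then-fallback strategy are all assumptions chosen for illustration, with a random proposer standing in for the LLM.

```python
import random
from typing import Callable

State = int   # stones remaining in a hypothetical take-away game
Action = int  # number of stones the player wants to remove


def is_legal_action(state: State, action: Action) -> bool:
    """Toy rule (assumed): remove 1 to 3 stones, never more than remain."""
    return 1 <= action <= min(3, state)


def noisy_proposer(state: State, rng: random.Random) -> Action:
    """Stand-in for an LLM policy; it ignores the rules and is often illegal."""
    return rng.randint(0, 5)


def harnessed_action(
    state: State,
    proposer: Callable[[State, random.Random], Action],
    rng: random.Random,
    max_tries: int = 20,
) -> Action:
    """Return a legal action for a non-terminal state (state >= 1).

    The harness re-queries the proposer until the legality check passes,
    and falls back to the smallest legal move if it never does.
    """
    for _ in range(max_tries):
        action = proposer(state, rng)
        if is_legal_action(state, action):
            return action
    return 1  # always legal when state >= 1


if __name__ == "__main__":
    rng = random.Random(0)
    state = 7
    while state > 0:
        move = harnessed_action(state, noisy_proposer, rng)
        state -= move
        print(f"removed {move}, {state} left")
```

In this sketch the environment never sees an illegal move, whatever the proposer emits; the quality of the move still depends entirely on the proposer.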

Paper Structure (27 sections, 7 figures)

Figures (7)

  • Figure 1: Code-as-harness learning process.
  • Figure 2: Fraction of legal moves vs number of code refinements for a selection of 6 games.
  • Figure 3: Win/lose/draw rate of our method vs Gemini-2.5-Pro for each of the 16 2P games.
  • Figure 4: Average reward of our method and Gemini-2.5-Pro for each of the 16 1P games.
  • Figure 5: Average reward of different agents across 16 TextArena 1P games.
  • ...and 2 more figures
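
The relationship named in the Figure 2 caption, legal-move fraction as a function of the number of code refinements, can be illustrated with a small Python sketch. Everything here is hypothetical: the toy environment rule `env_accepts`, the hand-written `CANDIDATES` list standing in for successive LLM code revisions, and the particular way the fraction is measured are assumptions, not the paper's procedure.

```python
from typing import Callable, List, Tuple

Checker = Callable[[int, int], bool]  # (state, action) -> harness verdict


def env_accepts(state: int, action: int) -> bool:
    """Ground-truth rule of a hypothetical game: take 1 to 3 stones."""
    return 1 <= action <= min(3, state)


# Hand-written stand-ins for successive harness revisions (assumed, not
# generated by a model); each one fixes a flaw exposed by the previous round.
CANDIDATES: List[Checker] = [
    lambda s, a: True,                 # round 0: lets everything through
    lambda s, a: 1 <= a <= 3,          # round 1: ignores stones remaining
    lambda s, a: 1 <= a <= min(3, s),  # round 2: matches the environment
]


def legal_fraction(checker: Checker, probes: List[Tuple[int, int]]) -> float:
    """Fraction of harness-approved probes that the environment accepts."""
    passed = [(s, a) for s, a in probes if checker(s, a)]
    if not passed:
        return 0.0
    return sum(env_accepts(s, a) for s, a in passed) / len(passed)


def refine(candidates: List[Checker], probes: List[Tuple[int, int]]) -> List[float]:
    """Evaluate revisions in order; stop once no illegal move gets through."""
    history = []
    for checker in candidates:
        frac = legal_fraction(checker, probes)
        history.append(frac)
        if frac == 1.0:
            break
    return history


# Probe every (state, action) pair on a small grid.
PROBES = [(s, a) for s in range(1, 6) for a in range(0, 6)]

if __name__ == "__main__":
    for round_idx, frac in enumerate(refine(CANDIDATES, PROBES)):
        print(f"refinement {round_idx}: legal fraction = {frac:.2f}")
```

In a real setting the next candidate would be produced by prompting the model with the illegal moves the environment rejected, rather than read from a fixed list.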