Language Models can Self-Improve at State-Value Estimation for Better Search

Ethan Mendes, Alan Ritter

TL;DR

This paper tackles the cost and impracticality of collecting ground-truth rewards for multi-step reasoning. It introduces Self-Taught Lookahead (STL), a reward-free framework in which a value LLM learns to simulate one step of lookahead in natural language: it predicts the next action, the resulting state, and a rationale for that state's value. The lookahead values and rationales gathered during tree search then serve as fine-tuning data for the next iteration of the value model. STL yields more accurate state-value predictions, so search expands fewer states while maintaining strong performance on web tasks, multi-hop QA, and math puzzles, approaching or matching proprietary models with open 8B LLMs. The approach is also efficient in compute (total tokens) and in environment usage (states expanded), suggesting STL as a viable path for deploying capable, resource-conscious agent systems in real-world interactive domains.

Abstract

Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive, particularly in interactive domains such as web tasks. We introduce Self-Taught Lookahead (STL), a reward-free framework that improves language model-based value functions by reasoning explicitly about state transitions. STL can be viewed as a chain-of-thought analogue of the value iteration algorithm: instead of regressing directly on numeric values, a value LLM is trained to simulate a step of lookahead in natural language, predicting the next action, resulting state, and rationale for its value, thereby refining value estimates without any labeled data. This self-supervised procedure yields more accurate state-value predictions, which in turn enable lightweight search algorithms to expand fewer states while maintaining strong performance. Empirically, STL-trained value models built on moderately sized (8B parameter) open-weight LLMs boost web agent success rates by 39%, achieving performance comparable to proprietary models. STL also generalizes to multi-hop QA and math puzzles. We find that STL enables small open-source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning.
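The one-step lookahead described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `build_lookahead_example`, `LookaheadExample`, and the toy policy, transition, and value functions are hypothetical names introduced here, and the text template in `to_text` is an assumption. The sketch shows only the core idea: expand successors of a state, score them with the current value model, and record the best action, successor state, rationale, and discounted value as a textual training example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LookaheadExample:
    """One textual training example (hypothetical structure)."""
    state: str
    action: str
    next_state: str
    rationale: str
    value: float  # successor value, already discounted by gamma

    def to_text(self) -> str:
        # Assumed template; the real prompt format is not given in the source.
        return (
            f"State: {self.state}\n"
            f"Best next action: {self.action}\n"
            f"Resulting state: {self.next_state}\n"
            f"Rationale: {self.rationale}\n"
            f"Value: {self.value:.2f}"
        )


def build_lookahead_example(
    state: str,
    propose_actions: Callable[[str], List[str]],
    transition: Callable[[str, str], str],
    value_model: Callable[[str], Tuple[str, float]],
    gamma: float = 0.9,
) -> Optional[LookaheadExample]:
    """Pick the best successor under the current value model.

    No environment reward is used: the target comes only from the
    current value model's estimate of the best successor state.
    """
    best: Optional[Tuple[str, str, str, float]] = None
    for action in propose_actions(state):
        next_state = transition(state, action)
        rationale, value = value_model(next_state)
        if best is None or value > best[3]:
            best = (action, next_state, rationale, value)
    if best is None:  # no actions available from this state
        return None
    action, next_state, rationale, value = best
    return LookaheadExample(state, action, next_state, rationale, gamma * value)


# Toy stand-ins for the policy, environment, and value model (all hypothetical).
def toy_policy(state: str) -> List[str]:
    return ["a", "b"]


def toy_transition(state: str, action: str) -> str:
    return f"{state}>{action}"


def toy_value_model(state: str) -> Tuple[str, float]:
    if state.endswith("b"):
        return ("this state looks promising", 8.0)
    return ("this state looks unpromising", 2.0)
```

Calling `build_lookahead_example("start", toy_policy, toy_transition, toy_value_model, gamma=0.5)` selects action `"b"` and stores a discounted value of 4.0. In a full loop, examples like this would be collected across states found by tree search and used to fine-tune the next value model.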

Paper Structure

This paper contains 61 sections, 6 equations, 16 figures, 9 tables, 2 algorithms.

Figures (16)

  • Figure 1: The information accessible during learning and inference across common search settings, exemplified using web tasks. Our Self-Taught Lookahead method is Reward and Demo Free, yet is able to self-improve by learning from state transitions in the form of lookahead values and rationales.
  • Figure 2: Self-taught lookahead self-improves the value model by learning from state-transition dynamics. During the data generation phase (top left), tree search is used to discover diverse states. For every observed state $s$ encountered during the search, successor states are expanded using base policy $\pi_\theta$ and the current value model $V_{\phi_k}$, and a textual training example is formed using verbal representations of the next best action and successor state, as well as $V_{\phi_{k}}$'s outputted value reasoning ($r$) and numerical value ($v$) discounted by $\gamma$ (top middle). These examples are used to fine-tune $V_{\phi_{k + 1}}$, which will be used in the next iteration of the algorithm (top right). Value models learned during STL can be used to evaluate unseen states encountered during search on unseen tasks by simulating a step of lookahead, including the next best action and the best successor state $\Tilde{s}'$ (bottom).
  • Figure 3: BFS Game-of-24 performance on tasks seen and unseen during STL.
  • Figure 4: Compute and environmental efficiency during evaluation on WebShop with a gpt-3.5-turbo policy (left). Compute efficiency is measured in total (prompt and completion) tokens. Environmental efficiency is measured by the number of states expanded (webpages visited). The distribution of tokens (closed vs. open source models) used during search is also shown (right). Value models are specified in parentheses.
  • Figure 5: Tradeoff between performance and efficiency on WebShop with a gpt-3.5-turbo policy. Pareto frontiers of existing methods and baselines are shown, illustrating the optimality of STL when considering the tradeoff between inference cost and average reward (left) and between environmental usage and average reward (right). Reward-Guided Inference methods are presented in gold and not included in the Pareto frontier since they belong to a different information setting.
  • ...and 11 more figures