Scaling Autonomous Agents via Automatic Reward Modeling And Planning

Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan

TL;DR

The paper tackles the challenge of enabling LLM-based agents to perform multi-step decision-making in interactive environments without heavy reliance on expensive APIs or labor-intensive labeling. It introduces ARMAP, a framework that automatically learns a task-specific reward signal from environment interactions by generating positive and negative trajectories with LLM navigators and refining task intents, then training a Vision-Language scoring backbone (VILA) to evaluate trajectory satisfaction. This learned reward signal is integrated with planning algorithms (Best-of-N, Reflexion, MCTS) to improve action planning across diverse benchmarks including Webshop, ScienceWorld, and Game of 24, with demonstrated controllable generation via reward-target customization. The results show robust improvements across model sizes, data efficiency, and cross-domain applicability, highlighting ARMAP’s potential to reduce labeling needs and API dependence while enabling flexible, goal-directed autonomous agents.
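As a rough illustration of how a learned reward signal can plug into the simplest of the planners named above, Best-of-N, the sketch below samples several candidate trajectories and keeps the one the reward function scores highest. Everything in it is an assumption made for illustration: `best_of_n`, `make_toy_policy`, and `toy_reward` are hypothetical names, the policy and reward are toy stand-ins rather than an LLM or the trained scoring model, and the paper's actual interfaces may differ.

```python
import itertools
from typing import Callable, List, Sequence

Trajectory = List[str]  # assumption: a trajectory is a list of action strings


def best_of_n(
    instruction: str,
    sample_trajectory: Callable[[str], Trajectory],
    reward_fn: Callable[[str, Trajectory], float],
    n: int = 5,
) -> Trajectory:
    """Sample n candidate trajectories and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_trajectory(instruction) for _ in range(n)]
    return max(candidates, key=lambda traj: reward_fn(instruction, traj))


def make_toy_policy(rollouts: Sequence[Trajectory]) -> Callable[[str], Trajectory]:
    """Hypothetical stand-in for an LLM policy: cycles through fixed rollouts."""
    pool = itertools.cycle(rollouts)
    return lambda instruction: list(next(pool))


def toy_reward(instruction: str, trajectory: Trajectory) -> float:
    """Hypothetical stand-in for the learned reward model.

    Scores a trajectory by the fraction of instruction words that appear
    anywhere in its action text.
    """
    words = instruction.lower().split()
    if not words:
        return 0.0
    text = " ".join(trajectory).lower()
    return sum(word in text for word in words) / len(words)


if __name__ == "__main__":
    policy = make_toy_policy([
        ["search[shoes]", "click[buy]"],
        ["search[red shoes]", "click[size 9]", "click[buy]"],
    ])
    print(best_of_n("red shoes size 9", policy, toy_reward, n=2))
```

In this toy setup the second rollout mentions every word of the instruction, so it is the one selected. Swapping `toy_reward` for a trained scorer leaves `best_of_n` unchanged, which is the sense in which the reward model and the planner are decoupled.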

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text data, collecting large-scale decision-making data is challenging. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address LLM agents' limitations, we propose a framework that can automatically learn a reward model from the environment without human annotations. This model can be used to evaluate the action trajectories of LLM agents and provide heuristics for task planning. Specifically, our approach involves employing one LLM-based agent to navigate an environment randomly, generating diverse action trajectories. Subsequently, a separate LLM is leveraged to assign a task intent and synthesize a negative response alongside the correct response for each trajectory. These triplets (task intent, positive response, and negative response) are then utilized as training data to optimize a reward model capable of scoring action trajectories. The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In conclusion, our proposed framework represents a significant advancement in enhancing LLM agents' decision-making capabilities. By automating the learning of reward models, we overcome the challenges of data scarcity and API limitations, potentially revolutionizing the application of LLMs in complex and interactive environments. This research paves the way for more sophisticated AI agents capable of tackling a wide range of real-world problems requiring multi-step decision-making.
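To make the triplet training signal concrete, the following minimal sketch computes a pairwise ranking loss over (task intent, positive response, negative response) triplets. The abstract does not state the objective, so the Bradley-Terry style loss `-log(sigmoid(r_pos - r_neg))` used here is an assumption, as are the `Triplet` container, the `score_fn` interface, and the toy scorer in the usage example.

```python
import math
from typing import Callable, List, NamedTuple


class Triplet(NamedTuple):
    """Hypothetical container for one training example."""
    intent: str
    positive: List[str]  # trajectory that satisfies the intent
    negative: List[str]  # trajectory that does not


def pairwise_loss(score_pos: float, score_neg: float) -> float:
    """Return -log(sigmoid(score_pos - score_neg)) in a numerically stable way.

    Assumed objective (not confirmed by the abstract): the loss is small when
    the positive trajectory outscores the negative one.
    """
    margin = score_pos - score_neg
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite the expression to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(
    triplets: List[Triplet],
    score_fn: Callable[[str, List[str]], float],
) -> float:
    """Average the pairwise loss over a batch of triplets."""
    if not triplets:
        raise ValueError("need at least one triplet")
    losses = [
        pairwise_loss(score_fn(t.intent, t.positive), score_fn(t.intent, t.negative))
        for t in triplets
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy scorer (assumption): counts actions that mention the intent string.
    def toy_score(intent: str, trajectory: List[str]) -> float:
        return float(sum(intent in action for action in trajectory))

    batch = [Triplet("red", ["search[red shoes]", "click[red]"], ["search[blue shoes]"])]
    print(mean_pairwise_loss(batch, toy_score))
```

In an actual training loop the scores would come from the reward model and the loss would be minimized by gradient descent; the pure-Python version above only shows the quantity being optimized.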

Paper Structure (29 sections, 2 equations, 17 figures, 14 tables)

This paper contains 29 sections, 2 equations, 17 figures, 14 tables.

Figures (17)

  • Figure 1: (a) It is difficult for LLM agents to generate multi-step plans in an interactive environment to achieve the instruction goal. (b) However, it is relatively easy for an LLM to learn a reward model that can evaluate whether the trajectories meet the task instructions. (c) A learned reward model can be used to guide the default policy models to improve action planning.
  • Figure 2: The pipeline of our ARMAP framework. We first generate an initial task instruction using LLMs with in-context learning and sample trajectories aligned with the initial language instructions in the environment. Next, we use the LLM to summarize the sampled trajectories and generate refined task instructions that better match these trajectories. We then modify specific actions within the trajectories to perform new actions in the environment, collecting negative trajectories in the process. Using the refined task instructions, along with both positive and negative trajectories, we train a lightweight reward model to distinguish between matching and non-matching trajectories. The learned reward model can then collaborate with various LLM agents to improve task planning.
  • Figure 3: Two qualitative results on the Webshop task, in which our ARMAP framework corrects errors made by existing methods. In the top example, when the search results do not meet the requirements, ARMAP uses the tree structure to backtrack and search again, thereby retrieving the appropriate target item. In contrast, existing methods fail to backtrack when the target item is not found. In the bottom example, ARMAP evaluates different states in the environment and, when choosing between size and color, selects the color that offers a higher reward and better meets the requirements, rather than mistakenly selecting the wrong size. These two examples illustrate the advantages of our method over traditional approaches.
  • Figure 4: A typical example of customized reward target for shorter trajectory generation. On the left, we show the default greedy decoding generates a long trajectory without finding the target product. In the middle, we show our default reward can guide the LLM agent to generate a correct but long trajectory. On the right, we show our framework with a customized reward target for shorter trajectories, which finds a correct and short trajectory for the target product.
  • Figure 5: Training Data Example for Webshop.
  • ...and 12 more figures