AgentRM: Enhancing Agent Generalization with Reward Modeling

Yu Xia, Jingru Fan, Weize Chen, Siyu Yan, Xin Cong, Zhong Zhang, Yaxi Lu, Yankai Lin, Zhiyuan Liu, Maosong Sun

TL;DR

This work tackles the generalization gap in LLM-based agents by introducing AgentRM, a generalizable reward model that guides policy decisions at test time via search. It systematically compares three reward-modeling paradigms (explicit, implicit, and LLM-as-a-judge) and finds explicit reward modeling to be the most effective across nine tasks, including transfer to larger policy models (e.g., from 8B to 70B). The results show substantial improvements over both general and task-specific baselines, with average gains of 8.8 points and notable cross-task transfer improvements (e.g., 12.6 points on LLaMA-3-70B). The study also analyzes data scaling, robustness to perturbations, state representations, and test-time search scaling, demonstrating the practicality and versatility of a generalizable reward model for agent reasoning and planning.

Abstract

Existing LLM-based agents have achieved strong performance on held-in tasks, but their generalizability to unseen tasks remains poor. Hence, some recent work focuses on fine-tuning the policy model on more diverse tasks to improve generalizability. In this work, we find that finetuning a reward model to guide the policy model is more robust than directly finetuning the policy model. Based on this finding, we propose AgentRM, a generalizable reward model, to guide the policy model for effective test-time search. We comprehensively investigate three approaches to constructing the reward model: explicit reward modeling, implicit reward modeling, and LLM-as-a-judge. We then use AgentRM to guide answer generation with Best-of-N sampling and step-level beam search. On nine agent tasks spanning four types, AgentRM enhances the base policy model by 8.8 points on average, surpassing the top general agent by 4.0. Moreover, it demonstrates weak-to-strong generalization, yielding a greater improvement of 12.6 on the LLaMA-3-70B policy model. As for specializability, AgentRM can also boost a finetuned policy model and outperform the top specialized agent by 11.4 on three held-in tasks. Further analysis verifies its effectiveness in test-time scaling. Code will be released to facilitate research in this area.
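To make the Best-of-N idea in the abstract concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation: `best_of_n`, `sample_trajectory`, and `reward_model` are hypothetical names, and the policy and the reward model are reduced to plain callables. The policy samples N complete trajectories, and the reward model picks the one it scores highest.

```python
from typing import Callable, Tuple


def best_of_n(
    sample_trajectory: Callable[[], str],
    reward_model: Callable[[str], float],
    n: int = 5,
) -> Tuple[str, float]:
    """Sample n candidate trajectories and return the highest-scoring one.

    sample_trajectory: hypothetical stand-in for one rollout of the policy model.
    reward_model: hypothetical stand-in for a trained reward model's scoring call.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates from the policy.
    candidates = [sample_trajectory() for _ in range(n)]
    # Score every candidate, then keep the one with the highest reward.
    scored = [(reward_model(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


if __name__ == "__main__":
    # Toy stand-ins: the "policy" replays fixed strings, the "reward" is length.
    rollouts = iter(["go north", "go north; open door", "wait"])
    trajectory, score = best_of_n(lambda: next(rollouts), lambda t: float(len(t)), n=3)
    print(trajectory, score)
```

The policy weights are never updated in this scheme; only the selection step depends on the reward model, which is why the same reward model can in principle be paired with a different or larger policy.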

Paper Structure

This paper contains 40 sections, 5 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Finetuning the reward model is more robust than finetuning the policy model for agent tasks. (a) Finetuning the policy model leads to severe degradation on held-out tasks. (b) and (c) show the performance of Best-of-5 with a reward model. Finetuning the policy model on one task degrades performance on the others, while finetuning the reward model mostly generalizes to them.
  • Figure 2: Overview. ❶ Deriving a supervised fine-tuned (SFT) agent on expert trajectories. ❷ Constructing search trees by exploring the environment using the SFT agent. ❸ Training a generalizable reward model on state-reward pairs extracted from the search trees. ❹ Enhancing the policy model, regardless of its initial strength, through test-time search guided by our reward model on unseen tasks such as embodied planning, text games, and tool use.
  • Figure 3: Scaling trend of training data.
  • Figure 4: Performance of task-specific RMs on 9 tasks. The red, orange, and blue bars denote RMs trained on Webshop, Alfworld, and Sciworld, respectively. The dashed line denotes the performance of the general RM.
  • Figure 5: Scaling trend of Best-of-N.
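
The test-time search in step ❹ of Figure 2 can be illustrated with a small step-level beam search. This is a sketch under assumptions rather than the paper's code: `propose_actions`, `score_state`, and `is_terminal` are hypothetical callables standing in for the policy model, the reward model, and the environment's termination check, and a state is simplified to the tuple of actions taken so far.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[str, ...]  # a partial trajectory: the actions taken so far


def step_level_beam_search(
    propose_actions: Callable[[State], Sequence[str]],
    score_state: Callable[[State], float],
    is_terminal: Callable[[State], bool],
    beam_width: int = 2,
    max_steps: int = 10,
) -> State:
    """Expand each partial trajectory one step at a time, keeping the top states."""
    beam: List[State] = [()]  # start from the empty trajectory
    for _ in range(max_steps):
        candidates: List[State] = []
        for state in beam:
            if is_terminal(state):
                candidates.append(state)  # carry finished trajectories forward
                continue
            for action in propose_actions(state):
                candidates.append(state + (action,))
        if not candidates:
            break  # nothing could be expanded; keep the current beam
        # Rank the one-step extensions by reward and keep the best beam_width.
        candidates.sort(key=score_state, reverse=True)
        beam = candidates[:beam_width]
        if all(is_terminal(s) for s in beam):
            break
    return max(beam, key=score_state)


if __name__ == "__main__":
    # Toy problem: two actions per step, reward counts "b", episodes last 3 steps.
    result = step_level_beam_search(
        propose_actions=lambda s: ["a", "b"],
        score_state=lambda s: float(s.count("b")),
        is_terminal=lambda s: len(s) == 3,
    )
    print(result)
```

Unlike Best-of-N, which scores only complete trajectories, this variant consults the scorer after every step, so it needs a reward signal that is meaningful on intermediate states.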