AgentRM: Enhancing Agent Generalization with Reward Modeling
Yu Xia, Jingru Fan, Weize Chen, Siyu Yan, Xin Cong, Zhong Zhang, Yaxi Lu, Yankai Lin, Zhiyuan Liu, Maosong Sun
TL;DR
This work tackles the generalization gap in LLM-based agents by introducing AgentRM, a generalizable reward model that guides policy decisions at test time via search. It systematically compares three reward-modeling paradigms (explicit, implicit, and LLM-as-a-judge) and finds explicit reward modeling to be the most effective across nine tasks, including transfer to larger policy models (e.g., from 8B to 70B). The results show improvements over both general and task-specific baselines, with an average gain of $8.8$ points over the base policy model and a weak-to-strong gain of $12.6$ points on the LLaMA-3-70B policy model. The study also analyzes data scaling, robustness to perturbations, state representations, and test-time search scaling, supporting the practicality of a generalizable reward model for agent reasoning and planning.
Abstract
Existing LLM-based agents have achieved strong performance on held-in tasks, but their generalizability to unseen tasks remains poor. Hence, some recent work focuses on fine-tuning the policy model on more diverse tasks to improve generalizability. In this work, we find that fine-tuning a reward model to guide the policy model is more robust than directly fine-tuning the policy model. Based on this finding, we propose AgentRM, a generalizable reward model, to guide the policy model for effective test-time search. We comprehensively investigate three approaches to constructing the reward model: explicit reward modeling, implicit reward modeling, and LLM-as-a-judge. We then use AgentRM to guide answer generation with Best-of-N sampling and step-level beam search. Across nine agent tasks spanning four types, AgentRM improves the base policy model by $8.8$ points on average, surpassing the top general agent by $4.0$. Moreover, it demonstrates weak-to-strong generalization, yielding a greater improvement of $12.6$ on the LLaMA-3-70B policy model. In terms of specialization, AgentRM can also boost a fine-tuned policy model and outperform the top specialized agent by $11.4$ on three held-in tasks. Further analysis verifies its effectiveness in test-time scaling. Code will be released to facilitate research in this area.
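To make the two test-time search strategies named in the abstract concrete, here is a minimal sketch of reward-guided Best-of-N sampling and step-level beam search. It is an illustration under stated assumptions, not the authors' implementation: the names `propose_fn`, `reward_fn`, `is_done`, `toy_propose`, `toy_reward`, and `toy_done` are hypothetical, and a toy string-matching scorer stands in for both the policy model and the reward model.

```python
from typing import Callable, List, Sequence, Tuple

Trajectory = Tuple[str, ...]  # actions taken so far (hypothetical representation)


def best_of_n(candidates: Sequence[Trajectory],
              reward_fn: Callable[[Trajectory], float]) -> Trajectory:
    """Pick the complete trajectory that the reward function scores highest."""
    return max(candidates, key=reward_fn)


def step_level_beam_search(propose_fn: Callable[[Trajectory], List[str]],
                           reward_fn: Callable[[Trajectory], float],
                           is_done: Callable[[Trajectory], bool],
                           beam_width: int = 2,
                           max_steps: int = 10) -> Trajectory:
    """Expand each partial trajectory by one step, keep the top-scoring beams."""
    beams: List[Trajectory] = [()]
    for _ in range(max_steps):
        expanded: List[Trajectory] = []
        for traj in beams:
            if is_done(traj):
                expanded.append(traj)  # carry finished trajectories forward
                continue
            for action in propose_fn(traj):
                expanded.append(traj + (action,))
        if not expanded:
            break  # nothing left to expand; keep the current beams
        # Rank all one-step extensions with the reward function and prune.
        beams = sorted(expanded, key=reward_fn, reverse=True)[:beam_width]
        if all(is_done(t) for t in beams):
            break
    return max(beams, key=reward_fn)


# --- Toy stand-ins (assumptions for illustration only) ---
TARGET = ("a", "b", "c")


def toy_propose(traj: Trajectory) -> List[str]:
    """Stand-in for sampling candidate next actions from a policy model."""
    return ["a", "b", "c"]


def toy_reward(traj: Trajectory) -> float:
    """Stand-in for a reward model: count positions that match TARGET."""
    return float(sum(1 for x, y in zip(traj, TARGET) if x == y))


def toy_done(traj: Trajectory) -> bool:
    """A trajectory is complete once it has as many steps as TARGET."""
    return len(traj) == len(TARGET)


if __name__ == "__main__":
    samples = [("a", "c", "b"), ("a", "b", "c"), ("c", "b", "a")]
    print(best_of_n(samples, toy_reward))
    print(step_level_beam_search(toy_propose, toy_reward, toy_done))
```

In this sketch, Best-of-N scores only complete trajectories, whereas step-level beam search consults the reward function after every action, so low-scoring partial trajectories are pruned early. In a real setting, `propose_fn` would sample actions from the policy model and `reward_fn` would query the trained reward model.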
