Tool-Augmented Reward Modeling

Lei Li; Yekun Chai; Shuohuan Wang; Yu Sun; Hao Tian; Ningyu Zhang; Hua Wu


TL;DR

The paper tackles the limitations of vanilla reward models (RMs) in RLHF, notably their static knowledge and their gaps in arithmetic and factual lookup, by introducing Themis, a tool-augmented RM that learns when and how to invoke external tools and generates stepwise reasoning traces. Themis combines a pairwise RM objective with autoregressive training on tool invocations and rationales, supported by the Tool-Augmented Reward Dataset (TARA), which pairs questions with high-quality positive and negative answers and tool-invocation traces across seven tools. Empirically, Themis achieves substantial gains over vanilla RMs (+17.7% on average across eight tasks in mixed-tool mode; +19.2% in single-tool mode), reaches 100% on the Calendar and Weather tasks, and outperforms baselines on TruthfulQA and Retarded-bar. In human preference evaluations, RLHF trained with Themis attains an average win rate of 32% against baselines. The work advances practical reward modeling by improving interpretability, reliability, and generalization through tool integration, with implications for truthfulness and factuality in LLM alignment and downstream RLHF/RLTAF settings.
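As a rough illustration of how a pairwise RM objective can be combined with an autoregressive loss on tool-invocation and rationale tokens, the sketch below uses plain Python. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the weight `alpha`, and the simple additive combination are all hypothetical, and this summary does not state the paper's exact loss.

```python
import math
from typing import Sequence


def pairwise_rm_loss(r_pos: float, r_neg: float) -> float:
    """Standard pairwise ranking loss: -log(sigmoid(r_pos - r_neg)).

    Computed as softplus(-(r_pos - r_neg)) in a numerically stable form.
    """
    margin = r_pos - r_neg
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def token_nll(token_log_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood over the trace tokens
    (tool calls, observations, rationale) under the model."""
    return -sum(token_log_probs) / len(token_log_probs)


def combined_loss(
    r_pos: float,
    r_neg: float,
    trace_log_probs: Sequence[float],
    alpha: float = 1.0,  # hypothetical weight, not taken from the paper
) -> float:
    """Assumed combination: ranking loss plus weighted autoregressive loss."""
    return pairwise_rm_loss(r_pos, r_neg) + alpha * token_nll(trace_log_probs)


if __name__ == "__main__":
    # Scalar rewards for the preferred and rejected answers (made-up values).
    log_probs = [math.log(0.9), math.log(0.8), math.log(0.95)]
    print(combined_loss(r_pos=1.2, r_neg=-0.3, trace_log_probs=log_probs))
```

The ranking term shrinks as the preferred answer's reward exceeds the rejected answer's reward, while the autoregressive term rewards reproducing the reference tool-use trace; setting `alpha` to zero recovers a vanilla pairwise RM loss in this sketch.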

Abstract

Reward modeling (a.k.a. preference modeling) is instrumental for aligning large language models with human preferences, particularly within the context of reinforcement learning from human feedback (RLHF). While conventional reward models (RMs) have exhibited remarkable scalability, they often struggle with fundamental functionality such as arithmetic computation, code execution, and factual lookup. In this paper, we propose a tool-augmented preference modeling approach, named Themis, to address these limitations by empowering RMs with access to external environments, including calculators and search engines. This approach not only fosters synergy between tool utilization and reward grading but also enhances interpretive capacity and scoring reliability. Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources and construct task-specific tool engagement and reasoning traces in an autoregressive manner. We validate our approach across a wide range of domains, incorporating seven distinct external tools. Our experimental results demonstrate a noteworthy overall improvement of 17.7% across eight tasks in preference ranking. Furthermore, our approach outperforms Gopher 280B by 7.3% on the TruthfulQA task in zero-shot evaluation. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines across four distinct tasks. Additionally, we provide a comprehensive collection of tool-related RM datasets, incorporating data from seven distinct tool APIs, totaling 15,000 instances. We have made the code, data, and model checkpoints publicly available to facilitate and inspire further research: https://github.com/ernie-research/Tool-Augmented-Reward-Model


Paper Structure (51 sections, 2 equations, 12 figures, 20 tables)


Introduction
Tool-Augmented Reward Modeling
Revisiting Reward Models
Themis: Tool-Augmented Reward Modeling
Training Objectives
Connection to Vanilla RM
Tool-Augmented Reward Dataset
Data Collection
Data Statistics
Experiments
Experimental Settings
Main Results
Single-Tool vs. Mixed-Tool Performance.
Scaling Trends in Themis.
Effect of Varying Training Epochs.
...and 36 more sections

Figures (12)

Figure 1: A diagram illustrating the pipeline of (a) vanilla reward models (RMs); (b) tool-augmented RMs, namely Themis; (c) reinforcement learning via proximal policy optimization (PPO) on the above RMs; (d) examples of the single- or multiple-tool use process in the proposed approach. See the method section for more details.
Figure 2: An illustration of the data creation pipeline for our Tool-Augmented DatAset (TARA).
Figure 3: Left: Model performance for various training epoch numbers; Right: Visualization of the change of average reward scores with training epochs. The top reward score line of each model corresponds to the positive answer, while the bottom line corresponds to the negative answer.
Figure 4: Left: The variation in the number of correctly and incorrectly invoked tools. The dashed line is the total number of invoked tools in TARA, and the pentagram marks the best-performing epoch. Right: Comparison of the number of invocations of different tools.
Figure 5: An example of the Weather tool.
...and 7 more figures
