DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

Tzu-Han Lin; Chen-An Li; Hung-yi Lee; Yun-Nung Chen

TL;DR

Experiments show that DogeRM improves performance across different benchmarks, and a detailed analysis of the effects of model merging indicates strong potential for facilitating model alignment.

Abstract

Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences that require expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model through model merging. Experiments show that DogeRM improves performance across different benchmarks, and a detailed analysis of the effects of model merging indicates strong potential for facilitating model alignment.
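The merging idea in the abstract can be illustrated with a minimal sketch. It assumes that merging means linear interpolation of the parameters the two models share, controlled by a weight `lam`, and that parameters found only in the reward model (for example a reward head) are kept unchanged. The function name, the parameter names, and the convention that `lam` weights the domain-specific model are illustrative assumptions, not details confirmed by this page.

```python
from typing import Dict, List

# A toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_parameters(reward_model: Params, domain_model: Params, lam: float) -> Params:
    """Interpolate shared parameters; keep reward-model-only parameters as-is.

    Assumption: lam is the weight on the domain-specific model, so lam = 0.0
    returns the original reward model unchanged.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    merged: Params = {}
    for name, rm_values in reward_model.items():
        if name in domain_model:
            dm_values = domain_model[name]
            if len(dm_values) != len(rm_values):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            # Element-wise weighted average of the two models' weights.
            merged[name] = [
                lam * d + (1.0 - lam) * r for r, d in zip(rm_values, dm_values)
            ]
        else:
            # Parameters that exist only in the reward model are copied over.
            merged[name] = list(rm_values)
    # Parameters that exist only in the domain model (e.g. an LM head) are dropped.
    return merged


# Hypothetical toy parameters, for illustration only.
general_rm = {"layer.weight": [1.0, 2.0], "reward_head.weight": [0.5]}
domain_lm = {"layer.weight": [3.0, 4.0], "lm_head.weight": [9.0]}
merged_rm = merge_parameters(general_rm, domain_lm, lam=0.5)
```

In practice the same loop would run over real model tensors rather than Python lists; the toy dictionaries only keep the sketch self-contained.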

Paper Structure (46 sections, 7 equations, 15 figures, 6 tables)

This paper contains 46 sections, 7 equations, 15 figures, 6 tables.

Introduction
Related Work
  Reward Modeling
  Model Merging
Methodology
  Reward Modeling
  Model Merging
Experiments
  Experimental Setup
    Reward Model
    Domain-Specific SFT
    Evaluation
  Results
    RM Benchmarks
    Best-of-N Sampling
...and 31 more sections

Figures (15)

Figure 1: The framework of DogeRM, illustrating the merging of a general RM with a domain-specific LM to create a domain-specific RM. All icons used in this figure are sourced from https://www.flaticon.com/.
Figure 2: Best-of-N results. Merging with domain-specific models improves reranking accuracy. Topline: Pass@N, the probability of obtaining at least one correct solution out of N responses. Baseline: LLaMA-2 RM.
Figure 3: The impact of different values of $\lambda$ on the RewardBench math and code subsets. (a)(b): Accuracy; (c)(d): Reward difference between chosen and rejected prompts.
Figure 4: Full results of LLaMA-2 RM + MetaMath on GSM8K.
Figure 5: Full results of LLaMA-2 RM + MAmmoTH on GSM8K.
...and 10 more figures
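The two quantities named in the Figure 2 caption can be sketched as follows. The sketch assumes each problem comes with N candidate responses, a reward score per response, and a correctness label per response. Best-of-N accuracy checks whether the highest-scored response is correct, while Pass@N is estimated as the fraction of problems with at least one correct response. The function name and data layout are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# One problem: (reward score per candidate, correctness label per candidate).
Problem = Tuple[Sequence[float], Sequence[bool]]


def rerank_metrics(problems: List[Problem]) -> Tuple[float, float]:
    """Return (best-of-N accuracy, empirical Pass@N) over a set of problems."""
    if not problems:
        raise ValueError("need at least one problem")

    best_of_n_hits = 0
    pass_at_n_hits = 0
    for rewards, correct in problems:
        if len(rewards) != len(correct) or not rewards:
            raise ValueError("each problem needs matching, non-empty lists")
        # Best-of-N: pick the candidate the reward model scores highest.
        best_index = max(range(len(rewards)), key=lambda i: rewards[i])
        best_of_n_hits += int(correct[best_index])
        # Pass@N topline: is any of the N candidates correct?
        pass_at_n_hits += int(any(correct))

    total = len(problems)
    return best_of_n_hits / total, pass_at_n_hits / total
```

Because the selected response is always one of the N candidates, best-of-N accuracy can never exceed Pass@N, which is why the caption treats Pass@N as the topline.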
