DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen

TL;DR

Experiments demonstrate that DogeRM improves performance across different benchmarks, and a detailed analysis of the effects of model merging shows its strong potential for facilitating model alignment.

Abstract

Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the \textbf{Do}main knowled\textbf{ge} merged \textbf{R}eward \textbf{M}odel (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging. Experiments demonstrate that DogeRM improves performance across different benchmarks, and a detailed analysis of the effects of model merging shows its strong potential for facilitating model alignment.
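
As a rough illustration of what "model merging" can look like, the sketch below linearly interpolates the parameters that a general reward model and a domain-specific language model have in common. This is an assumption-laden toy, not the authors' implementation: the function name `merge_parameters`, the dict-of-lists parameter format, the choice to let `lam` weight the domain model, and the rule of copying reward-model-only parameters unchanged are all hypothetical choices made for this example.

```python
def merge_parameters(reward_model, domain_model, lam):
    """Linearly interpolate parameters shared by two models.

    reward_model / domain_model: dict mapping a parameter name to a flat
    list of floats (a hypothetical stand-in for real weight tensors).
    lam: interpolation weight in [0, 1]; here it weights the domain model,
    which is an assumption made for this sketch.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    merged = {}
    for name, rm_values in reward_model.items():
        if name in domain_model:
            lm_values = domain_model[name]
            if len(lm_values) != len(rm_values):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            # Shared parameter: weighted average of the two models.
            merged[name] = [
                lam * d + (1.0 - lam) * r
                for r, d in zip(rm_values, lm_values)
            ]
        else:
            # Parameter that only the reward model has (for example a
            # reward head): copied unchanged in this sketch.
            merged[name] = list(rm_values)
    return merged


general_rm = {"layer.weight": [0.0, 2.0], "reward_head": [1.0]}
domain_lm = {"layer.weight": [4.0, 6.0], "lm_head": [9.0]}
merged = merge_parameters(general_rm, domain_lm, lam=0.25)
print(merged)  # {'layer.weight': [1.0, 3.0], 'reward_head': [1.0]}
```

With `lam=0.0` the result is the reward model unchanged, and parameters that exist only in the domain model (such as `lm_head` above) are dropped, since the merged model is still meant to act as a reward model.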

Paper Structure (46 sections, 7 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: The framework of DogeRM, illustrating the merging of a general RM with a domain-specific LM to create a domain-specific RM. All icons used in this figure are sourced from https://www.flaticon.com/.
  • Figure 2: Best-of-N results. Merging with domain-specific models improves reranking accuracy. Topline: Pass@N, the probability of obtaining at least one correct solution out of N responses. Baseline: LLaMA-2 RM.
  • Figure 3: The impact of different values of $\lambda$ on RewardBench math and code subsets. (a)(b): Accuracy; (c)(d): Reward difference between chosen and rejected prompts.
  • Figure 4: Full results of LLaMA-2 RM + MetaMath on GSM8K.
  • Figure 5: Full results of LLaMA-2 RM + MAmmoTH on GSM8K.
  • ...and 10 more figures
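
The two quantities named in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the data layout (one list of `(reward, is_correct)` pairs per problem) and the function names `best_of_n_accuracy` and `pass_at_n` are hypothetical, and Pass@N is estimated empirically as the fraction of problems with at least one correct response among the N sampled.

```python
def best_of_n_accuracy(problems):
    """Fraction of problems whose highest-reward response is correct.

    problems: list of problems; each problem is a list of
    (reward, is_correct) pairs, one pair per sampled response.
    """
    hits = 0
    for candidates in problems:
        # Rerank: keep the response the reward model scores highest.
        _, is_correct = max(candidates, key=lambda c: c[0])
        hits += bool(is_correct)
    return hits / len(problems)


def pass_at_n(problems):
    """Fraction of problems with at least one correct response.

    This is an upper bound on what any reranker can reach with the
    same set of candidates.
    """
    solved = sum(
        any(is_correct for _, is_correct in candidates)
        for candidates in problems
    )
    return solved / len(problems)


problems = [
    [(0.1, False), (0.9, True)],   # reward model picks the correct answer
    [(0.8, False), (0.2, True)],   # correct answer exists but is ranked low
    [(0.5, False), (0.4, False)],  # no correct answer sampled
    [(0.3, True), (0.1, False)],   # reward model picks the correct answer
]
print(best_of_n_accuracy(problems))  # 0.5
print(pass_at_n(problems))           # 0.75
```

The gap between the two numbers (0.5 versus 0.75 here) is the headroom a better reward model could still recover by reranking the same candidates.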