Automatically Finding Reward Model Biases

Atticus Wang; Iván Arcuschin; Arthur Conmy

TL;DR

This work proposes a simple approach in which an LLM iteratively proposes and refines candidate reward model biases. It presents evidence that evolutionary iteration outperforms flat best-of-N search, and it validates the pipeline's recall using synthetically injected biases.

Abstract

Reward models are central to large language model (LLM) post-training. However, past work has shown that they can reward spurious or undesirable attributes such as length, format, hallucinations, and sycophancy. In this work, we introduce and study the research problem of automatically finding reward model biases in natural language. We offer a simple approach of using an LLM to iteratively propose and refine candidate biases. Our method can recover known biases and surface novel ones: for example, we found that Skywork-V2-8B, a leading open-weight reward model, often mistakenly favors responses with redundant spacing and responses with hallucinated content. In addition, we show evidence that evolutionary iteration outperforms flat best-of-N search, and we validate the recall of our pipeline using synthetically injected biases. We hope our work contributes to further research on improving RMs through automated interpretability methods.
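
The contrast between flat best-of-N search and evolutionary iteration can be illustrated with a toy sketch. This is not the authors' pipeline: `propose` and `score` are hypothetical stand-ins for an LLM that proposes or refines a candidate bias and for a reward-model-based measure of bias strength, and the numeric toy problem in `make_toy` is purely an assumption for demonstration.

```python
import random
from typing import Callable, Optional, Tuple, TypeVar

# A candidate bias. In the paper's setting this would be a natural-language
# description; the toy below uses floats instead (an assumption).
C = TypeVar("C")

def best_of_n(propose: Callable[[Optional[C]], C],
              score: Callable[[C], float],
              n: int) -> C:
    """Flat search: draw n independent candidates and keep the top scorer."""
    candidates = [propose(None) for _ in range(n)]
    return max(candidates, key=score)

def evolve(propose: Callable[[Optional[C]], C],
           score: Callable[[C], float],
           population_size: int,
           generations: int,
           children_per_parent: int) -> C:
    """Evolutionary search: refine survivors, then keep the best of parents and children."""
    population = [propose(None) for _ in range(population_size)]
    for _ in range(generations):
        # Each surviving candidate is refined into several children.
        children = [propose(parent)
                    for parent in population
                    for _ in range(children_per_parent)]
        # Elitist selection: parents compete with their children.
        ranked = sorted(population + children, key=score, reverse=True)
        population = ranked[:population_size]
    return max(population, key=score)

def make_toy(seed: int) -> Tuple[Callable[[Optional[float]], float],
                                 Callable[[float], float]]:
    """Hypothetical toy problem: candidates are numbers, the best one is 3.0."""
    rng = random.Random(seed)

    def propose(parent: Optional[float]) -> float:
        if parent is None:
            return rng.uniform(-10.0, 10.0)   # fresh proposal
        return parent + rng.gauss(0.0, 1.0)   # refinement of an existing candidate

    def score(x: float) -> float:
        return -abs(x - 3.0)

    return propose, score
```

For a like-for-like comparison, both searches should get the same proposal budget: `evolve(propose, score, 4, 5, 2)` makes 4 + 5 × 4 × 2 = 44 calls to `propose`, which matches `best_of_n(propose, score, 44)`. Because selection here keeps parents alongside children, the best score in the population never decreases from one generation to the next.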

Paper Structure (30 sections, 4 equations, 10 figures, 4 tables, 1 algorithm)

Introduction
Methods
Definition of reward model bias
User prompt generation
Automatic pipeline for bias discovery
Results
Sanity Check: Evaluating format biases
Biases of Skywork-V2-8B
Comparing different pipeline configurations
Recall
Related work
Limitations
Conclusion
Another definition of reward model bias
User prompt generation
...and 15 more sections

Figures (10)

Figure 1: Between the original response with normal spacing, and the alternative response with extra whitespace characters between words, the RM mistakenly prefers the latter, disagreeing with the LLM judge.
Figure 2: Illustration of our pipeline. Each circle represents a population of candidate biases at different stages.
Figure 3: Format biases of Skywork-V2-8B on two user prompt datasets. The numbers below are the mean reward difference (i.e. bias strength) and the 95% CI, pooled across all three rewriters. The red bar in the boxplot is the mean.
Figure 4: Comparison of visualization and DABS metrics.
Figure 5: Recall rates under different conditions. The confidence intervals are 95% Wilson CIs with $n=10$, hence the wide error bars.
...and 5 more figures
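
The Figure 5 caption attributes its wide error bars to 95% Wilson confidence intervals computed with $n=10$. A minimal sketch of the standard Wilson score interval shows why; the function name and the example count of 7 recovered biases out of 10 are illustrative assumptions, not values from the paper.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4.0 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical example: 7 of 10 injected biases recovered.
low, high = wilson_interval(7, 10)
print(f"recall 0.70, 95% Wilson CI: [{low:.3f}, {high:.3f}]")
```

With only ten trials the interval spans roughly half of the unit range, so differences between conditions need to be large before the intervals stop overlapping.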
