Mitigating Self-Preference by Authorship Obfuscation

Taslim Mahbub, Shi Feng

TL;DR

The paper tackles harmful self-preference in LM-based judges, where a judge favors its own outputs over those of others. It tests the self-recognition hypothesis by applying black-box perturbations (notably synonym replacement and paraphrasing) to obfuscate authorship in pairwise evaluations on the QuALITY long-document QA benchmark, plus a coding-task extension. The key findings are that simple synonym substitutions can reduce self-recognition and harmful self-preference, but complete mitigation is challenging because the bias also arises from deeper semantic cues; paraphrasing can even increase bias in some cases. The work highlights practical strategies for making LM-based judgments more reliable, such as targeted perturbations and ensemble judging, while acknowledging fundamental limits to fully decoupling judgments from prior beliefs.

Abstract

Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite many advantages, LM judges display concerning biases that can impair their integrity in evaluations. One such bias is self-preference: LM judges preferring their own answers over those produced by other LMs or humans. The bias is hard to eliminate as frontier LM judges can distinguish their own outputs from those of others, even when the evaluation candidates are not labeled with their sources. In this paper, we investigate strategies to mitigate self-preference by reducing the LM judges' ability to recognize their own outputs. We apply black-box perturbations to evaluation candidates in pairwise comparison to obfuscate the authorship and reduce self-recognition. We find that perturbations as simple as synonym replacement for a few words predictably reduce self-preference. However, we also uncover fundamental challenges to eliminating the bias: when we extrapolate our perturbations to a more complete neutralization of stylistic differences between the evaluation candidates, self-preference recovers. Our findings suggest that self-recognition and self-preference can happen on many semantic levels, and complete mitigation remains challenging despite promising initial results.
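
As a rough illustration of the "synonym replacement for a few words" idea, the sketch below swaps a small number of words in each candidate answer before the pair is shown to a judge. It is not the authors' implementation: the synonym table, the function names (replace_synonyms, perturb_pair), and the max_replacements parameter are all hypothetical, and the paper may select synonyms differently (for example, with another LM).

```python
import random
import re

# Hypothetical toy synonym table; the paper's actual synonym source is not given here.
SYNONYMS = {
    "big": "large",
    "shows": "demonstrates",
    "because": "since",
    "help": "assist",
}


def replace_synonyms(text, synonyms=SYNONYMS, max_replacements=2, seed=0):
    """Swap up to `max_replacements` words for synonyms, leaving the rest intact."""
    rng = random.Random(seed)
    # Split into word and non-word tokens so joining them restores the layout.
    tokens = re.findall(r"\w+|\W+", text)
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in synonyms]
    chosen = rng.sample(candidates, min(max_replacements, len(candidates)))
    for i in chosen:
        new = synonyms[tokens[i].lower()]
        # Preserve an initial capital letter if the original word had one.
        tokens[i] = new.capitalize() if tokens[i][0].isupper() else new
    return "".join(tokens)


def perturb_pair(answer_a, answer_b, **kwargs):
    """Apply the same perturbation to both candidates before judging (an assumption)."""
    return replace_synonyms(answer_a, **kwargs), replace_synonyms(answer_b, **kwargs)
```

Keeping the number of replacements small is deliberate in this sketch: the aim is to disturb surface-level word choice that could reveal authorship while leaving the content of the answer unchanged.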

Paper Structure

This paper contains 27 sections, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Harmful self-preference of DeepSeek-V3 in pairwise comparison. When we perturb the answer pair to reduce superficial cues (e.g., word choice) that the judge can use to infer model identity, harmful self-preference also decreases. However, the effect is reversed when we eliminate identity cues by paraphrasing.
  • Figure 2: Bigger models are more accurate at both answering questions and judging.
  • Figure 3: Capable models are less sensitive to the order of evaluation and make fewer ambiguous decisions.
  • Figure 4: Win rate of each model against all others, as judged by the ground truth versus as judged by the model itself. Stronger models overestimate their accuracy; weaker models do the opposite.
  • Figure 5: Strong models are significantly less accurate on examples where their own answers are wrong (harmful cases), but have a higher overall judge accuracy.
  • ...and 11 more figures
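
The captions above refer to "harmful cases" (the judge's own answer is wrong) and to order sensitivity in pairwise evaluation. A minimal sketch of how a harmful self-preference rate could be tallied is given below. The record layout, the field names, and the rule that a decision counts only when it is consistent across both presentation orders are assumptions for illustration, not the paper's stated definition.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    own_correct: bool     # judge's own answer matches the ground truth
    other_correct: bool   # competing answer matches the ground truth
    picks_self_ab: bool   # judge picked its own answer when it was shown first
    picks_self_ba: bool   # judge picked its own answer when it was shown second


def harmful_self_preference_rate(comparisons):
    """Share of harmful cases where the judge still picks its own (wrong) answer.

    A case is treated as harmful when the judge's answer is wrong and the
    competing answer is right. A pick counts only if it is the same in both
    presentation orders; inconsistent picks are treated as ambiguous.
    """
    harmful = [c for c in comparisons if not c.own_correct and c.other_correct]
    if not harmful:
        return None  # no harmful cases, so the rate is undefined
    self_pref = sum(c.picks_self_ab and c.picks_self_ba for c in harmful)
    return self_pref / len(harmful)
```

Comparing this rate before and after a perturbation is one way to check whether obfuscating authorship changes the judge's behavior on exactly the cases where self-preference does damage.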