Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Jou Barzdukas, Simon Fu, Narmeen Oozeer

TL;DR

This work investigates self-preference bias in LLM evaluators, a threat to fair and reliable judgments in tasks like preference tuning and routing. It introduces a formal measurement framework and a dedicated XSUM-based evaluation set that distinguishes illegitimate, legitimate, and unbiased preferences, enabling robust testing of mitigation methods. Two inference-time steering techniques, Contrastive Activation Addition (CAA) and gradient-based activation optimization, are proposed to edit activations and steer judgments without retraining; both outperform prompting and DPO baselines in reducing illegitimate bias. Results show up to a 97% reduction in illegitimate self-preference for several vectors, but the vectors are unstable on legitimate self-preference and unbiased agreement. This suggests that self-preference is represented nonlinearly or along multiple directions in activation space, and signals the need for more robust interventions.

Abstract

Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from "self-preference bias": a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining. We introduce a curated dataset that separates justified from unjustified examples of self-preference, and we construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach. Our results show that steering vectors can reduce unjustified self-preference bias by up to 97%, substantially outperforming prompting and direct preference optimization baselines. Yet steering vectors are unstable on legitimate self-preference and unbiased agreement, implying that self-preference spans multiple or nonlinear directions. This underscores both their promise and their limits as safeguards for LLM-as-judge systems and motivates more robust interventions.
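
The abstract names Contrastive Activation Addition without spelling out the computation. As a rough illustration only, the sketch below follows the usual formulation of CAA: the steering vector is the mean difference between activations from contrasting examples, and it is added to an activation at inference time after scaling by a multiplier. The function names, the two-dimensional toy activations, and the choice of which side counts as "positive" are assumptions made for this example, not details taken from the paper.

```python
from statistics import fmean
from typing import Sequence

Vector = list[float]


def caa_steering_vector(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Mean of (positive - negative) over paired activations from one layer."""
    if not positive or len(positive) != len(negative):
        raise ValueError("need the same non-zero number of positive and negative activations")
    dim = len(positive[0])
    return [fmean(p[i] - n[i] for p, n in zip(positive, negative)) for i in range(dim)]


def steer(activation: Vector, vector: Vector, multiplier: float) -> Vector:
    """Add the scaled steering vector to a single activation."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


# Toy data (hypothetical): activations on unbiased vs. self-preferring judgments.
unbiased = [[1.0, 0.0], [3.0, 2.0]]
self_preferring = [[0.0, 0.0], [1.0, 2.0]]

vector = caa_steering_vector(unbiased, self_preferring)
steered = steer([0.0, 1.0], vector, multiplier=0.5)
print(vector, steered)
```

In a real model the activations would be read from, and the scaled vector added back into, a chosen layer during the forward pass; the toy lists above only stand in for that step.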

Paper Structure

This paper contains 29 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: A steering vector fits a self-preferring model around an aligned mean in blind (left) and aware (right) pairwise preference tests, suggesting that the representation of self-preference can be captured in a linear space. Steering is applied at layer 14 with multipliers of 0.5 (CAA) and 0.1 (Optimization).
  • Figure 2: Probability of the self-evaluating model $J$ choosing the comparison model $K$'s summary on the y-axis, and multipliers on the x-axis. This plot is for the subset of examples in which $J$ thinks its summary is better and the gold judges $\{G_1, \dots, G_n\}$ think that $K$'s summary is better.
  • Figure 3: Probability of the self-evaluating model $J$ choosing the comparison model $K$'s summary on the y-axis. This plot is for the subset of examples in which $J$ agrees with the gold judges $\{G_1, \dots, G_n\}$ that $K$'s summary is best.
  • Figure 4: Probability of the self-evaluating model $J$ choosing its own summary on the y-axis, and multipliers on the x-axis. This plot is for the subset of examples in which the self-evaluating model $J$ thinks that its summary is better and the gold judges $\{G_1, \dots, G_n\}$ agree.
  • Figure 5: Plot of the distribution of a model’s probability of selecting its own output on the APPS dataset in a pairwise setting. LLaMA markedly overestimates itself, with its mean self-preference far above the impartial judge score.
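
The captions for Figures 2 to 4 describe three subsets defined by whether the self-evaluating model $J$ and the gold judges $\{G_1, \dots, G_n\}$ pick $J$'s or $K$'s summary. A minimal sketch of that partition follows. The label strings, the majority-vote aggregation of gold judges, and the handling of ties and of the remaining case ($J$ picks $K$ while the gold judges pick $J$) are assumptions for illustration; the text here does not say how the gold verdict is aggregated.

```python
from collections import Counter
from typing import Optional, Sequence


def gold_consensus(gold_votes: Sequence[str]) -> Optional[str]:
    """Majority vote over gold judges ('J' or 'K'); None on a tie (assumed rule)."""
    counts = Counter(gold_votes)
    if counts["J"] == counts["K"]:
        return None
    return "J" if counts["J"] > counts["K"] else "K"


def categorize(judge_choice: str, gold_votes: Sequence[str]) -> str:
    """Assign one pairwise example to a subset based on J's choice and the gold verdict."""
    gold = gold_consensus(gold_votes)
    if gold is None:
        return "no_consensus"
    if judge_choice == "J" and gold == "K":
        return "illegitimate_self_preference"  # subset plotted in Figure 2
    if judge_choice == "K" and gold == "K":
        return "unbiased_agreement"  # subset plotted in Figure 3
    if judge_choice == "J" and gold == "J":
        return "legitimate_self_preference"  # subset plotted in Figure 4
    return "uncategorized"  # J picks K while gold judges pick J; not described in the captions


print(categorize("J", ["K", "K", "J"]))
```

Keeping the subsets separate matters because a good intervention should flip choices in the first subset while leaving the other two unchanged.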