Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models

Anirudh Bharadwaj, Chaitanya Malaviya, Nitish Joshi, Mark Yatskar

TL;DR

This work systematically investigates how training data biases relate to preference model miscalibration across five idiosyncratic features of language model generations (length, structure, jargon, sycophancy, and vagueness) and proposes a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples.

Abstract

Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. However, the connection between training data artifacts and the miscalibrated preferences exhibited by models remains poorly understood. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy, and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with artificially magnified biases (skew), finding that this preference occurs in $>60\%$ of instances and that model preferences show high miscalibration ($\approx 40\%$) compared to human preferences. Notably, bias features show only mild negative correlations with human preference labels (mean $r_{\mathrm{human}} = -0.12$) but moderately strong positive correlations with labels from a strong reward model (mean $r_{\mathrm{model}} = +0.36$), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Fine-tuning models with CDA reduces average miscalibration from $39.4\%$ to $32.5\%$ and average absolute skew difference from $20.5\%$ to $10.0\%$, while maintaining overall RewardBench performance, indicating that targeted debiasing can strengthen the reliability of preference models within standard alignment pipelines.
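
The two diagnostics named in the abstract can be illustrated with a minimal Python sketch. It assumes one plausible reading of the terms, which may not match the paper's exact definitions: skew is taken as the fraction of counterfactual pairs in which a judge prefers the perturbed (bias-magnified) response, and miscalibration as the fraction of pairs in which the model's choice disagrees with the human's. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def skew_rate(prefers_perturbed: Sequence[bool]) -> float:
    """Fraction of pairs where the judge picks the perturbed response.

    Assumed definition, based on the abstract's description of skew.
    """
    if not prefers_perturbed:
        raise ValueError("need at least one pair")
    return sum(prefers_perturbed) / len(prefers_perturbed)


def miscalibration_rate(
    model_prefers_perturbed: Sequence[bool],
    human_prefers_perturbed: Sequence[bool],
) -> float:
    """Fraction of pairs where model and human choices differ (assumed definition)."""
    if len(model_prefers_perturbed) != len(human_prefers_perturbed):
        raise ValueError("model and human labels must cover the same pairs")
    if not model_prefers_perturbed:
        raise ValueError("need at least one pair")
    disagreements = sum(
        m != h for m, h in zip(model_prefers_perturbed, human_prefers_perturbed)
    )
    return disagreements / len(model_prefers_perturbed)


# Toy data: one entry per (original, perturbed) pair; True means the
# perturbed response was preferred. These values are made up.
model_choices = [True, True, True, False, True]
human_choices = [False, True, False, False, True]

model_skew = skew_rate(model_choices)
human_skew = skew_rate(human_choices)
abs_skew_difference = abs(model_skew - human_skew)
miscalibration = miscalibration_rate(model_choices, human_choices)

print(f"model skew: {model_skew:.2f}, human skew: {human_skew:.2f}")
print(f"absolute skew difference: {abs_skew_difference:.2f}")
print(f"miscalibration: {miscalibration:.2f}")
```

Under these assumptions, a model can have a small skew difference yet still be miscalibrated, because disagreements in opposite directions cancel in the skew but not in the per-pair comparison. That is one reason to track both quantities.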

Paper Structure

This paper contains 44 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Examples of three idiosyncratic biases in language models: (1) Flattery: responses that excessively agree with the user; (2) Fluff: verbose, uninformative responses; and (3) Fog: vague responses that state many non-specific claims. Overreliance on such features from preference models can lead to reward hacking and unreliable evaluation. The complete list of biases explored in this work is in Table \ref{tab:bias_examples}.
  • Figure 2: Skew and calibration errors averaged across reward models (top row) and all LLM evaluators (bottom row) in favor of perturbed (biased) responses, compared with human preferences.
  • Figure 3: Contingency tables for each bias feature in the $2500$‑example training subset, showing co‑occurrence of bias presence in human‑chosen vs. human‑rejected responses. Anti-diagonal cells (top-right and bottom-left) quantify cases where the two responses differed on the feature.
  • Figure 4: Point–biserial correlations between bias presence and preference labels for each perturbation type. Circles show human judgments on the perturbation set ($r_{\mathrm{human}}$, x-axis) versus model judgments on the same ($r_{\mathrm{model}}$, y-axis). Triangles mark the corresponding human-bias presence correlations from the $2500$-example training data subset ($r_{\mathrm{human}}^{\mathrm{train}}$). The gray diagonal denotes perfect alignment; points above it indicate model bias overreliance.
  • Figure 5: Skew and calibration error of base reward models and reward models finetuned on counterfactual data, compared with human preferences.
  • ...and 2 more figures
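
The point-biserial correlation mentioned in the Figure 4 caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: each response is assumed to be one row with a binary bias-presence indicator and a label of 1 if it was preferred and 0 otherwise. The function name `point_biserial` and the toy data are hypothetical.

```python
import math
from typing import Sequence


def point_biserial(binary: Sequence[int], values: Sequence[float]) -> float:
    """Point-biserial correlation between a 0/1 variable and a numeric one.

    Uses r = (mean_1 - mean_0) / s * sqrt(p * (1 - p)), where s is the
    population standard deviation of `values` and p is the share of 1s.
    When `values` is also 0/1, this equals the Pearson (phi) coefficient.
    """
    if len(binary) != len(values):
        raise ValueError("inputs must have the same length")
    n = len(values)
    group_one = [v for b, v in zip(binary, values) if b]
    group_zero = [v for b, v in zip(binary, values) if not b]
    if not group_one or not group_zero:
        raise ValueError("both groups of the binary variable must be non-empty")

    mean_all = sum(values) / n
    std = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("values are constant; correlation is undefined")

    mean_one = sum(group_one) / len(group_one)
    mean_zero = sum(group_zero) / len(group_zero)
    p = len(group_one) / n
    return (mean_one - mean_zero) / std * math.sqrt(p * (1 - p))


# Toy data (made up): 1 = bias feature present / response preferred.
bias_present = [1, 1, 1, 0, 0, 0]
preferred = [0, 0, 1, 1, 1, 0]

r = point_biserial(bias_present, preferred)
print(f"r = {r:.3f}")
```

In this toy example the responses carrying the bias feature are preferred less often, so the correlation is negative. Running the same computation once with human labels and once with model labels would give the two coordinates of a point in a plot like Figure 4.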