RBCorr: Response Bias Correction in Language Models

Om Bhatt, Anna A. Ivanova

TL;DR

This work tackles response bias in fixed-response evaluations of language models by introducing RBCorr, a low-cost, LogProbs-based calibration method that debiases outputs using a small, held-out, class-balanced calibration set. RBCorr extracts last-layer LogProbs, measures bias with Total Variation Distance (TVD), and applies per-option mean normalization; this reduces bias while preserving or improving accuracy across diverse yes-no, entailment, and multiple-choice tasks. In comparisons against Contextual Calibration, Batch Calibration, and PriDe, RBCorr often yields the largest bias reductions and substantial accuracy gains, particularly for smaller models. However, bias patterns depend heavily on the model, dataset, and prompt configuration, which limits how well a correction transfers from one setup to another. These findings support using calibrated LogProbs both as a practical evaluation tool and as a means of revealing latent capabilities in LM benchmarks, with two caveats: the method applies only to open-weight models, and calibration must be performed separately for each condition. The approach has practical implications for fairer benchmarking and targeted debiasing in real-world deployments of lightweight language systems.
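
The pipeline summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that bias is the TVD between the predicted-label distribution and a uniform ground-truth distribution, and that per-option mean normalization means subtracting each option's mean LogProb, estimated on the calibration set, from that option's LogProb on every test item. The function names, the synthetic scores standing in for LogProbs, and the constant offset favoring option 0 are all hypothetical.

```python
import numpy as np

def total_variation_distance(pred_labels, n_options):
    """TVD between the predicted-label distribution and a uniform one."""
    counts = np.bincount(pred_labels, minlength=n_options)
    pred_dist = counts / counts.sum()
    uniform = np.full(n_options, 1.0 / n_options)
    return 0.5 * np.abs(pred_dist - uniform).sum()

def fit_option_means(calib_scores):
    """Mean score per option over the calibration set (assumed correction statistic)."""
    return calib_scores.mean(axis=0)

def corrected_predictions(test_scores, option_means):
    """Subtract each option's calibration mean, then pick the highest-scoring option."""
    return (test_scores - option_means).argmax(axis=1)

# Synthetic stand-in for per-option LogProbs: a hypothetical model that has
# some signal for the correct option plus a constant preference for option 0.
rng = np.random.default_rng(0)
n_options = 2
option_bias = np.array([1.5, 0.0])  # hypothetical response bias

def simulate_scores(labels):
    signal = np.eye(n_options)[labels]              # +1 on the correct option
    noise = rng.normal(scale=0.5, size=signal.shape)
    return signal + noise + option_bias

calib_labels = np.repeat(np.arange(n_options), 50)   # class-balanced, 100 items
test_labels = np.repeat(np.arange(n_options), 500)   # class-balanced test set
calib_scores = simulate_scores(calib_labels)
test_scores = simulate_scores(test_labels)

baseline_pred = test_scores.argmax(axis=1)
rbcorr_pred = corrected_predictions(test_scores, fit_option_means(calib_scores))

tvd_before = total_variation_distance(baseline_pred, n_options)
tvd_after = total_variation_distance(rbcorr_pred, n_options)
acc_before = (baseline_pred == test_labels).mean()
acc_after = (rbcorr_pred == test_labels).mean()
print(f"TVD: {tvd_before:.3f} -> {tvd_after:.3f}")
print(f"Accuracy: {acc_before:.3f} -> {acc_after:.3f}")
```

On this toy data the corrected predictions should show lower TVD and higher accuracy than the baseline, because the simulated bias is a constant per-option offset. Real LogProbs need not behave this way, and the statistic would have to be re-estimated for each model, dataset, and prompt format.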

Abstract

Language models (LMs) are known to be prone to response biases, which present as option preference biases in fixed-response questions. It is therefore imperative to develop low-cost and effective response bias correction methods to improve LM performance and enable more accurate evaluations of model abilities. Here, we propose a simple response bias correction strategy ($\texttt{RBCorr}$) and test it on 12 open-weight language models using yes-no, entailment, and multiple choice questions. We show that response bias is prevalent in LMs pre-correction and that $\texttt{RBCorr}$ effectively eliminates bias and boosts model performance. We also explore the generalizability of bias behavior across models, datasets, and prompt formats, showing that LogProbs-based correction is highly dependent on all three of these aspects. Overall, $\texttt{RBCorr}$ is an easy-to-use method that can boost the performance of smaller LMs and ensure that LM performance on closed-response benchmarks aligns more closely with their true capabilities.

Paper Structure (24 sections, 1 equation, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Baseline model response label distribution for all models on all datasets, using the few-shot prompt format. Red dotted lines indicate uniform distribution intervals (i.e., the dataset's ground-truth label distribution).
  • Figure 2: Accuracy achieved by applying RBCorr at multiple calibration set sizes, averaged across all models. Shading shows the interquartile range of accuracies over 100 separate iterations of the correction process on each dataset. The horizontal dashed line shows the average baseline accuracy across all models. The results we discuss for our method use a calibration set size of 100 (marked with a vertical line), which we treat as a realistic setup.
  • Figure 3: Scatterplots showing per-model bias (TVD; $\downarrow$ is better) and accuracy (%; $\uparrow$ is better) before [$\bullet$] vs. after [$\times$] applying RBCorr. We show results on one dataset per question type; the bottom-right plot shows results averaged across all ten datasets.
  • Figure 4: Heatmaps showing transfer correction performance for all three transfer modalities using three specific model-dataset-prompt setups. The color gradient indicates the change in accuracy after applying correction, relative to baseline model accuracy.