Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach

Blessed Guda, Lawrence Francis, Gabrial Zencha Ashungafac, Carlee Joe-Wong, Moise Busogi

TL;DR

The paper tackles selection bias in multiple-choice question (MCQ) evaluation of LLMs, where model answers depend on option order rather than content. It introduces an unsupervised Permutation Bias Metric (PBM) that assesses prediction consistency across all answer permutations, and two mitigation strategies: Batch Question-Context KV caching (BaQCKV) for efficient majority voting and an unsupervised LoRA-1 fine-tuning method. Empirical results across TeleQnA, MedMCQ, QASC, and ARC show that PBM tracks problem difficulty, that majority voting with BaQCKV (MV+BaQCKV) achieves zero bias, and that LoRA-1 yields substantial bias reductions alongside large compute savings and transferability across datasets. Together, these contributions yield a scalable framework for reliable MCQ evaluation and deployment of LLMs in bias-sensitive settings.

Abstract

Multiple Choice Question (MCQ) answering is a widely used method for evaluating the performance of Large Language Models (LLMs). However, LLMs often exhibit selection bias in MCQ tasks, where their choices are influenced by factors like answer position or option symbols rather than the content. This bias undermines the reliability of MCQ as an evaluation framework. Most existing selection bias metrics require answer labels and measure divergences between prediction and answer distributions, but do not fully capture the consistency of a model's predictions across different orderings of answer choices. Existing selection bias mitigation strategies have notable limitations: majority voting, though effective, is computationally prohibitive; calibration-based methods require validation sets and often fail to generalize across datasets. To address these gaps, we propose three key contributions: (1) a new unsupervised, label-free Permutation Bias Metric (PBM) that directly quantifies inconsistencies in model predictions across answer permutations, providing a more precise measure of selection bias; (2) an efficient majority voting approach called Batch Question-Context KV caching (BaQCKV), which significantly reduces computational costs while preserving bias mitigation effectiveness; and (3) an unsupervised Low-Rank Adaptation (LoRA-1) fine-tuning strategy, based on our proposed metric and BaQCKV, that mitigates selection bias and provides a computationally efficient alternative that maintains model generalizability. Experiments across multiple MCQ benchmarks demonstrate that our approaches reduce bias, increasing consistency in accuracy while minimizing computational costs.
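
To make the idea of consistency across answer permutations concrete, the following is a minimal sketch, not the paper's method. It assumes a hypothetical `predict(question, options)` callable that returns the 0-based position of the chosen option. The `inconsistency` score is an illustrative proxy only; the formal PBM definition is not reproduced in this excerpt, and nothing here models BaQCKV or LoRA-1.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, List, Sequence

# Hypothetical model interface (an assumption, not an API from the paper):
# takes a question and an ordered list of options, returns the chosen position.
Predictor = Callable[[str, Sequence[str]], int]


def permutation_votes(predict: Predictor, question: str,
                      options: Sequence[str]) -> List[str]:
    """Query the model once per ordering and record the chosen option's content."""
    votes = []
    for ordering in permutations(options):
        position = predict(question, ordering)
        votes.append(ordering[position])  # map the position back to content
    return votes


def majority_vote(votes: Sequence[str]) -> str:
    """Return the option content selected most often across orderings."""
    return Counter(votes).most_common(1)[0][0]


def inconsistency(votes: Sequence[str]) -> float:
    """Illustrative proxy: share of orderings that disagree with the majority."""
    top_count = Counter(votes).most_common(1)[0][1]
    return 1.0 - top_count / len(votes)


def first_position_predictor(question: str, options: Sequence[str]) -> int:
    """Toy model that always picks the first option, whatever it says."""
    return 0


def content_predictor(question: str, options: Sequence[str]) -> int:
    """Toy model that always finds the same content, whatever the order."""
    return list(options).index("Paris")


if __name__ == "__main__":
    opts = ["London", "Paris", "Rome", "Madrid"]
    for model in (content_predictor, first_position_predictor):
        votes = permutation_votes(model, "Capital of France?", opts)
        print(model.__name__, len(votes), round(inconsistency(votes), 2))
```

With four options the loop issues 4! = 24 model calls per question, which illustrates why the abstract describes plain majority voting as computationally prohibitive.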

Paper Structure

This paper contains 27 sections, 20 equations, 2 figures, 21 tables, 1 algorithm.

Figures (2)

  • Figure 1: Visualization of bias-related behaviors across models and strategies.
  • Figure 2: Attention scores for the last token in the last layer of the Llama model across different prompt permutations, shown for two transformer heads.

Theorems & Definitions (1)

  • Definition 1: Permutation Bias Metric – PBM