
Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models

Sheng-Lun Wei, Cheng-Kuang Wu, Hen-Hsen Huang, Hsin-Hsi Chen

TL;DR

This work investigates selection biases in large language models (LLMs) that must pick the best option from an ordered list in zero-shot multiple-choice question (MCQ) prompts. It characterizes token sensitivity and order sensitivity, introduces a Fluctuation Rate metric to quantify how unstable a model's selections are, and finds that modern models often show stronger order sensitivity than token sensitivity, with the degree varying across model families and tasks. The authors propose cost-effective mitigation strategies: probability weighting and probability calibration for the gray-box scenario, and a two-hop strategy for the black-box scenario. These yield substantial robustness gains on multiple benchmarks (including ARC, HellaSwag, MMLU, Winogrande, MathQA, and OpenBookQA) at modest additional cost. The findings also identify task difficulty as a driver of sensitivity, and the resulting landscape of sensitivity across models and tasks can inform more stable and reliable LLM deployments for selection problems.

Abstract

In this paper, we investigate the phenomena of "selection biases" in Large Language Models (LLMs), focusing on problems where models are tasked with choosing the optimal option from an ordered sequence. We delve into biases related to option order and token usage, which significantly impact LLMs' decision-making processes. We also quantify the impact of these biases through an extensive empirical analysis across multiple models and tasks. Furthermore, we propose mitigation strategies to enhance model performance. Our key contributions are threefold: 1) Precisely quantifying the influence of option order and token on LLMs, 2) Developing strategies to mitigate the impact of token and order sensitivity to enhance robustness, and 3) Offering a detailed analysis of sensitivity across models and tasks, which informs the creation of more stable and reliable LLM applications for selection problems.

Paper Structure

This paper contains 30 sections, 20 equations, 8 figures, 18 tables.

Figures (8)

  • Figure 1: Correlation between model accuracy and fluctuation rates under three sensitivity settings: Token, Order, and Both. A linear regression line is drawn for each setting, together with its slope and $R^2$ value, to show the relation between model performance and fluctuation rate.
  • Figure 2: Distribution of accuracy differences across the 57 MMLU subtasks for the GPT-3.5 model in the gray-box scenario. Subtasks are sorted by accuracy difference from low to high, so subtasks toward the right benefit more from our methodology. Improvements are marked in green and declines in red. The top three panels show results for the probability weighting method under the three sensitivity settings; the bottom three panels show results for the probability calibration method.
  • Figure 3: Prompt template illustrating the token sensitivity setting for each question $q$. The upper part represents $r_{forward}$, and the lower part corresponds to $r_{backward}$. Option symbols are highlighted in blue, while both the question text and option contents are highlighted in orange. Other text shown in black remains consistent across questions.
  • Figure 4: Prompt template illustrating the order sensitivity setting for each question $q$. The upper part represents $r_{forward}$, and the lower part corresponds to $r_{backward}$. Option symbols are highlighted in blue, while both the question text and option contents are highlighted in orange. Other text shown in black remains consistent across questions.
  • Figure 5: Prompt template illustrating the "Both" sensitivity setting for each question $q$. The upper part represents $r_{forward}$, and the lower part corresponds to $r_{backward}$. Option symbols are highlighted in blue, while both the question text and option contents are highlighted in orange. Other text shown in black remains consistent across questions.
  • ...and 3 more figures
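
The captions above refer to paired responses $r_{forward}$ and $r_{backward}$ and to a fluctuation rate, but this listing does not define either precisely. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: it assumes the backward prompt reverses the option symbols (Token), the option contents (Order), or both (Both), and that the fluctuation rate is the fraction of questions whose selected option content differs between the two prompts. The function names `make_variants` and `fluctuation_rate` are hypothetical.

```python
from typing import List, Sequence, Tuple

Option = Tuple[str, str]  # (symbol, content), e.g. ("A", "Paris")


def make_variants(
    symbols: Sequence[str], contents: Sequence[str], setting: str
) -> Tuple[List[Option], List[Option]]:
    """Build forward and backward option lists for one question.

    ASSUMPTION: the backward variant is a simple reversal; the paper's
    exact prompt templates (Figures 3 to 5) are not reproduced here.
    """
    if len(symbols) != len(contents):
        raise ValueError("symbols and contents must have the same length")
    forward = list(zip(symbols, contents))
    if setting == "token":
        # Contents keep their positions; symbols are assigned in reverse.
        backward = list(zip(reversed(symbols), contents))
    elif setting == "order":
        # Symbols keep their positions; contents are listed in reverse.
        backward = list(zip(symbols, reversed(contents)))
    elif setting == "both":
        # Symbols and contents are both reversed.
        backward = list(zip(reversed(symbols), reversed(contents)))
    else:
        raise ValueError(f"unknown setting: {setting!r}")
    return forward, backward


def fluctuation_rate(forward: Sequence[str], backward: Sequence[str]) -> float:
    """Fraction of questions whose selected content changed.

    Each element is the option *content* the model selected, so answers
    remain comparable even when symbols are reassigned.
    ASSUMPTION: this definition is inferred, not quoted from the paper.
    """
    if len(forward) != len(backward):
        raise ValueError("response lists must have the same length")
    if not forward:
        raise ValueError("at least one question is required")
    changed = sum(f != b for f, b in zip(forward, backward))
    return changed / len(forward)


if __name__ == "__main__":
    fwd, bwd = make_variants(["A", "B", "C"], ["x", "y", "z"], "order")
    print(fwd)  # [('A', 'x'), ('B', 'y'), ('C', 'z')]
    print(bwd)  # [('A', 'z'), ('B', 'y'), ('C', 'x')]
    # One of four selections differs between the two prompts.
    print(fluctuation_rate(["Paris", "4", "blue", "H2O"],
                           ["Paris", "5", "blue", "H2O"]))  # 0.25
```

Comparing selected contents rather than symbols is a design choice of this sketch: once symbols are reassigned, the same symbol can point to different contents in the two prompts, so a symbol-level comparison would be misleading.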