Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models
Sheng-Lun Wei, Cheng-Kuang Wu, Hen-Hsen Huang, Hsin-Hsi Chen
TL;DR
This work investigates selection biases in large language models (LLMs) that must pick the best option from an ordered list in zero-shot multiple-choice question (MCQ) prompts. It characterizes token sensitivity and order sensitivity, introduces a Fluctuation Rate to quantify instability, and reports that modern models often exhibit stronger order sensitivity than token sensitivity, with variability across model families and tasks. The authors propose cost-effective mitigation strategies: gray-box probability weighting and calibration, and a black-box two-hop strategy. These yield substantial robustness gains across multiple benchmarks (including ARC, HellaSwag, MMLU, Winogrande, MathQA, and OpenBookQA) at modest additional cost. The findings also identify task difficulty as a driver of sensitivity and point to practical pathways toward more stable MCQ selection in real-world applications.
Abstract
In this paper, we investigate the phenomenon of "selection biases" in Large Language Models (LLMs), focusing on problems where models are tasked with choosing the optimal option from an ordered sequence. We delve into biases related to option order and token usage, which significantly impact LLMs' decision-making processes. We also quantify the impact of these biases through an extensive empirical analysis across multiple models and tasks. Furthermore, we propose mitigation strategies to enhance model performance. Our key contributions are threefold: 1) Precisely quantifying the influence of option order and token usage on LLMs, 2) Developing strategies to mitigate the impact of token and order sensitivity to enhance robustness, and 3) Offering a detailed analysis of sensitivity across models and tasks, which informs the creation of more stable and reliable LLM applications for selection problems.
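As an illustration of how order sensitivity might be quantified, the sketch below computes a fluctuation-style rate: the fraction of questions for which the selected option content changes when the same options are presented in reversed order. This summary does not give the exact definition, so the formula is an assumption and may differ in detail from the paper's. The function names (`selected_content`, `fluctuation_rate`) and the toy data are hypothetical.

```python
from typing import List, Sequence


def selected_content(options: Sequence[str], picked_index: int) -> str:
    """Map a picked position back to the option text, so that choices
    made under different orderings can be compared by content."""
    return options[picked_index]


def fluctuation_rate(choices_a: List[str], choices_b: List[str]) -> float:
    """Fraction of questions whose selected content differs between
    two prompt variants (assumed definition, for illustration only)."""
    if len(choices_a) != len(choices_b):
        raise ValueError("both variants must cover the same questions")
    if not choices_a:
        raise ValueError("at least one question is required")
    changed = sum(a != b for a, b in zip(choices_a, choices_b))
    return changed / len(choices_a)


# Toy data: four two-option questions, shown in original and reversed order.
original_options = [["Paris", "Rome"], ["2", "3"], ["red", "blue"], ["cat", "dog"]]
reversed_options = [list(reversed(opts)) for opts in original_options]

# Hypothetical model picks (0-based positions) under each ordering.
picks_original = [0, 1, 0, 0]
picks_reversed = [1, 1, 0, 1]

content_original = [
    selected_content(opts, i) for opts, i in zip(original_options, picks_original)
]
content_reversed = [
    selected_content(opts, i) for opts, i in zip(reversed_options, picks_reversed)
]

rate = fluctuation_rate(content_original, content_reversed)
print(f"Fluctuation rate: {rate:.2f}")
```

Comparing by option content rather than by position or label is a design choice here: a model that always picks the first position would show a high rate under reversal, which is the kind of instability an order-sensitivity measure is meant to expose.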
