SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions
Weijie Xu, Shixian Cui, Xi Fang, Chi Xue, Stephanie Eckman, Chandan K. Reddy
TL;DR
The paper introduces SATA-Bench, a large, human-validated benchmark of Select All That Apply (SATA) questions across six domains. It reveals persistent gaps in multi-answer reasoning even among state-of-the-art models. It formalizes three systematic biases (unselection, count, and speculation) and demonstrates that traditional prompting and single-answer training are insufficient for SATA tasks. To address these challenges, it proposes Choice Funnel, a decoding strategy that uses token debiasing, abstention, and adaptive stopping to guide models toward complete and correct selections, achieving up to 29% higher exact match than competitive baselines while reducing inference cost by over 64%. The work also provides comprehensive ablations, a PriDe-based adaptation for debiasing, and an open-source release of SATA-Bench and Choice Funnel to advance robust multi-answer reasoning in real-world applications.
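As a rough illustration of how a decoding loop with abstention and adaptive stopping might look, here is a minimal sketch. It is not the authors' implementation. The names `choice_funnel`, `score_fn`, `NONE_OPTION`, and `threshold` are hypothetical, and the sketch assumes that `score_fn` returns already-debiased probabilities over the candidate options, that a "None of the above" option is offered only after the first pick, and that at least one option is always selected.

```python
from typing import Callable, Dict, List

# Hypothetical label for the abstention option (an assumption of this sketch).
NONE_OPTION = "None of the above"

# score_fn(candidates, selected) -> probability per candidate.
# Assumed to wrap a model call plus token debiasing; not defined by the source.
ScoreFn = Callable[[List[str], List[str]], Dict[str, float]]


def choice_funnel(options: List[str], score_fn: ScoreFn,
                  threshold: float = 0.5) -> List[str]:
    """Pick options one at a time until the scorer abstains or loses confidence."""
    remaining = list(options)
    selected: List[str] = []
    while remaining:
        # Offer the abstention option only once something has been selected.
        candidates = remaining + ([NONE_OPTION] if selected else [])
        probs = score_fn(candidates, selected)
        best = max(candidates, key=lambda option: probs[option])
        if best == NONE_OPTION:
            break  # abstention: nothing else applies
        if selected and probs[best] < threshold:
            break  # adaptive stopping: confidence dropped below the threshold
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_score_fn(candidates: List[str], selected: List[str]) -> Dict[str, float]:
    """Stand-in scorer with fixed weights, renormalized over the candidates."""
    base = {"A": 0.5, "B": 0.3, "C": 0.05, "D": 0.05, NONE_OPTION: 0.1}
    total = sum(base[c] for c in candidates)
    return {c: base[c] / total for c in candidates}


picked = choice_funnel(["A", "B", "C", "D"], toy_score_fn, threshold=0.5)
```

With the toy scorer, the loop picks `A`, then `B`, and stops once the abstention option outranks the remaining choices. Raising `threshold` makes the loop stop earlier, which is the lever that would trade completeness against over-selection in this sketch.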
Abstract
Large language models (LLMs) are increasingly evaluated on single-answer multiple-choice tasks, yet many real-world problems require identifying all correct answers from a set of options. This capability remains underexplored. We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on Select All That Apply (SATA) questions across diverse domains, including reading comprehension, law, and biomedicine. Our evaluation of 27 open-source and proprietary models reveals a significant gap: even the strongest model achieves only 41.8% exact match, exposing LLMs' inability to reliably identify all correct answers. We find that this weakness stems from two core challenges: selection bias (models favor certain choices regardless of content) and count bias (models fail to predict the correct number of answers). To address these issues, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding to guide models toward complete and accurate selections. Choice Funnel achieves up to 29% higher exact match than competitive baselines while reducing inference cost by over 64%. Our findings expose fundamental limitations in current LLMs and introduce a new framework for diagnosing and improving multi-answer reasoning. We release SATA-BENCH and Choice Funnel to promote LLM development for robust decision-making in realistic, multi-answer applications.
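To make the exact-match criterion concrete, the sketch below scores a prediction as correct only when the predicted set of options equals the gold set, and adds a simple signed count difference as one possible way to surface count bias. The function names and the count-difference statistic are illustrative assumptions, not the benchmark's official scoring code.

```python
from typing import Iterable, List, Sequence


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """True only if the predicted option set equals the gold set (order ignored)."""
    return set(predicted) == set(gold)


def exact_match_rate(predictions: Sequence[List[str]],
                     references: Sequence[List[str]]) -> float:
    """Fraction of questions whose predicted set matches the gold set exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, references))
    return hits / len(predictions)


def mean_count_difference(predictions: Sequence[List[str]],
                          references: Sequence[List[str]]) -> float:
    """Mean of (number predicted - number correct); negative means under-selection.

    An illustrative statistic assumed for this sketch, not taken from the source.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    diffs = [len(set(p)) - len(set(g)) for p, g in zip(predictions, references)]
    return sum(diffs) / len(diffs)


# Hypothetical toy data: three questions with option labels as answers.
preds = [["A", "C"], ["B"], ["A"]]
golds = [["C", "A"], ["B", "D"], ["A"]]
rate = exact_match_rate(preds, golds)
count_gap = mean_count_difference(preds, golds)
```

In the toy data, the second question is partially right but still counts as a miss, which is why exact match is a strict metric for multi-answer questions. The negative count gap reflects that the prediction there selects fewer options than the gold answer contains.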
