SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

Weijie Xu, Shixian Cui, Xi Fang, Chi Xue, Stephanie Eckman, Chandan K. Reddy

TL;DR

The paper introduces SATA-Bench, a large, human-validated benchmark for Select All That Apply questions across six domains, revealing persistent gaps in multi-answer reasoning even among state-of-the-art models. It formalizes three systematic biases (unselection, count, and speculation) and demonstrates that traditional prompting and single-answer training are insufficient for SATA tasks. To address these challenges, it proposes Choice Funnel, a decoding strategy that uses token debiasing, abstention, and adaptive stopping to guide models toward complete and correct selections, achieving up to a 29-point gain in exact-match accuracy and a 64% reduction in inference cost. The work also provides comprehensive ablations, a PriDe-based adaptation for debiasing, and an open-source release of SATA-Bench and Choice Funnel to advance robust, multi-answer reasoning in real-world applications.

Abstract

Large language models (LLMs) are increasingly evaluated on single-answer multiple-choice tasks, yet many real-world problems require identifying all correct answers from a set of options. This capability remains underexplored. We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on Select All That Apply (SATA) questions across diverse domains, including reading comprehension, law, and biomedicine. Our evaluation of 27 open-source and proprietary models reveals a significant gap: even the strongest model achieves only 41.8% exact match, exposing LLMs' inability to reliably identify all correct answers. We find that this weakness stems from two core challenges: selection bias (models favor certain choices regardless of content) and count bias (models fail to predict the correct number of answers). To address these issues, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding to guide models toward complete and accurate selections. Choice Funnel achieves up to 29% higher exact match than competitive baselines while reducing inference cost by over 64%. Our findings expose fundamental limitations in current LLMs and introduce a new framework for diagnosing and improving multi-answer reasoning. We release SATA-BENCH and Choice Funnel to promote LLM development for robust decision-making in realistic, multi-answer applications.
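To make the exact-match figure concrete, here is a minimal sketch of how such a metric could be computed for SATA questions. It assumes each prediction and each gold answer is a set of option labels, and that a question counts as correct only when the two sets are identical. The function names and the `mean_count_difference` diagnostic are illustrative assumptions, not the paper's released code or its definition of count bias.

```python
from typing import Iterable, Sequence


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """True only if the predicted option set equals the gold option set."""
    return set(predicted) == set(gold)


def exact_match_rate(
    predictions: Sequence[Iterable[str]], golds: Sequence[Iterable[str]]
) -> float:
    """Fraction of questions whose predicted set matches the gold set exactly."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


def mean_count_difference(
    predictions: Sequence[Iterable[str]], golds: Sequence[Iterable[str]]
) -> float:
    """Hypothetical diagnostic: average of (#predicted - #gold) options.

    A negative value means the model selects too few options on average.
    """
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    diffs = [len(set(p)) - len(set(g)) for p, g in zip(predictions, golds)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy data, not taken from SATA-Bench.
    golds = [{"A", "C"}, {"B"}, {"A", "B", "D"}]
    preds = [{"A"}, {"B"}, {"A", "B", "D"}]
    print(exact_match_rate(preds, golds))
    print(mean_count_difference(preds, golds))
```

In the toy data, the first prediction selects only one of the two correct options, so it earns no credit under exact match even though the option it picked is correct. This is why exact match is a strict measure for multi-answer questions.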

Paper Structure

This paper contains 71 sections, 4 equations, 14 figures, 17 tables, 1 algorithm.

Figures (14)

  • Figure 1: Representative example of an LLM failure on a SATA (Select All That Apply) question. Models often miss valid answers due to unselection, count, and speculation biases. Gemini speculates in this question while GPT-4o underselects. Other models may have unselection bias over C.
  • Figure 2: SATA-Bench Evaluation Dataset Overview. SATA-Bench covers a diverse set of topics and achieves a balance between readability and difficulty (measured by confusion score). d1: Reading Comprehension, d2: Toxicity, d3: News, d4: Biomedicine, d5: Laws, and d6: Events.
  • Figure 3: SATA-Bench Data Curation Process. The source data is converted to SATA format and then filtered for readability, diversity (via question similarity), difficulty (via confusion scoring), and clarity (via human validation). Additional dataset-specific transformation steps are described in the paper's appendix.
  • Figure 4: Representative examples of questions from various data sources used to construct SATA-Bench.
  • Figure 5: Confusion score distribution across all questions before filtering. d1: Reading Comprehension, d2: Toxicity, d3: News, d4: Biomedicine, d5: Laws, and d6: Events.
  • ...and 9 more figures