Multiple-Choice Questions are Efficient and Robust LLM Evaluators

Ziyin Zhang, Zhaokun Jiang, Lizhen Xu, Hongkun Hao, Rui Wang

TL;DR

This paper presents GSM-MC, a multiple-choice (MC) dataset constructed by collecting answers and incorrect predictions on GSM8K from 60 open-source models. It also introduces MATH-MC, constructed from MATH, and PythonIO, a new program reasoning MC dataset constructed from HumanEval and MBPP.

Abstract

We present GSM-MC, a multiple-choice (MC) dataset constructed by collecting answers and incorrect predictions on GSM8K from 60 open-source models. Through extensive experiments, we show that LLMs' performance on the MC version of this popular benchmark is strongly correlated with their performance on the original version and is quite robust to distractor choices and option orders, while the evaluation time is reduced by a factor of up to 30. Following similar procedures, we introduce MATH-MC, constructed from MATH, and PythonIO, a new program reasoning MC dataset constructed from HumanEval and MBPP. Experimental results indicate that LLMs' performance on these MC benchmarks leaves much room for improvement. Our data and code are available at https://github.com/Geralt-Targaryen/MC-Evaluation.
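
The correlation claim in the abstract compares each model's score on the MC benchmark with its score on the original generation benchmark. A minimal sketch of that kind of check is given below, assuming one score per model on each benchmark. The function name `pearson` and its input layout are illustrative assumptions, not taken from the authors' released code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists.

    xs: per-model scores on the generation benchmark (assumed layout).
    ys: per-model scores on the multiple-choice version, in the same model order.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 would mean that models ranking high on one version also rank high on the other. Significance testing (the $p$-values reported in the paper) is not covered by this sketch.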

Paper Structure (15 sections, 12 figures, 5 tables)

Figures (12)

  • Figure 1: An illustrative example of correct, incorrect, and invalid answers to one question from GSM8K (top). After converting to multiple-choice format (bottom), a prediction can always be extracted from model logits.
  • Figure 2: LLMs' answer distributions on GSM8K. Smaller models and aligned models tend to produce more invalid answers.
  • Figure 3: Comparison of answer distribution by aligned models with (top) and without (bottom) applying the instruction template.
  • Figure 4: Frequency of most likely output token over 1K training set problems on GSM-MC by base models (top) and aligned models (bottom). The ground truth answers of the 1K problems are balanced across the four options.
  • Figure 5: Model performance on GSM-MC (with the number of choices ranging from 2 to 8) and the original GSM8K. Each point is one model's score on GSM8K (x-axis) and one version of GSM-MC (y-axis), and the best-fitting line is given in red. The MC scores are strongly correlated with generation scores (Pearson correlation shown in each subplot's title), with a $p$-value less than 0.001 in all cases, indicating statistical significance.
  • ...and 7 more figures
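
The Figure 1 caption notes that, in multiple-choice format, a prediction can always be extracted from model logits. One way this could look is sketched below, assuming the logits of the next token have already been restricted to the option labels and collected in a dictionary. The helper names `predict_option` and `mc_accuracy`, and the dictionary layout, are hypothetical and do not come from the paper.

```python
def predict_option(option_logits):
    """Return the option label with the highest logit.

    option_logits: dict mapping an option label (e.g. "A") to the model's
    logit for that label (assumed input format). Because some label always
    has the maximum logit, a prediction is always produced.
    """
    if not option_logits:
        raise ValueError("need at least one option")
    return max(option_logits, key=option_logits.get)


def mc_accuracy(all_option_logits, gold_labels):
    """Fraction of questions where the highest-logit option is the gold label."""
    if len(all_option_logits) != len(gold_labels) or not gold_labels:
        raise ValueError("need one gold label per question")
    correct = sum(
        predict_option(logits) == gold
        for logits, gold in zip(all_option_logits, gold_labels)
    )
    return correct / len(gold_labels)
```

Since no text has to be generated or parsed, there is no "invalid answer" category in this setup, which is the contrast Figure 1 draws with the original generation format.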