Multiple-Choice Questions are Efficient and Robust LLM Evaluators

Ziyin Zhang; Zhaokun Jiang; Lizhen Xu; Hongkun Hao; Rui Wang

TL;DR

This paper presents GSM-MC, a multiple-choice (MC) dataset built from the answers and incorrect predictions of 60 open-source models on GSM8K. It also introduces MATH-MC, constructed from MATH, and PythonIO, a new program reasoning MC dataset constructed from HumanEval and MBPP.

Abstract

We present GSM-MC, a multiple-choice (MC) dataset constructed by collecting answers and incorrect predictions on GSM8K from 60 open-source models. Through extensive experiments, we show that LLMs' performance on the MC version of this popular benchmark is strongly correlated with their performance on the original version and is quite robust to distractor choices and option orders, while the evaluation time is reduced by a factor of up to 30. Following similar procedures, we introduce MATH-MC, constructed from MATH, and PythonIO, a new program reasoning MC dataset constructed from HumanEval and MBPP. Experimental results indicate that LLMs' performance on these MC benchmarks leaves much room for improvement. Our data and code are available at https://github.com/Geralt-Targaryen/MC-Evaluation.
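The construction described above (a ground-truth answer plus distractors drawn from models' incorrect predictions) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_mc_item`, its parameters, the sampling strategy, and the sample inputs are all assumptions made for the example.

```python
import random


def build_mc_item(question, answer, wrong_predictions, num_choices=4, seed=0):
    """Assemble one multiple-choice item (hypothetical helper).

    `wrong_predictions` stands in for the incorrect answers collected
    from models; how the paper actually selects distractors may differ.
    """
    rng = random.Random(seed)
    # Deduplicate the pool and drop anything equal to the ground truth.
    pool = sorted({p for p in wrong_predictions if p != answer})
    if len(pool) < num_choices - 1:
        raise ValueError("not enough distinct distractors")
    # Sample distractors, add the correct answer, and shuffle the order.
    options = rng.sample(pool, num_choices - 1) + [answer]
    rng.shuffle(options)
    letters = "ABCDEFGH"[:num_choices]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "label": letters[options.index(answer)],
    }


# Illustrative inputs only; these are not taken from the dataset.
item = build_mc_item(
    "How many apples are left?", "18", ["16", "20", "18", "22", "16", "9"]
)
```

Shuffling with a seeded generator keeps the option order reproducible, which matters when comparing scores across different option orders.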

Paper Structure (15 sections, 12 figures, 5 tables)

Introduction
Related Work
Converting GSM8K to Multiple-Choice Format
A Closer Look at LLMs' Performance on GSM8K
Converting to Multiple-Choice Format
Can LLMs Understand Multiple-Choice Questions?
Rationality of MC Evaluation
Correlation between MC Evaluation and Open-Ended Evaluation
Robustness against Distractors and Choice Orders
MATH-MC and PythonIO
MATH
HumanEval and MBPP
Conclusion
Complete Results
Prompt Details and Sample Outputs

Figures (12)

Figure 1: An illustrative example of correct, incorrect, and invalid answers to one question from GSM8K (top). After converting to multiple-choice format (bottom), a prediction can always be extracted from model logits.
Figure 2: LLMs' answer distributions on GSM8K. Smaller models and aligned models tend to produce more invalid answers.
Figure 3: Comparison of answer distribution by aligned models with (top) and without (bottom) applying the instruction template.
Figure 4: Frequency of most likely output token over 1K training set problems on GSM-MC by base models (top) and aligned models (bottom). The ground truth answers of the 1K problems are balanced across the four options.
Figure 5: Model performance on GSM-MC (with the number of choices ranging from 2 to 8) and the original GSM8K. Each point is one model's score on GSM8K (x-axis) and one version of GSM-MC (y-axis), and the best-fitting line is given in red. The MC scores are strongly correlated with generation scores (Pearson correlation shown in each subplot's title), with a $p$-value less than 0.001 in all cases, indicating statistical significance.
...and 7 more figures
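Two ideas from the captions above lend themselves to a short sketch: reading a prediction off the logits of the option tokens (Figure 1) and measuring the Pearson correlation between MC scores and open-ended scores (Figure 5). The helper names and all numbers below are assumptions for illustration; they are not the paper's code or results.

```python
import math


def predict_option(option_logits):
    """Return the option letter with the largest logit.

    `option_logits` maps each option letter to the model's logit for
    that letter as the next token (hypothetical input format).
    """
    return max(option_logits, key=option_logits.get)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up logits: a prediction always exists, even for a weak model.
prediction = predict_option({"A": -1.2, "B": 0.3, "C": 2.1, "D": -0.5})

# Made-up per-model scores, only to show the call shape.
open_ended_scores = [0.10, 0.25, 0.40, 0.55]
mc_scores = [0.30, 0.42, 0.51, 0.66]
r = pearson(open_ended_scores, mc_scores)
```

Because the argmax over option logits is always defined, this style of evaluation cannot yield an invalid answer, unlike free-form generation.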
