A Study on Large Language Models' Limitations in Multiple-Choice Question Answering

Aisha Khatun, Daniel G. Brown

TL;DR

This work systematically probes the MCQ capabilities of 26 small open-source LLMs, revealing widespread failures to understand MCQ tasks and strong dependence on answer ordering. Using the TruthEval dataset and dual evaluation pipelines (text-output heuristics and first-token probabilities) with randomized option orders, the study demonstrates that most models either ignore the task or rely on position biases, with only a few models showing partial order-independence and task understanding. The findings stress caution for MCQ-based evaluation and deployment of small LLMs in real-world settings, and identify promising directions (e.g., certain Mistral models) for improving instruction-following robustness. Overall, the paper highlights the need for robust task-understanding testing and better evaluation metrics when using MCQs to assess LLM capabilities across domains.

Abstract

Large Language Models (LLMs) have become commonplace, particularly with the emergence of open-source models. More importantly, smaller models are well-suited for integration into consumer devices and are frequently employed either as standalone solutions or as subroutines in various AI tasks. Despite their ubiquitous use, there is no systematic analysis of their specific capabilities and limitations. In this study, we tackle one of the most widely used tasks: answering Multiple Choice Questions (MCQs). We analyze 26 small open-source models and find that 65% of the models do not understand the task, only 4 models properly select an answer from the given choices, and only 5 of these models are choice-order independent. These results are rather alarming given the extensive use of MCQ tests with these models. We recommend exercising caution and testing task understanding before using MCQs to evaluate LLMs in any field.
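
As a rough illustration of what "choice-order independent" means in practice, the sketch below shuffles the options of one question several times, asks a model for a letter, and maps each letter back to the option text it pointed at. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `ask` callable, the prompt wording, the function names, and the strict parsing heuristic are all hypothetical.

```python
import random
import re
from collections import Counter
from typing import Callable, List, Optional

LETTERS = "ABCD"  # assumes at most four options per question


def build_prompt(question: str, options: List[str]) -> str:
    """Format a question and its options as a lettered MCQ prompt (hypothetical wording)."""
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(LETTERS, options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_letter(response: str) -> Optional[str]:
    """Heuristic: accept a response only if it starts with a lone A-D letter."""
    match = re.match(r"\s*\(?([A-D])\)?(?![A-Za-z])", response)
    return match.group(1) if match else None


def choices_under_shuffles(
    ask: Callable[[str], str],  # hypothetical stand-in for a model call
    question: str,
    options: List[str],
    n_orders: int = 8,
    seed: int = 0,
) -> Counter:
    """Count which option *text* the model selects across shuffled orderings."""
    rng = random.Random(seed)
    picked: Counter = Counter()
    for _ in range(n_orders):
        order = options[:]
        rng.shuffle(order)
        letter = parse_letter(ask(build_prompt(question, order)))
        if letter is None or LETTERS.index(letter) >= len(order):
            picked["<bad output>"] += 1  # model did not answer with a valid letter
        else:
            picked[order[LETTERS.index(letter)]] += 1
    return picked
```

Under these assumptions, a model that understands the task and is order independent concentrates its count on a single option text regardless of where that option appears. A model that always answers "A" instead follows whichever option happens to be listed first, and a model that ignores the format ends up in the `<bad output>` bucket.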

Paper Structure (29 sections, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Distribution of responses across all prompts for Text Response. The 19 models with Bad Output and only 'A' output are omitted. All models are shown in the appendix.
  • Figure 2: Distribution of responses across all models and prompts for Probability approach. The 17 models with only 'A' output are omitted. All models are shown in the appendix.
  • Figure 3: Distribution of sum of the probabilities of A, B, C, and D tokens across all prompts. The 17 models with zero probabilities are omitted. All models are shown in the appendix.
  • Figure 4: Distribution of responses across all models and prompts for Text Response.
  • Figure 5: Distribution of responses across all models and prompts for Probability approach.
  • ...and 4 more figures