A Study on Large Language Models' Limitations in Multiple-Choice Question Answering
Aisha Khatun, Daniel G. Brown
TL;DR
This work systematically probes the MCQ capabilities of 26 small open-source LLMs, revealing widespread failures to understand MCQ tasks and strong dependence on answer ordering. Using the TruthEval dataset and dual evaluation pipelines (text-output heuristics and first-token probabilities) with randomized option orders, the study demonstrates that most models either ignore the task or rely on position biases, with only a few models showing partial order-independence and task understanding. The findings stress caution for MCQ-based evaluation and deployment of small LLMs in real-world settings, and identify promising directions (e.g., certain Mistral models) for improving instruction-following robustness. Overall, the paper highlights the need for robust task-understanding testing and better evaluation metrics when using MCQs to assess LLM capabilities across domains.
Abstract
Large Language Models (LLMs) have become commonplace, particularly with the emergence of open-source models. More importantly, smaller models are well-suited for integration into consumer devices and are frequently employed either as standalone solutions or as subroutines in various AI tasks. Despite their ubiquitous use, there is no systematic analysis of their specific capabilities and limitations. In this study, we tackle one of the most widely used tasks: answering Multiple-Choice Questions (MCQs). We analyze 26 small open-source models and find that 65% of the models do not understand the task, only 4 models properly select an answer from the given choices, and only 5 of the 26 models are choice-order independent. These results are rather alarming given the extensive use of MCQ tests with these models. We recommend exercising caution and testing task understanding before using MCQs to evaluate LLMs in any field.
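To make "choice-order independent" concrete, the following is a minimal Python sketch of one way such a check could be run. It is not the authors' pipeline. The function names (`shuffled_prompts`, `extract_choice`, `order_consistency`), the prompt layout, and the letter-extraction heuristic are all hypothetical. `model_fn` stands in for any callable that maps a prompt string to the model's text output.

```python
import random
import re
from collections import Counter


def shuffled_prompts(question, options, n_orders=5, seed=0):
    """Yield (prompt, letter -> option) pairs, one per random option order."""
    rng = random.Random(seed)
    letters = "ABCDEFGH"[: len(options)]
    for _ in range(n_orders):
        order = list(options)
        rng.shuffle(order)
        mapping = dict(zip(letters, order))
        lines = [question]
        lines += [f"{letter}. {text}" for letter, text in mapping.items()]
        lines.append("Answer:")
        yield "\n".join(lines), mapping


def extract_choice(output, letters):
    """Hypothetical heuristic: accept only an option letter at the start of the output."""
    match = re.match(rf"\s*\(?([{letters}])\b", output)
    return match.group(1) if match else None


def order_consistency(model_fn, question, options, n_orders=5, seed=0):
    """Return (valid_rate, consistency) over shuffled orders.

    valid_rate: fraction of outputs that name one of the offered letters.
    consistency: fraction of orders in which the most frequently chosen
    option (by content, not by letter) was selected.
    """
    picks = []
    for prompt, mapping in shuffled_prompts(question, options, n_orders, seed):
        letter = extract_choice(model_fn(prompt), "".join(mapping))
        picks.append(mapping.get(letter))  # None when no valid letter was found
    valid = [p for p in picks if p is not None]
    valid_rate = len(valid) / len(picks)
    if not valid:
        return valid_rate, 0.0
    top_count = Counter(valid).most_common(1)[0][1]
    return valid_rate, top_count / len(picks)
```

Under these assumptions, a model that answers by content scores a consistency of 1.0, a model that always emits the same letter scores lower because the option behind that letter changes between orders, and a model that ignores the task shows up as a low valid rate.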
