Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods
Polina Tsvilodub, Hening Wang, Sharon Grosch, Michael Franke
TL;DR
This paper investigates whether predictions from large language models for multiple-choice tasks are stable when the method used to derive an answer choice from model output is varied, using a case study in pragmatic language interpretation. It systematically compares five answer-selection methods (Free Generation, String Scoring, Label Scoring, Rating Aggregation, and Embedding Similarity) across four LLMs on a dataset with seven pragmatic-phenomenon conditions. Both accuracy and goodness-of-fit to human data vary substantially with the model and the method, and no single method is robust across all models; label scoring often performs best, while rating aggregation performs poorly. The study highlights the need for careful reporting and robustness checks to limit researcher degrees of freedom. It proposes that future work cover more datasets and decoding schemes and examine distributional predictions, to improve reproducibility and interpretation.
Abstract
This paper systematically compares different methods of deriving item-level predictions of language models for multiple-choice tasks. It compares scoring methods for answer options based on free generation of responses, various probability-based scores, a Likert-scale style rating method, and embedding similarity. In a case study on pragmatic language interpretation, we find that LLM predictions are not robust under variation of method choice, both within a single LLM and across different LLMs. As this variability entails pronounced researcher degrees of freedom in reporting results, knowledge of the variability is crucial to secure robustness of results and research integrity.
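To make the sensitivity to method choice concrete, the following minimal sketch contrasts two probability-based scores for the same answer options: the summed token log-probability and its length-normalized mean. It is an illustration only, not the paper's implementation. The per-token log-probabilities are invented for the example, and the names `option_token_logprobs`, `sum_logprob`, `mean_logprob`, and `pick` are hypothetical.

```python
# Hypothetical per-token log-probabilities for two answer options.
# These numbers are invented for illustration; they are not from the paper.
option_token_logprobs = {
    "A": [-0.9, -1.1],              # short option, 2 tokens
    "B": [-0.6, -0.7, -0.5, -0.6],  # longer option, 4 tokens
}


def sum_logprob(logprobs):
    """Score an option by the total log-probability of its tokens."""
    return sum(logprobs)


def mean_logprob(logprobs):
    """Score an option by its average per-token log-probability."""
    return sum(logprobs) / len(logprobs)


def pick(options, score):
    """Return the name of the option with the highest score."""
    return max(options, key=lambda name: score(options[name]))


choice_sum = pick(option_token_logprobs, sum_logprob)
choice_mean = pick(option_token_logprobs, mean_logprob)
print(choice_sum, choice_mean)
```

With these assumed numbers, the summed score favors the shorter option A, while the length-normalized score favors option B. The same model output therefore yields different item-level predictions depending only on the scoring rule, which is the kind of variability the paper examines.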
