Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods

Polina Tsvilodub, Hening Wang, Sharon Grosch, Michael Franke

TL;DR

This paper investigates whether predictions from large language models for multiple-choice tasks are stable when the underlying scoring (linking) method is varied, using a case study in pragmatic language interpretation. It systematically compares five answer-selection methods (Free Generation, String Scoring, Label Scoring, Rating Aggregation, and Embedding Similarity) across four LLMs on a dataset with seven pragmatic-phenomenon conditions. The findings show substantial variability in both accuracy and goodness-of-fit to human data depending on the model and method, with no single method robust across all models; label scoring often performs best, while rating aggregation performs poorly. The study highlights the need for careful reporting and robustness checks to mitigate researcher degrees of freedom, and proposes future work to broaden datasets, decoding schemes, and distributional predictions to improve reproducibility and interpretation.

Abstract

This paper systematically compares different methods of deriving item-level predictions of language models for multiple-choice tasks. It compares scoring methods for answer options based on free generation of responses, various probability-based scores, a Likert-scale style rating method, and embedding similarity. In a case study on pragmatic language interpretation, we find that LLM predictions are not robust under variation of method choice, both within a single LLM and across different LLMs. As this variability entails pronounced researcher degrees of freedom in reporting results, knowledge of the variability is crucial to secure robustness of results and research integrity.
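
To make the notion of method-dependent predictions concrete, the following minimal sketch shows how two probability-based scores can select different answer options from the same model output. It is an illustration only, not the paper's implementation: the per-token log-probabilities are made-up numbers, and the two scores (summed and length-normalized log-probability) as well as the helper names `sum_logprob`, `mean_logprob`, and `pick` are assumptions chosen for the example.

```python
# Hypothetical per-token log-probabilities of two answer options,
# conditioned on the same prompt (made-up numbers, not from the paper).
option_token_logprobs = {
    "A": [-1.0, -1.2],              # short option, 2 tokens
    "B": [-0.5, -0.6, -0.5, -0.7],  # longer option, 4 tokens
}


def sum_logprob(token_logprobs):
    """Log-probability of the whole answer string (sum over tokens)."""
    return sum(token_logprobs)


def mean_logprob(token_logprobs):
    """Length-normalized score: average log-probability per token."""
    return sum(token_logprobs) / len(token_logprobs)


def pick(scores):
    """Return the option with the highest score."""
    return max(scores, key=scores.get)


sum_scores = {k: sum_logprob(v) for k, v in option_token_logprobs.items()}
mean_scores = {k: mean_logprob(v) for k, v in option_token_logprobs.items()}

choice_by_sum = pick(sum_scores)    # favors the shorter option here
choice_by_mean = pick(mean_scores)  # favors the longer option here

print("sum of log-probs picks:", choice_by_sum)
print("mean log-prob picks:   ", choice_by_mean)
```

With these toy numbers the summed score prefers the short option and the length-normalized score prefers the long one, which is the kind of disagreement between scoring choices that the abstract describes.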

Paper Structure (14 sections, 3 figures)

Figures (3)

  • Figure 1: Left: example of stimulus material (irony condition) for different methods. Sequences for which scores are retrieved are underlined. The trigger sentence is in boldface. $T$, $C_i$ and $Q$ are included in all methods' input prompts $I_i$. Colors indicate the text which is additionally used for the respective method. Right: set of all relevant scores.
  • Figure 2: Task accuracy for different methods and models, separately for different conditions. Bars show accuracy scores, which are averages over different versions of each method (e.g., random seeds or different scores). The horizontal bar shows the accuracy expected from random guessing in each condition.
  • Figure 3: Results of methods comparison, based on accuracy (proportion of target choices) and goodness-of-fit to the human data. Different scores are displayed on the x-axis. For ease of visual comparison, goodness-of-fit is shown as difference in log-likelihood of the human data under the LLM predictor, compared against one of the worst performing models (higher is better).
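
The goodness-of-fit measure named in the Figure 3 caption, a difference in log-likelihood of the human data relative to a reference predictor, can be sketched as follows. This is a hedged illustration under stated assumptions: the human choice counts and predicted probabilities are invented, the likelihood is taken to be multinomial up to an additive constant, and the function name `log_likelihood` and the `eps` floor are hypothetical choices, not details from the paper.

```python
import math


def log_likelihood(human_counts, predicted_probs, eps=1e-9):
    """Log-likelihood of human choice counts under predicted option
    probabilities, summed over items (multinomial, up to a constant).
    `eps` is an assumed floor that avoids log(0)."""
    total = 0.0
    for counts, probs in zip(human_counts, predicted_probs):
        for count, prob in zip(counts, probs):
            total += count * math.log(max(prob, eps))
    return total


# Made-up data: two items with two answer options each.
human_counts = [[8, 2], [3, 7]]               # how often humans chose each option
model_probs = [[0.7, 0.3], [0.4, 0.6]]        # hypothetical LLM-derived predictor
reference_probs = [[0.5, 0.5], [0.5, 0.5]]    # hypothetical weak reference predictor

ll_model = log_likelihood(human_counts, model_probs)
ll_reference = log_likelihood(human_counts, reference_probs)

# Higher is better: positive values mean the model fits the human
# data better than the reference predictor.
delta = ll_model - ll_reference
print("difference in log-likelihood:", delta)
```

Reporting the difference rather than the raw log-likelihood puts all predictors on a common scale anchored at the reference, which is what makes the visual comparison in the figure easier.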