Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions

Léo Labat, Etienne Ollion, François Yvon

TL;DR

This study asks whether multilingual LLMs answer value-laden MCQs consistently across languages or express language-specific preferences. It introduces MEVS, a human-translated corpus aligned across 8 European languages, and evaluates over 30 open-weight models under extensive prompt variation (answer order, tail character, symbol type, language) using MCP with log-probabilities. The largest instruction-tuned models are more consistent overall, but cross-language behavior remains highly question-dependent: some items yield cross-language agreement, while others show language-specific divergence. Mutual information analyses show that language is a strong predictor of responses for several questions, suggesting selective language sensitivity rather than uniformly polyglot or uniformly multitude-like behavior. The work provides a dataset and a framework for studying how the values expressed by LLMs vary across languages, and points to deeper mechanistic and scalability analyses as future directions.

Abstract

Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper investigates the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e., do they behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers, and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting further study of the selective effect of preference fine-tuning.

Paper Structure (39 sections, 7 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Two Competing Hypotheses. The polyglot hypothesis assumes that a model has a consistent response when asked the same question in different languages. The multitude hypothesis posits that the model's answer depends on the language of the question, as if it contained a multitude of monolingual models expressing divergent values through a single model.
  • Figure 2: Entropies of Response Distributions by Model $\times$ Question (3-Point Likert Scale)
  • Figure 3: Mutual Information (in Bits) between Response (3-Point Scale) and Language by Question and Model
  • Figure 4: Plurality Answer Counts Across 8 Questions on 3-Point Scales. Q13, Q44, Q17 and Q43 have the highest $I(R;L)$, while Q41, Q118, Q42 and Q120 have the lowest $I(R;L)$. Green: Agree, White: Neutral, Red: Disagree. For all questions, see Appendix \ref{appendix:lang}.
  • Figure 5: Counts of Language Plurality Answers for the Most Consistent Models on a 3-Point Scale By Question. Green: Agree, White: Neutral, Red: Disagree. Reading example: for Llama-3.1-70B-Instruct, 7 languages have Agree as their most frequent response to Q13, while only one has Disagree as its plurality answer and no language has Neutral as its dominant answer.
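The quantities named in the captions above (response entropy, the mutual information $I(R;L)$ between response and language, and per-language plurality answers) can be illustrated with a minimal sketch. It assumes simple plug-in estimates from observed (language, response) pairs; the paper's exact estimator is not given in this section, and the function names and toy data below are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy (in bits) of a response distribution given as counts."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def mutual_information_bits(pairs):
    """Plug-in estimate of I(R;L) in bits from (language, response) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    lang = Counter(l for l, _ in pairs)
    resp = Counter(r for _, r in pairs)
    mi = 0.0
    for (l, r), c in joint.items():
        # p(l, r) * log2( p(l, r) / (p(l) * p(r)) )
        mi += (c / n) * math.log2(c * n / (lang[l] * resp[r]))
    return mi


def plurality_by_language(pairs):
    """Most frequent response per language (ties broken by first occurrence)."""
    by_lang = {}
    for l, r in pairs:
        by_lang.setdefault(l, Counter())[r] += 1
    return {l: c.most_common(1)[0][0] for l, c in by_lang.items()}


# Toy data (invented for illustration, not taken from the paper):
# "multitude-like": the answer is fully determined by the language.
multitude = [("en", "Agree")] * 4 + [("fr", "Disagree")] * 4
# "polyglot-like": the same answer mix in both languages.
polyglot = [("en", "Agree"), ("en", "Agree"), ("en", "Disagree"),
            ("fr", "Agree"), ("fr", "Agree"), ("fr", "Disagree")]

print(mutual_information_bits(multitude))  # language fully predicts the answer
print(mutual_information_bits(polyglot))   # language carries no information
print(plurality_by_language(polyglot))
```

In this toy setting, the language-determined case yields 1 bit of mutual information (the maximum for two equally frequent languages), while the case with identical answer mixes yields 0 bits. Counting how many languages share each plurality answer gives the kind of tally shown in Figures 4 and 5.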