XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs

Linyang He, Ercong Nie, Sukru Samet Dindar, Arsalan Firoozi, Adrian Florea, Van Nguyen, Corentin Puffay, Riki Shimizu, Haotian Ye, Jonathan Brennan, Helmut Schmid, Hinrich Schütze, Nima Mesgarani

TL;DR

XCOMPS presents a multilingual benchmark of conceptual minimal pairs across 17 languages to evaluate whether LLMs encode language-independent conceptual knowledge. The authors deploy three evaluation modalities—metalinguistic prompting, direct probability measurement, and neurolinguistic probing—to compare base, instruction-tuned, and distillation-based models, revealing language-dependent competence gaps and morphology-related effects. Key findings show cross-language variability, improvements in surface performance with instruction tuning but limited gains in internal competence, and gains in low-resource language competence via distillation with some task trade-offs; morphological complexity imposes deeper encoding requirements. The work provides a scalable framework for assessing multilingual conceptual reasoning and highlights gaps toward universal semantic representations.

Abstract

We introduce XCOMPS, a multilingual conceptual minimal pair dataset covering 17 languages. Using this dataset, we evaluate LLMs' multilingual conceptual understanding through metalinguistic prompting, direct probability measurement, and neurolinguistic probing. By comparing base, instruction-tuned, and knowledge-distilled models, we find that: 1) LLMs exhibit weaker conceptual understanding for low-resource languages, and accuracy varies across languages despite being tested on the same concept sets. 2) LLMs excel at distinguishing concept-property pairs that are visibly different but exhibit a marked performance drop when negative pairs share subtle semantic similarities. 3) Instruction tuning improves performance in concept understanding but does not enhance internal competence; knowledge distillation can enhance internal competence in conceptual understanding for low-resource languages with limited gains in explicit task performance. 4) More morphologically complex languages yield lower concept understanding scores and require deeper layers for conceptual reasoning.
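
As a rough illustration of direct probability measurement on minimal pairs, the sketch below scores each sentence of a pair by its total log-probability and counts the pair as correct when the acceptable sentence scores higher. This is not the authors' implementation: the toy unigram scorer stands in for an LLM's token log-probabilities, and the function names (`train_unigram`, `sentence_logprob`, `minimal_pair_accuracy`), the tiny corpus, and the example pairs are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus, alpha=1.0):
    """Toy add-alpha unigram model (assumption: a stand-in for LLM log-probs)."""
    counts = Counter(tok for sent in corpus for tok in sent.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # reserve one slot for unseen tokens

    def logprob(token):
        return math.log((counts.get(token, 0) + alpha) / (total + alpha * vocab_size))

    return logprob


def sentence_logprob(sentence, logprob):
    """Sum of per-token log-probabilities for a whitespace-tokenized sentence."""
    return sum(logprob(tok) for tok in sentence.lower().split())


def minimal_pair_accuracy(pairs, logprob):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable one scores higher."""
    correct = sum(
        sentence_logprob(good, logprob) > sentence_logprob(bad, logprob)
        for good, bad in pairs
    )
    return correct / len(pairs)


# Hypothetical toy corpus and conceptual minimal pairs (not taken from XCOMPS).
corpus = [
    "a toaster is used for heating food",
    "a toaster heats bread",
    "a sofa is used for sitting",
]
pairs = [
    ("a toaster is used for heating food", "a sofa is used for heating food"),
    ("a toaster heats bread", "a sofa heats bread"),
]

scorer = train_unigram(corpus)
accuracy = minimal_pair_accuracy(pairs, scorer)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

A unigram scorer only reflects word frequency, so it says nothing about conceptual knowledge. In a real evaluation, `sentence_logprob` would be replaced by log-probabilities from the model under test, while the pairwise comparison stays the same.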

Paper Structure

This paper contains 39 sections, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Does conceptual knowledge (e.g., "toaster used for heating food") remain language-independent for LLMs?
  • Figure 2: Metalinguistic prompting (meta), direct probability measurement (direct), and minimal pair probing (neuro) results on XCOMPS. The meta method evaluates LLMs' language performance, the neuro method evaluates LLMs' language competence, and the direct method falls in between. Languages are grouped according to morphological typology. Neuro-probing is a layer-wise method, so we use the maximum value across all layers to compare with meta and direct. Differences between base and instruct/distill models can be found in Figure \ref{fig:diff} in the Appendix.
  • Figure 3: Linear correlation among meta, direct, and neuro evaluation results for all four tasks. Linear correlations for each individual task can be found in Figures \ref{fig:linear_cor_base}, \ref{fig:linear_cor_instruct}, and \ref{fig:linear_cor_deepseek} in the Appendix.
  • Figure 4: Averaged results across different language types. English results are dropped to make comparison more reliable among low-resource languages.
  • Figure 5: Layer-wise minimal pair probing results on XCOMPS. Layer-wise performance differences between base and instruct/distill models can be found in Figure \ref{fig:neuro_diff} in the Appendix.
  • ...and 9 more figures