XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs
Linyang He, Ercong Nie, Sukru Samet Dindar, Arsalan Firoozi, Adrian Florea, Van Nguyen, Corentin Puffay, Riki Shimizu, Haotian Ye, Jonathan Brennan, Helmut Schmid, Hinrich Schütze, Nima Mesgarani
TL;DR
XCOMPS is a multilingual benchmark of conceptual minimal pairs covering 17 languages, designed to test whether LLMs encode language-independent conceptual knowledge. The authors use three evaluation methods (metalinguistic prompting, direct probability measurement, and neurolinguistic probing) to compare base, instruction-tuned, and knowledge-distilled models. Key findings: performance varies across languages; instruction tuning improves surface performance but yields limited gains in internal competence; distillation improves competence for low-resource languages, with some task trade-offs; and morphological complexity imposes deeper encoding requirements. The work provides a scalable framework for assessing multilingual conceptual reasoning and highlights the gaps that remain on the way to universal semantic representations.
Abstract
We introduce XCOMPS, a multilingual conceptual minimal pair dataset covering 17 languages. Using this dataset, we evaluate LLMs' multilingual conceptual understanding through metalinguistic prompting, direct probability measurement, and neurolinguistic probing. By comparing base, instruction-tuned, and knowledge-distilled models, we find that: 1) LLMs exhibit weaker conceptual understanding for low-resource languages, and accuracy varies across languages even though all languages are tested on the same concept sets. 2) LLMs excel at distinguishing concept-property pairs that are visibly different but exhibit a marked performance drop when negative pairs share subtle semantic similarities. 3) Instruction tuning improves performance in concept understanding but does not enhance internal competence; knowledge distillation can enhance internal competence in conceptual understanding for low-resource languages, with limited gains in explicit task performance. 4) Morphologically more complex languages yield lower concept understanding scores and require deeper layers for conceptual reasoning.
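To make the idea of direct probability measurement on minimal pairs concrete, here is a minimal sketch in Python. It assumes a common scoring convention that the abstract does not spell out: a pair counts as correct when the model assigns a higher sentence log-probability to the acceptable sentence than to the unacceptable one. The function name `minimal_pair_accuracy`, the toy scorer, and the example sentences are all hypothetical illustrations, not items or code from XCOMPS.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    sentence_logprob: Callable[[str], float],
) -> float:
    """Return the fraction of (acceptable, unacceptable) pairs in which
    the acceptable sentence receives the strictly higher log-probability."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(
        1 for good, bad in pairs if sentence_logprob(good) > sentence_logprob(bad)
    )
    return correct / len(pairs)


# Hypothetical stand-in for a language model: a lookup table of made-up
# sentence log-probabilities. A real evaluation would sum the model's
# token log-probabilities for each sentence instead.
TOY_SCORES = {
    "A robin can fly.": -4.0,
    "A penguin can fly.": -6.5,
    "A knife is used for cutting.": -5.0,
    "A spoon is used for cutting.": -4.5,
}


def toy_logprob(sentence: str) -> float:
    return TOY_SCORES[sentence]


# Illustrative pairs (not taken from the dataset): the negative concept is
# semantically close to the positive one, which is the harder setting.
PAIRS = [
    ("A robin can fly.", "A penguin can fly."),
    ("A knife is used for cutting.", "A spoon is used for cutting."),
]

accuracy = minimal_pair_accuracy(PAIRS, toy_logprob)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

With the made-up scores above, the toy scorer gets the first pair right and the second wrong. Swapping `toy_logprob` for a function that queries an actual model is the only change needed to apply the same comparison to real data, under the stated scoring assumption.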
