What if I ask in \textit{alia lingua}? Measuring Functional Similarity Across Languages
Debangan Mishra, Arihant Rastogi, Agyeya Negi, Shashwat Goel, Ponnurangam Kumaraguru
TL;DR
This work studies cross-lingual consistency in large language models by applying the CAPA functional similarity metric $\kappa_p$ to GlobalMMLU across 20 languages. $\kappa_p$ is a chance-adjusted agreement measure that accounts for model accuracy; scores are micro-averaged over the multilingual benchmark to compare outputs within one model across languages (intra-model) and between different models (inter-model). The study finds that cross-language output similarity increases with model size and capability, and that intra-model consistency typically exceeds inter-model agreement. Consistency also varies with domain and language resources: STEM domains and high-resource languages show stronger coherence. The results establish $\kappa_p$ as a practical tool for evaluating multilingual reliability and guiding the development of more consistent multilingual systems, with implications for translation, code-mixing interpretation, and cross-lingual task transfer.
Abstract
How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $\kappa_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model's responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $\kappa_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.
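
To make the idea of a chance-adjusted, accuracy-aware agreement score concrete, here is a minimal Python sketch. It is a simplified discrete illustration, not the probabilistic $\kappa_p$ used in the paper. The function name `chance_adjusted_agreement`, the default of four answer options, and the assumption that wrong answers are spread uniformly over the incorrect options are all assumptions made for this example.

```python
import math
from typing import Sequence


def chance_adjusted_agreement(
    preds_a: Sequence[int],
    preds_b: Sequence[int],
    gold: Sequence[int],
    num_options: int = 4,
) -> float:
    """Simplified discrete agreement score, adjusted for accuracy.

    Hypothetical helper for illustration only. It assumes hard (argmax)
    predictions and that errors are uniform over the wrong options.
    """
    if not (len(preds_a) == len(preds_b) == len(gold)) or len(gold) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if num_options < 2:
        raise ValueError("num_options must be at least 2")

    n = len(gold)
    acc_a = sum(p == g for p, g in zip(preds_a, gold)) / n
    acc_b = sum(p == g for p, g in zip(preds_b, gold)) / n

    # Observed agreement: fraction of questions with identical answers.
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n

    # Agreement expected from two independent answer sets with these
    # accuracies: both correct, or both wrong on the same wrong option.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b) / (num_options - 1)

    if math.isclose(expected, 1.0):
        return float("nan")  # undefined when agreement is fully expected
    return (observed - expected) / (1 - expected)


# Pooling all questions from all subjects into one list before calling the
# function corresponds to micro-averaging over the benchmark.
gold = [0, 1, 2, 3]
answers_lang_1 = [0, 1, 0, 0]  # hypothetical answers in one language
answers_lang_2 = [0, 0, 2, 1]  # same model and questions, another language
score = chance_adjusted_agreement(answers_lang_1, answers_lang_2, gold)
```

Under this sketch, passing one model's answers in two languages gives an intra-model cross-lingual score, while passing two different models' answers in the same language gives an inter-model score. A value near 1 means the two answer sets agree far more than their accuracies alone would predict, and a value near 0 means agreement is about what independent answer sets with those accuracies would produce.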
