What if I ask in alia lingua? Measuring Functional Similarity Across Languages

Debangan Mishra, Arihant Rastogi, Agyeya Negi, Shashwat Goel, Ponnurangam Kumaraguru

TL;DR

This work addresses cross-lingual consistency in large language models by applying the CAPA/functional similarity metric $\kappa_p$ to GlobalMMLU across 20 languages. It defines $\kappa_p$ as a chance-adjusted measure that accounts for accuracy, and uses micro-averaging over a multilingual benchmark to compare intra-model and inter-model outputs. The study finds that cross-language output similarity increases with model size and capability, with intra-model consistency typically exceeding inter-model agreement, and observes domain- and resource-related variations (STEM domains and high-resource languages show stronger coherence). The results establish $\kappa_p$ as a practical tool for evaluating multilingual reliability and guiding the development of more consistent multilingual systems, with implications for translation, code-mixing interpretation, and cross-lingual task transfer.

Abstract

How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $\kappa_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model's responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $\kappa_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.
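The text above describes $\kappa_p$ as a chance-adjusted similarity that accounts for accuracy, but it does not state the formula. The sketch below is therefore only a simplified illustration of that idea on hard (single-choice) answers, not the authors' definition. The function name `chance_adjusted_agreement`, the uniform-over-wrong-options assumption, and the default of four answer options are all assumptions introduced here.

```python
from typing import Sequence


def chance_adjusted_agreement(
    preds_a: Sequence[int],
    preds_b: Sequence[int],
    gold: Sequence[int],
    num_options: int = 4,
) -> float:
    """Simplified, hypothetical stand-in for an accuracy-aware agreement score.

    preds_a / preds_b: chosen option index per question, e.g. the same model
    prompted in two languages, or two models prompted in one language.
    gold: correct option index per question.
    """
    n = len(gold)
    if n == 0 or len(preds_a) != n or len(preds_b) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    acc_a = sum(p == g for p, g in zip(preds_a, gold)) / n
    acc_b = sum(p == g for p, g in zip(preds_b, gold)) / n

    # Observed agreement: fraction of questions with the same chosen option.
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n

    # Agreement expected from accuracy alone, ASSUMING independent answers and
    # errors spread uniformly over the (num_options - 1) wrong options.
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b) / (num_options - 1)

    if expected >= 1.0:
        # Both answer sets are perfectly accurate; agreement is fully
        # explained by accuracy. Returning 0.0 is a convention chosen here.
        return 0.0
    return (observed - expected) / (1.0 - expected)
```

Under this sketch, micro-averaging corresponds to pooling the questions from all subjects into the three input sequences before calling the function, rather than averaging per-subject scores. A score near 0 means the two answer sets agree no more than their accuracies would predict; a score near 1 means they also make the same mistakes.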

Paper Structure

This paper contains 14 sections, 4 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Our Main Findings: We use functional similarity to measure the consistency of model outputs across different languages. We find: (1) as language models get bigger and more capable, their outputs become more similar across languages; (2) models tend to be more self-consistent across languages than when comparing different models in a common language.
  • Figure 2: $\kappa_p$ correlates positively with model size and accuracy. (a) $\kappa_p$ averaged over languages correlates positively with model size. (b) Similarly, $\kappa_p$ averaged over languages correlates positively with model performance. This indicates that a model's outputs become more similar across languages as its capability and size grow.
  • Figure 3: Models answer more similarly across languages for STEM than other domains. Each heatmap cell represents the $\kappa_p$ and accuracy averaged over languages. For example, a cell value of (0.3 | 0.4) for a given model and category would represent an average $\kappa_p$ of 0.3 and an average accuracy of 40%, both averaged over all the languages.
  • Figure 4: Intra-model $\kappa_p$ scores are higher for STEM categories (Mathematics, Physics, Computer Science) than for Humanities categories (Philosophy, Psychology, Sociology). (a) Gemma model family. (b) Qwen model family.
  • Figure 5: Frequency density distribution of the intra-model (across 20 language pairs) and inter-model (1 model vs remaining 7) $\kappa_p$ scores along with the p-values of the Mann-Whitney U Test. For every model, intra-model similarity is significantly greater than inter-model similarity.
  • ...and 3 more figures
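
The Figure 5 caption compares intra-model and inter-model $\kappa_p$ scores with a Mann-Whitney U test. As a rough illustration of what such a comparison computes, the sketch below implements the U statistic with a one-sided normal approximation using only the standard library. The one-sided alternative, the absence of tie and continuity corrections, the function name, and the example scores are all assumptions made here, not details taken from the paper.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u_greater(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[float, float]:
    """Return (U, p) for the one-sided alternative "x tends to exceed y".

    Uses the large-sample normal approximation without tie or continuity
    correction, so p-values for very small samples are only approximate.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # U counts the pairs (a, b) with a > b; ties contribute one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u

    # Upper-tail probability of a standard normal at z.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return u, p_value


# Made-up scores for illustration only; these are not results from the paper.
intra_scores = [0.6, 0.7, 0.8]
inter_scores = [0.1, 0.2, 0.3]
u_stat, p = mann_whitney_u_greater(intra_scores, inter_scores)
```

A small p-value here would support the claim that the first sample (intra-model scores) is stochastically larger than the second (inter-model scores).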