Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis

Daoyang Li, Haiyan Zhao, Qingcheng Zeng, Mengnan Du

TL;DR

This work studies how large language models encode multilingual information by extending probing beyond English to 16 languages, using linear classifier probes on decoder-only LLMs. The authors analyze layer-wise internal representations and probing-vector similarities on two datasets (Cities and Opinion) with five open-source models from the Qwen and Gemma series. The key findings are a consistent performance gap between high-resource and low-resource languages, English-like accuracy improvements in deeper layers for high-resource languages, and stronger cross-language similarity among high-resource languages. These results highlight systemic limitations in current LLM multilingual capabilities and motivate the development of more equitable multilingual models, potentially extending to multimodal settings.

Abstract

Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world's languages. In this paper, we extend these probing methods to a multilingual context, investigating the behaviors of LLMs across diverse languages. We conduct experiments on several open-source LLMs, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results highlight significant disparities in LLMs' multilingual capabilities and emphasize the need for improved modeling of low-resource languages.
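
To make the probing setup concrete, the following is a minimal sketch of a linear classifier probe and of the cosine similarity between probing vectors. It is not the authors' implementation: the activations are synthetic Gaussian vectors standing in for one layer's hidden states, the language names (`lang_a`, `lang_b`, `lang_c`), dimensions, and training hyperparameters are all hypothetical, and the probe is a plain logistic regression trained by gradient descent. Two of the synthetic languages encode the label along the same axis and the third along a different one, which is an assumption chosen only to illustrate how shared structure shows up as higher probe similarity.

```python
import math
import random


def make_activations(direction, n, rng, shift=2.0):
    """Synthetic stand-in for one layer's hidden states (an assumption):
    Gaussian noise plus a label-dependent shift along `direction`."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + sign * shift * d for d in direction])
        ys.append(y)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_probe(xs, ys, epochs=150, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent.
    Returns the weight vector (the 'probing vector') and the bias."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(dot(w, x) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    hits = sum((dot(w, x) + b > 0) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(ys)


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


rng = random.Random(0)
DIM = 8
shared_axis = [1.0] + [0.0] * (DIM - 1)       # label axis used by lang_a and lang_b
other_axis = [0.0, 1.0] + [0.0] * (DIM - 2)   # different label axis for lang_c
directions = {"lang_a": shared_axis, "lang_b": shared_axis, "lang_c": other_axis}

probes, accs = {}, {}
for name, direction in directions.items():
    train_x, train_y = make_activations(direction, 300, rng)
    test_x, test_y = make_activations(direction, 200, rng)
    w, b = train_probe(train_x, train_y)
    probes[name] = w
    accs[name] = accuracy(w, b, test_x, test_y)  # held-out probing accuracy

sim_ab = cosine(probes["lang_a"], probes["lang_b"])
sim_ac = cosine(probes["lang_a"], probes["lang_c"])
print(accs, round(sim_ab, 3), round(sim_ac, 3))
```

In a layer-wise analysis, the same procedure would be repeated on the activations of each layer, giving one accuracy value and one probing vector per layer and language.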

Paper Structure

This paper contains 10 sections, 3 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Layer-wise probing accuracy of 5 open-source LLMs across 16 languages.
  • Figure 2: (a) Heatmap of probing-vector similarity (correlation) across languages; (b) cosine similarity of probing vectors with English. (Model: Qwen-1.8B, Dataset: Opinion).
  • Figure 3: Prompt templates of all languages used in experiments.
  • Figure 4: Additional results for multilingual accuracy of Qwen- and Gemma-series models on the Opinion dataset.
  • Figure 5: Heatmap of probing-vector similarity (correlation) across languages.
  • ...and 1 more figure