Understanding the role of FFNs in driving multilingual behaviour in LLMs

Sunit Bhattacharya, Ondřej Bojar

TL;DR

This work probes the multilingual behavior of FFN sublayers in the XGLM family by collecting per-layer activation snapshots during next-token prediction on prefixes drawn from parallel multilingual data. It introduces activation flatness, an entropy-based metric, to quantify language-specific versus multilingual processing across detectors and combinators, examining four model sizes (568M, 1.7B, 2.9B, 7.5B) trained on 500B tokens in 30 languages. Key findings show detectors are multilingual in early layers and become language-specific near the output, while combinators are multilingual in middle layers with final layers showing language-specific patterns in larger models; a notable anomaly is the 2.9B model, whose over-depth appears to degrade generation. The results illuminate how architecture, layer depth, and model size shape multilingual representations in FFNs, offering a lens for understanding cross-language transfer and guiding design choices for multilingual LLMs.

Abstract

Multilingualism in Large Language Models (LLMs) is an as yet under-explored area. In this paper, we conduct an in-depth analysis of the multilingual capabilities of a family of Large Language Models, examining its architecture, activation patterns, and processing mechanisms across languages. We introduce novel metrics to probe the models' multilingual behaviour at different layers and shed light on the impact of architectural choices on multilingual processing. Our findings reveal different patterns of multilingual processing in the sublayers of the models' Feed-Forward Networks (FFNs). Furthermore, we uncover the phenomenon of "over-layerization" in certain model configurations, where increasing layer depth without corresponding adjustments to other parameters may degrade model performance. Through comparisons within and across languages, we demonstrate the interplay between model architecture, layer depth, and the multilingual processing capabilities of LLMs trained on multiple languages.
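
The summary above names an entropy-based "activation flatness" metric but does not give its formula. As an illustration only, the sketch below assumes one plausible reading: the normalized Shannon entropy of a single neuron's mean activation across languages, so that 1.0 means the neuron responds equally to every language and 0.0 means it responds to one language only. The function name `activation_flatness` and its input layout are hypothetical and are not taken from the paper.

```python
import math


def activation_flatness(mean_activations):
    """Normalized Shannon entropy of per-language activation values.

    ASSUMPTION: this is an illustrative definition, not the paper's formula.
    `mean_activations` holds one non-negative value per language for a
    single neuron (for example, its mean activation on that language).
    Returns a value in [0, 1]: 1.0 for a perfectly even spread across
    languages, 0.0 when all activation mass sits on one language.
    """
    if len(mean_activations) < 2:
        raise ValueError("need activations for at least two languages")
    if any(a < 0 for a in mean_activations):
        raise ValueError("activations must be non-negative")

    total = sum(mean_activations)
    if total == 0:
        # The neuron never fired; flatness is undefined, reported here as 0.0.
        return 0.0

    # Turn the activations into a probability distribution over languages.
    probs = [a / total for a in mean_activations]
    # Shannon entropy, skipping zero-probability terms (0 * log 0 := 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Divide by the maximum possible entropy so the result lies in [0, 1].
    return entropy / math.log(len(mean_activations))


if __name__ == "__main__":
    # Hypothetical values for three languages (e.g. English, German, Hindi).
    print(activation_flatness([0.4, 0.4, 0.4]))  # evenly spread
    print(activation_flatness([0.9, 0.0, 0.0]))  # single-language neuron
```

Under this assumed definition, averaging the score over all neurons in a layer would give a per-layer value that could be compared across depths and model sizes.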

Paper Structure (18 sections, 3 equations, 18 figures, 1 algorithm)

Figures (18)

  • Figure 1: Transformer block and the structure of FFN
  • Figure 2: Activation frequency for detectors along with standard deviation plotted for English, German and Hindi.
  • Figure 3: Activation frequency for combinators along with standard deviation plotted for English, German and Hindi.
  • Figure 4: Activation flatness for detectors
  • Figure 5: Normalized activations for detectors: XGLM 2.9B for layers 1, 27, 45, 48
  • ...and 13 more figures