Pruning Multilingual Large Language Models for Multilingual Inference

Hwichan Kim; Jun Suzuki; Tosho Hirasawa; Mamoru Komachi

TL;DR

The paper addresses the persistent gap between English and non-English performance in multilingual LLMs. It shows that few-shot translation demonstrations activate large-magnitude hidden features, and that these features underpin the alignment between languages. Building on this, it proposes a Wanda-style pruning approach that retains the weights associated with these large-magnitude features, forcing the model to rely on translation-aligned signals in zero-shot tasks. This improves non-English inference in XGLM and mGPT; BLOOM benefits from a refined metric that reduces noise from programming-language generation. Empirical results show improved cross-lingual transfer and RankC consistency for several models and languages, and the findings extend to further tasks (XNLI, MARC) and a larger model (XGLM-7.5B). The approach offers a practical way to enhance multilingual capabilities without fine-tuning. The paper acknowledges limitations in hyperparameter exploration and architectural analysis, as well as concerns about bias in multilingual alignment.

Abstract

Multilingual large language models (MLLMs), trained on balanced multilingual data, demonstrate better zero-shot learning performance in non-English languages compared to large language models trained on English-dominant data. However, the disparity in performance between English and non-English languages remains a challenge yet to be fully addressed. A distinctive characteristic of MLLMs is their high-quality translation capabilities, indicating an acquired proficiency in aligning between languages. This study explores how to enhance the zero-shot performance of MLLMs in non-English languages by leveraging their alignment capability between English and non-English languages. To achieve this, we first analyze the behavior of MLLMs when performing translation and reveal that there are large-magnitude features that play a critical role in the translation process. Inspired by these findings, we retain the weights associated with operations involving the large-magnitude features and prune other weights to force MLLMs to rely on these features for tasks beyond translation. We empirically demonstrate that this pruning strategy can enhance the MLLMs' performance in non-English languages.
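
To make the pruning idea concrete, the following is a minimal sketch of a Wanda-style importance score, in which each weight is scored by its absolute value times the L2 norm of the input feature it multiplies, and the lowest-scoring weights in each output row are zeroed. It is an illustration under stated assumptions, not the authors' implementation: the function names `wanda_style_mask` and `apply_mask`, the 50% sparsity, and the toy matrices are all hypothetical, and the calibration activations would in practice come from few-shot translation demonstrations rather than hand-written numbers.

```python
import math

def wanda_style_mask(weights, activations, sparsity=0.5):
    """Return a 0/1 keep-mask with the same shape as `weights`.

    weights:     list of rows, shape (out_features, in_features)
    activations: list of rows, shape (num_tokens, in_features)
    sparsity:    fraction of weights pruned in each output row (assumed value)
    """
    in_features = len(weights[0])
    # L2 norm of each input feature over all calibration tokens.
    feat_norm = [
        math.sqrt(sum(tok[j] ** 2 for tok in activations))
        for j in range(in_features)
    ]
    n_prune = int(in_features * sparsity)
    mask = []
    for row in weights:
        # Importance score of each weight: |W_ij| * ||X_j||_2.
        scores = [abs(w) * feat_norm[j] for j, w in enumerate(row)]
        # Prune the lowest-scoring weights within this output row.
        order = sorted(range(in_features), key=lambda j: scores[j])
        pruned = set(order[:n_prune])
        mask.append([0 if j in pruned else 1 for j in range(in_features)])
    return mask

def apply_mask(weights, mask):
    """Zero out the pruned weights."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]

# Toy data (hypothetical): input feature 2 has a much larger magnitude
# than the others, standing in for a "large-magnitude feature".
W = [[0.9, 0.1, -0.5, 0.2],
     [0.05, -0.8, 0.3, 0.4]]
X = [[0.1, 0.1, 4.0, 0.2],
     [0.2, 0.1, 3.0, 0.1]]

mask = wanda_style_mask(W, X, sparsity=0.5)
pruned_W = apply_mask(W, mask)
```

In the second row, the weight 0.3 survives while the larger weight 0.4 is pruned, because 0.3 multiplies the large-magnitude feature. This is the sense in which the score keeps weights tied to large-magnitude features rather than simply the largest weights.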

Paper Structure (28 sections, 8 equations, 42 figures, 11 tables)

Introduction
Task Setting
Related Works
English LLM and their characteristics
Enhancing multilingual performance of multilingual pre-trained models
Detecting Translation Features
RQ1: Do few-shot translation demonstrations activate specific features?
RQ2: Are the large magnitude features relevant for translation performance?
Experimental Settings
Experimental Results
Answer to RQ1: Few-shot translation demonstrations activate specific features up to middle layer.
Answer to RQ2: The large magnitude features are relevant for maintaining translation performance.
Multilinguality of Pruned MLLMs
Experimental Settings
Experimental Results
...and 13 more sections

Figures (42)

Figure 1: The overlap ratios among the top- and bottom-30% features in the 27th layer of XGLM, ranked by their magnitude. The row and column labels correspond to the languages and language pairs used in few-shot monolingual (En, Fr, etc.) and translation (Fr-En, Es-En, etc.) demonstrations, respectively. Each element represents the ratio of overlapping features between the top- and bottom-30% in magnitude within each demonstration. This figure shows that specific features are active only when inputting translation demonstrations.
Figure 2: The 20 dimensions with the largest magnitudes among the 27th-layer features of XGLM when the input is $X_\mathrm{Zh-En}$.
Figure 3: The overlap ratios among the top- and bottom-30% features in the 47th layer of XGLM, ranked by magnitude.
Figure 4: Averaged overlap ratios for each quadrant. This plot quantifies the overlap between monolingual and translation demonstrations in the upper-left, upper-right, lower-left, and lower-right quadrants across different layers.
Figure 5: $\lVert \boldsymbol{X}^3_\text{Zh} \rVert_2$
...and 37 more figures
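
The overlap ratios described in the captions of Figures 1 and 3 can be illustrated with a small sketch. It ranks features by magnitude, takes the top and bottom 30%, and measures how many indices two demonstrations share. The helper names `top_bottom_indices` and `overlap_ratio`, the normalization by the size of the first set, and the toy magnitudes are assumptions made for illustration; the paper's exact definition may differ.

```python
def top_bottom_indices(magnitudes, fraction=0.3):
    """Return (top, bottom) index sets holding the given fraction of features."""
    k = max(1, round(len(magnitudes) * fraction))
    # Feature indices sorted from largest to smallest magnitude.
    order = sorted(range(len(magnitudes)),
                   key=lambda i: magnitudes[i], reverse=True)
    return set(order[:k]), set(order[-k:])

def overlap_ratio(a, b):
    """Share of indices in `a` that also appear in `b` (assumed normalization)."""
    return len(a & b) / len(a)

# Hypothetical per-feature magnitudes for a monolingual demonstration
# and a translation demonstration (10 features each).
mono = [5.0, 4.0, 3.0, 1.0, 0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
trans = [5.1, 0.1, 3.2, 1.0, 0.9, 0.8, 0.7, 0.3, 4.5, 0.2]

mono_top, mono_bot = top_bottom_indices(mono)
trans_top, trans_bot = top_bottom_indices(trans)

top_top = overlap_ratio(mono_top, trans_top)
bottom_top = overlap_ratio(mono_bot, trans_top)
```

In this toy data, feature 8 sits in the bottom 30% for the monolingual input but in the top 30% for the translation input, which is the kind of pattern the captions describe as features that are active only for translation demonstrations.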
