Pruning Multilingual Large Language Models for Multilingual Inference
Hwichan Kim, Jun Suzuki, Tosho Hirasawa, Mamoru Komachi
TL;DR
The paper addresses the persistent gap between English and non-English performance in multilingual LLMs. It shows that large-magnitude hidden features, activated when a model processes few-shot translation demonstrations, underpin alignment between languages. Building on this, it proposes a Wanda-style pruning approach that retains the weights associated with these large-magnitude features, forcing the model to rely on translation-aligned signals in zero-shot tasks. This improves non-English inference in XGLM and mGPT, while BLOOM benefits from a refined metric that reduces noise from programming-language generation. Empirical results show improved cross-lingual transfer and RankC consistency for several models and languages, and the gains extend to additional tasks (XNLI, MARC) and a larger model (XGLM-7.5B). The findings point to practical ways of enhancing multilingual capabilities without fine-tuning. The authors acknowledge limitations in hyperparameter exploration and architectural analysis, as well as concerns about bias in multilingual alignment.
Abstract
Multilingual large language models (MLLMs), trained on multilingually balanced data, demonstrate better zero-shot learning performance in non-English languages than large language models trained on English-dominant data. However, the disparity in performance between English and non-English languages remains a challenge yet to be fully addressed. A distinctive characteristic of MLLMs is their high-quality translation capability, which indicates an acquired proficiency in aligning languages. This study explores how to enhance the zero-shot performance of MLLMs in non-English languages by leveraging their capability to align English with non-English languages. To achieve this, we first analyze the behavior of MLLMs when performing translation and reveal that large-magnitude features play a critical role in the translation process. Inspired by these findings, we retain the weights associated with operations involving the large-magnitude features and prune the other weights, forcing MLLMs to rely on these features for tasks beyond translation. We empirically demonstrate that this pruning strategy can enhance the MLLMs' performance in non-English languages.
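To make the pruning idea concrete, the following is a minimal sketch of a Wanda-style criterion, not the authors' implementation. It assumes the standard Wanda score (absolute weight times the L2 norm of the corresponding input feature over calibration tokens) and prunes the lowest-scoring weights within each output row. The function name `wanda_style_prune`, the toy matrices, and the 50% sparsity are illustrative assumptions; the paper's exact metric, calibration data (translation demonstrations), and sparsity settings may differ.

```python
import numpy as np

def wanda_style_prune(weight, activations, sparsity=0.5):
    """Prune a linear layer's weights with a Wanda-style score (illustrative sketch).

    weight:      (out_features, in_features) weight matrix
    activations: (n_tokens, in_features) calibration inputs to the layer
    sparsity:    fraction of weights to remove in each output row
    """
    # L2 norm of each input feature across all calibration tokens.
    feature_norm = np.linalg.norm(activations, axis=0)   # (in_features,)
    # Score = |W_ij| * ||X_j||_2, broadcast across output rows.
    score = np.abs(weight) * feature_norm
    # Number of weights to drop per output row.
    n_prune = int(weight.shape[1] * sparsity)
    # Column indices of the lowest-scoring weights in each row.
    prune_idx = np.argsort(score, axis=1)[:, :n_prune]
    mask = np.ones(weight.shape, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)
    return weight * mask, mask

# Toy example (hypothetical values): input feature 2 has a large magnitude.
weight = np.array([[1.0, 1.0, 1.0, 1.0],
                   [0.5, 2.0, 0.1, 1.0]])
activations = np.array([[0.1, 0.2, 10.0, 0.1],
                        [0.1, 0.2, 10.0, 0.3]])
pruned, mask = wanda_style_prune(weight, activations, sparsity=0.5)
print(mask)
```

In the second row of this toy example, the weight 0.1 attached to the large-magnitude feature is kept while the larger weight 0.5 attached to a small-magnitude feature is removed, which is the behavior that distinguishes an activation-aware criterion from plain magnitude pruning.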
