M-Wanda: Improving One-Shot Pruning for Multilingual LLMs

Rochelle Choenni, Ivan Titov

TL;DR

This work addresses the multilingual degradation observed when applying one-shot pruning to multilingual LLMs, showing that methods like Wanda disproportionately harm non-English languages at practical sparsity levels. It introduces M-Wanda, a multilingual extension that integrates language-aware activation statistics, cross-lingual correlation-based layerwise sparsity (CWL), and an activation-probability term to better preserve both shared and language-specific neurons. Empirical results show that M-Wanda reduces perplexity across 15 languages, improves performance on six downstream tasks, and generalizes to unseen languages, while remaining effective across sparsity levels and compatible with other pruning frameworks such as RIA. By explicitly targeting cross-lingual variance and activation patterns, this approach provides a practical path toward more equitable, efficient multilingual LLMs and motivates broader multilingual evaluation in pruning research.

Abstract

Multilingual LLM performance is often critically dependent on model size. With an eye on efficiency, this has led to a surge of interest in one-shot pruning methods that retain the benefits of large-scale pretraining while shrinking the model size. However, as pruning tends to come with performance loss, it is important to understand the trade-offs between multilinguality and sparsification. In this work, we study multilingual performance under different sparsity constraints and show that even moderate sparsity ratios already substantially harm performance. To help bridge this gap, we propose M-Wanda, a pruning method that models cross-lingual variation by incorporating language-aware activation statistics into its pruning criterion and dynamically adjusts layerwise sparsity based on cross-lingual importance. We show that M-Wanda consistently improves performance at minimal additional cost. We are the first to explicitly optimize pruning to retain multilingual performance, and hope to inspire future advances in multilingual pruning.
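
For orientation, the following is a minimal sketch of the Wanda baseline that M-Wanda extends, not of M-Wanda itself, whose exact criterion is not given on this page. Wanda scores each weight by its magnitude times the L2 norm of the corresponding input activation and removes the lowest-scoring weights within each output row. The function names, the toy layer shape, the synthetic "en"/"sw" calibration activations, and the choice to pool languages by concatenating tokens are all assumptions made for illustration.

```python
import numpy as np

def wanda_scores(weight, activations):
    """Wanda importance: |W_ij| * ||X_j||_2.

    weight:      (out_features, in_features)
    activations: (n_tokens, in_features) calibration inputs to the layer
    """
    feature_norms = np.linalg.norm(activations, axis=0)  # one norm per input feature
    return np.abs(weight) * feature_norms                # broadcast across output rows

def prune_rows(weight, scores, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    n_prune = int(weight.shape[1] * sparsity)
    pruned = weight.copy()
    if n_prune == 0:
        return pruned
    lowest = np.argsort(scores, axis=1)[:, :n_prune]     # per-row indices to drop
    np.put_along_axis(pruned, lowest, 0.0, axis=1)
    return pruned

# Toy data (hypothetical): one 4x8 layer and synthetic activations for two languages.
rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 8))
calib = {
    "en": rng.normal(size=(32, 8)),
    "sw": 2.0 * rng.normal(size=(32, 8)),
}

# Assumption: pool languages by concatenating their calibration tokens.
pooled = np.concatenate(list(calib.values()), axis=0)
pruned = prune_rows(weight, wanda_scores(weight, pooled), sparsity=0.5)

# Per-language activation norms; differences between them are the kind of
# cross-lingual variation a language-aware criterion could take into account.
per_lang_norms = {lang: np.linalg.norm(x, axis=0) for lang, x in calib.items()}
```

Pooling all languages into a single norm hides how much each input feature matters per language, which is the gap the language-aware statistics described above are meant to address.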

Paper Structure

This paper contains 35 sections, 9 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: The effect of Wanda pruning under different sparsity ratios on the perplexity of each calibration language. Colored areas denote the increase in perplexity when increasing the sparsity ratio. Note that the perplexity scores are on different scales across models.
  • Figure 2: Accuracy (%) under different sparsity ratios for different sizes of Llama3. Zero-shot results are averaged across test languages per downstream task.
  • Figure 3: Perplexity scores per language from Llama-8B pruned using Wanda versus M-Wanda.
  • Figure 4: Relative percentage decrease in perplexity when using M-Wanda compared to Wanda for all 15 calibration and 15 unseen test languages. Results are reported for Llama-8B.
  • Figure 5: Average perplexity scores across languages as an effect of higher sparsity ratios when applying Wanda and M-Wanda to Llama-8B.
  • ...and 4 more figures