On the Limitations of Language Targeted Pruning: Investigating the Calibration Language Impact in Multilingual LLM Pruning

Simon Kurz; Jian-Jia Chen; Lucie Flek; Zhixue Zhao

TL;DR

This paper investigates how the choice of calibration language during post-training pruning affects multilingual LLMs when targeting monolingual tasks. By comparing Wanda and SparseGPT pruning across seven languages on Llama-3 and Aya-23, it shows that calibrating on the target language minimizes perplexity-related degradation but does not reliably improve downstream task performance, and in some cases other languages yield better results. Internal analyses reveal that pruning tends to preserve language-specific features (helping language modeling metrics) while failing to retain language-agnostic reasoning and knowledge, especially in middle-to-late layers. The findings highlight fundamental limitations in current pruning methods for multilingual settings and motivate developing strategies that preserve cross-language, language-agnostic information to support robust reasoning and knowledge retrieval. The work has practical implications for deploying pruned multilingual LLMs in diverse language contexts and informs future directions in calibration-aware pruning and representation-preserving techniques.

Abstract

Recent advances in large language model (LLM) pruning have shown state-of-the-art (SotA) compression results in post-training and retraining-free settings while maintaining high predictive performance. However, previous research mainly considered calibrating based on English text, despite the multilingual nature of modern LLMs and their frequent use in non-English languages. This analysis paper conducts an in-depth investigation of the performance and internal representation changes associated with pruning multilingual language models for monolingual applications. We present the first comprehensive empirical study, comparing different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques. We further analyze the latent subspaces, pruning masks, and individual neurons within pruned models. Our results reveal that while calibration on the target language effectively retains perplexity and yields high signal-to-noise ratios, it does not consistently improve downstream task performance. Further analysis of internal representations at three different levels highlights broader limitations of current pruning approaches: While they effectively preserve dominant information like language-specific features, this is insufficient to counteract the loss of nuanced, language-agnostic features that are crucial for knowledge retention and reasoning.
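
To make the role of the calibration set concrete, the following is a minimal sketch of a Wanda-style importance score. It assumes the commonly described metric, in which each weight is scored by its magnitude times the L2 norm of the matching input feature over the calibration tokens, and the lowest-scoring weights in each output row are removed. The function name `wanda_keep_mask`, the toy weights, and the two toy "calibration sets" are hypothetical illustrations, not the authors' code or data.

```python
import math
from typing import List

def wanda_keep_mask(weights: List[List[float]],
                    calib_inputs: List[List[float]],
                    sparsity: float = 0.5) -> List[List[bool]]:
    """Return a keep-mask (True = weight kept) for a Wanda-style score.

    weights:      one row per output neuron, one column per input feature.
    calib_inputs: one row per calibration token, one column per input feature.
    """
    n_in = len(weights[0])
    # L2 norm of every input feature, accumulated over all calibration tokens.
    norms = [math.sqrt(sum(tok[j] ** 2 for tok in calib_inputs))
             for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    mask = []
    for row in weights:
        # Score = |weight| * activation norm of the matching input feature.
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        # Prune the n_prune lowest-scoring weights within this output row.
        pruned = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        mask.append([j not in pruned for j in range(n_in)])
    return mask

# One output neuron, four input features (toy values).
weights = [[0.5, -0.4, 0.3, 0.2]]
# Two hypothetical calibration sets that activate different input features.
calib_a = [[1.0, 0.0, 0.0, 3.0]]
calib_b = [[0.0, 2.0, 2.0, 0.0]]

mask_a = wanda_keep_mask(weights, calib_a)
mask_b = wanda_keep_mask(weights, calib_b)
print(mask_a)  # [[True, False, False, True]]
print(mask_b)  # [[False, True, True, False]]
```

The same weights receive different masks depending only on which input features the calibration data activates, which is the mechanism by which the calibration language can influence what a pruned model retains.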

Paper Structure (39 sections, 7 equations, 8 figures, 12 tables)

Introduction
Background
Model Pruning
Surface-Level Evaluation Metrics
Related Work
Multilingual Language Models
Calibration of Post-training Pruning
Methodology
Results
Pruning Results
Downstream Task Performance
Open Domain Question Answering without Context
Multiple Calibration Languages
Impact of Model Sizes
Quantization
...and 24 more sections

Figures (8)

Figure 1: Post-pruning magnitude difference for language-agnostic and language-specific features, averaged over 900 Belebele samples per language. The x-axis indicates the evaluation language; the calibration language is color-coded: AR, DE, EN, ES, RU, ZH. A larger $\Delta$ means a larger pruning error on the respective features. See the corresponding figures in the appendix on further language subspace results for the full plots over all layers. A star marks matching calibration and evaluation languages with the smallest post-pruning difference.
Figure 2: Pruning mask similarities (IoUs) between different calibration languages for 50% SparseGPT-pruned Llama-3 8B models. Intra-language: IoUs of pruning masks for three calibration sets of the same language. Inter-language: IoUs between pruning masks for different calibration languages. The higher the IoU (indicated by a lighter color), the more similar the pruning masks.
Figure 3: Statistics for FFN neurons of a full-sized Llama-3 8B and its 50% SparseGPT-pruned version calibrated for DE. Plots show a 95% confidence interval and highlighted mean values. All neurons are ordered by ascending LAPE score of the full-sized model, as shown by the dashed line in the panel for the full-sized model. Additionally, LAPE score and activation probabilities are correlated by removing all neurons with an activation probability in DE that is less than the average activation probability among all languages. The lower the LAPE score, the more specialized the neuron is for a particular language.
Figure 4: Language-wise mean magnitude of differences between the prompt-wise and layer-wise averaged language-agnostic features extracted with LSAR for a full-sized and a 50% SparseGPT-pruned, mC4-calibrated Llama-3 8B model. Both the LSAR projection matrix and the feature differences were computed over 900 prompts from the Belebele dataset for the six calibration/test languages. The evaluation languages are shown on the x-axis; the calibration languages are color-coded (AR, DE, EN, ES, RU, ZH). The background color indicates the magnitude of the maximum deviation. A star marks the case where using the same language for calibration and evaluation results in the smallest difference after pruning.
Figure 5: Language-wise mean magnitude of differences between the prompt-wise and layer-wise averaged language-specific features extracted with LSAR for a full-sized and a 50% SparseGPT-pruned, mC4-calibrated Llama-3 8B model. Both the LSAR projection matrix and the feature differences were computed over 900 prompts from the Belebele dataset for the six calibration/test languages. The evaluation languages are shown on the x-axis; the calibration languages are color-coded (AR, DE, EN, ES, RU, ZH). The background color indicates the magnitude of the maximum deviation. A star marks the case where using the same language for calibration and evaluation results in the smallest difference after pruning.
...and 3 more figures
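
The mask similarity reported in Figure 2 is an intersection over union (IoU). As a rough sketch of how such a value could be computed for two flattened pruning masks, the snippet below treats `True` as "weight pruned"; that convention, the helper name `mask_iou`, and the toy masks are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence

def mask_iou(mask_a: Sequence[bool], mask_b: Sequence[bool]) -> float:
    """Intersection over union of two flattened boolean pruning masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    # Positions pruned under both calibration sets.
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    # Positions pruned under at least one calibration set.
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks are treated as identical.
    return intersection / union if union else 1.0

# Toy masks for two hypothetical calibration languages (True = pruned).
mask_de = [True, True, False, False, True, False]
mask_en = [True, False, False, True, True, False]

print(mask_iou(mask_de, mask_en))  # 2 shared / 4 in the union = 0.5
print(mask_iou(mask_de, mask_de))  # identical masks give 1.0
```

An IoU of 1.0 means both calibration sets remove exactly the same weights, while lower values indicate that the choice of calibration data changes which weights are removed.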
