Multilingual Brain Surgeon: Large Language Models Can be Compressed Leaving No Language Behind

Hongchuan Zeng; Hongshen Xu; Lu Chen; Kai Yu

TL;DR

Multilingual Brain Surgeon (MBS) addresses the English-centric bias in calibration-based compression of multilingual LLMs by sampling calibration data from each language in proportion to its share of the training data. By decomposing the Hessian as $\mathbf{H}=\sum_n \mathbf{H}_n$ and weighting calibration data via $p_n/p$, MBS aligns parameter priority with language representation, improving pruning and quantization results on BLOOM, especially for low-resource languages. The study shows that a language's proportion in the training set and its similarity to the calibration language govern how well it is retained after compression, and it introduces a cosine-similarity-based measure to predict performance drops. In practice, MBS makes multilingual NLP deployments more inclusive by reducing disparities between languages without requiring multilingual fine-tuning.
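The decomposition $\mathbf{H}=\sum_n \mathbf{H}_n$ can be checked numerically. The sketch below assumes the layer-wise form $\mathbf{H}=\mathbf{X}\mathbf{X}^\top$ commonly used by calibration-based methods (constant factors and damping omitted); that form, the function name `layer_hessian`, the language codes, and the token counts are illustrative assumptions, not details taken from this summary.

```python
import numpy as np

def layer_hessian(X):
    """Assumed layer-wise Hessian proxy H = X X^T for inputs X of shape (d_in, n_tokens)."""
    return X @ X.T

rng = np.random.default_rng(0)
d_in = 8

# Hypothetical number of calibration tokens drawn from each language.
tokens_per_lang = {"en": 60, "fr": 30, "ig": 10}

# Random stand-ins for the layer inputs produced by each language's calibration text.
X_by_lang = {lang: rng.standard_normal((d_in, n)) for lang, n in tokens_per_lang.items()}

# Per-language Hessians H_n and their sum.
H_parts = {lang: layer_hessian(X) for lang, X in X_by_lang.items()}
H_sum = sum(H_parts.values())

# Hessian of the pooled calibration set (all languages concatenated along the token axis).
H_full = layer_hessian(np.concatenate(list(X_by_lang.values()), axis=1))

# Rough indicator of how much each language contributes to the pooled Hessian.
share = {lang: float(np.trace(H) / np.trace(H_sum)) for lang, H in H_parts.items()}

print(np.allclose(H_sum, H_full))
print(share)
```

Because the pooled Hessian is a plain sum over languages, the amount of calibration data drawn from each language directly controls how strongly that language influences which parameters are preserved.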

Abstract

Large Language Models (LLMs) have ushered in a new era in Natural Language Processing, but their massive size demands effective compression techniques for practicality. Although numerous model compression techniques have been investigated, they typically rely on a calibration set that overlooks the multilingual context, which results in significant accuracy degradation for low-resource languages. This paper introduces Multilingual Brain Surgeon (MBS), a novel calibration data sampling method for multilingual LLM compression. MBS overcomes the English-centric limitations of existing methods by sampling calibration data from various languages in proportion to the language distribution of the model's training datasets. Our experiments, conducted on the BLOOM multilingual LLM, demonstrate that MBS improves the performance of existing English-centric compression methods, especially for low-resource languages. We also uncover the dynamics of language interaction during compression: the larger a language's proportion in the training set and the more similar it is to the calibration language, the better its performance is retained after compression. In conclusion, MBS presents an innovative approach to compressing multilingual LLMs, addressing the performance disparities and improving the language inclusivity of existing compression techniques.
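Proportional sampling of calibration data, as described in the abstract, might be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the largest-remainder rounding, the language weights, and the toy corpora are all assumptions made for the example.

```python
import math
import random

def allocate_calibration_samples(lang_weights, total_samples):
    """Split total_samples across languages in proportion to lang_weights.

    Uses largest-remainder rounding (an assumption) so the counts sum exactly
    to total_samples.
    """
    total_weight = sum(lang_weights.values())
    raw = {lang: total_samples * w / total_weight for lang, w in lang_weights.items()}
    counts = {lang: math.floor(r) for lang, r in raw.items()}
    leftover = total_samples - sum(counts.values())
    # Hand the remaining samples to the languages with the largest fractional parts.
    by_fraction = sorted(raw, key=lambda lang: raw[lang] - counts[lang], reverse=True)
    for lang in by_fraction[:leftover]:
        counts[lang] += 1
    return counts

def sample_calibration_set(corpora, counts, seed=0):
    """Draw counts[lang] segments without replacement from each language's pool."""
    rng = random.Random(seed)
    calibration = []
    for lang, n in counts.items():
        calibration.extend((lang, seg) for seg in rng.sample(corpora[lang], n))
    return calibration

# Hypothetical training-data weights (for example, relative corpus sizes).
lang_weights = {"en": 600, "fr": 300, "ig": 100}
counts = allocate_calibration_samples(lang_weights, total_samples=128)

# Toy corpora standing in for real text segments.
corpora = {lang: [f"{lang}-segment-{i}" for i in range(200)] for lang in lang_weights}
calibration = sample_calibration_set(corpora, counts)

print(counts)
print(len(calibration))
```

One design point to watch: with pure proportional rounding, a language with a very small weight can receive zero segments, so a practical variant may want a per-language minimum. Whether MBS applies such a minimum is not stated in this summary.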

Paper Structure (20 sections, 11 equations, 8 figures, 13 tables)

Introduction
Background
Optimal Brain Surgeon (OBS)
Error Measurement
SparseGPT, Wanda and GPTQ
Is Monolingual Calibrating Applicable to Multilingual MC?
Proportion in training data
Similarity between languages
Multilingual Brain Surgeon (MBS)
Experiments
Experimental Setup
Main results
Monolingual Compression Study
Factor 1: Proportion in training data
Factor 2: Similarity between languages
...and 5 more sections

Figures (8)

Figure 1: MBS samples calibration data from different languages proportionally to the language distribution of training datasets. This approach (right part) effectively addresses the multilingual compression problem compared to previous monolingual sampling methods (left part).
Figure 2: Languages with larger corpora have their minimum error closer to the minimum of $E$. Monolingual compression effectively "pushes" the model's state towards the minimum error of that particular language.
Figure 3: The angle between language 2 and language 3 is smaller than that between language 1 and language 2. After element-wise multiplication, languages 2 and 3 are more likely to prioritize the same parameter $w_1$ because their angle before multiplication is smaller.
Figure 4: Perplexity for each language and their respective increases when compared to the dense BLOOM-7b1 model after pruning (left) or quantization (right). From left to right, languages are ranked in order from the most well-represented to the least represented.
Figure 5: Monolingual pruning results using Wanda with calibration data in English or Igbo. The size of each bubble corresponds to the magnitude of the increase in perplexity for the model in that particular language, while the vertical axis represents the size of training data in log(bytes) from the language in the training set of BLOOM. The languages with a smaller proportion in the training set experience a greater increase in perplexity.
...and 3 more figures
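The angle comparison in the Figure 3 caption can be illustrated with cosine similarity. The two-dimensional vectors below are made-up values chosen only to mirror the caption's geometry (languages 2 and 3 close together, language 1 further away); how the paper actually builds per-language vectors is not specified in this summary.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical per-language vectors over two parameters (w_1, w_2).
lang1 = np.array([0.2, 1.0])
lang2 = np.array([1.0, 0.5])
lang3 = np.array([1.0, 0.2])

sim_12 = cosine_similarity(lang1, lang2)
sim_23 = cosine_similarity(lang2, lang3)

# A higher cosine similarity means a smaller angle between the two languages.
print(f"sim(lang1, lang2) = {sim_12:.3f}")
print(f"sim(lang2, lang3) = {sim_23:.3f}")
```

In this toy setup languages 2 and 3 have the higher similarity, which matches the caption's claim that they sit at a smaller angle than languages 1 and 2.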
