How Does Quantization Affect Multilingual LLMs?

Kelly Marchisio, Saurabh Dash, Hongyu Chen, Dennis Aumiller, Ahmet Üstün, Sara Hooker, Sebastian Ruder

TL;DR

This work investigates how post-training quantization impacts multilingual LLMs across model scales and languages, revealing that automatic metrics fail to capture the true degradation observed by human evaluators. By evaluating two multilingual model families (Command and Aya) with multiple quantization schemes and testing on automatic benchmarks, LLM-/RM-as-a-Judge, and human judgments, the study shows that non-Latin scripts and challenging tasks such as mathematical reasoning are disproportionately affected. The authors demonstrate that quantization effects vary across languages, tasks, and model sizes, with larger degradation in longer-tail languages and in more difficult prompts, though occasional performance gains are observed under certain configurations like W8A8 or smoothing approaches. These findings argue for including multilingual performance as a central criterion in efficient-model evaluation and deployment, emphasizing the need for broader language coverage and robust evaluation frameworks to ensure equitable NLP deployment worldwide.

Abstract

Quantization techniques are widely used to improve inference speed and deployment of large language models. While a wide body of work examines the impact of quantization on LLMs in English, none have evaluated across languages. We conduct a thorough analysis of quantized multilingual LLMs, focusing on performance across languages and at varying scales. We use automatic benchmarks, LLM-as-a-Judge, and human evaluation, finding that (1) harmful effects of quantization are apparent in human evaluation, which automatic metrics severely underestimate: a 1.7% average drop in Japanese across automatic tasks corresponds to a 16.0% drop reported by human evaluators on realistic prompts; (2) languages are disparately affected by quantization, with non-Latin script languages impacted worst; and (3) challenging tasks like mathematical reasoning degrade fastest. As the ability to serve low-compute models is critical for wide global adoption of NLP technologies, our results urge consideration of multilingual performance as a key evaluation criterion for efficient models.

Paper Structure

This paper contains 40 sections, 4 equations, 2 figures, 30 tables.

Figures (2)

  • Figure 1: Automatic metrics severely underestimate damage from quantization. Shown: 103B W4 quantized Command model with group-wise scaling vs. FP16. Avg: mMMLU, FLORES, Language Confusion (LC). English avg: mMMLU, MGSM, monolingual LC.
  • Figure 2: Data size in mC4 (Xue et al., 2021) vs. average performance under quantization. Results from Table \ref{tab:avgs-no-mgsm}, Command 103B.
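
The Figure 1 caption refers to W4 quantization with group-wise scaling. As a rough illustration of that idea only, the sketch below quantizes a flat list of weights to signed 4-bit integer codes, storing one scale per group. It assumes a symmetric scheme with codes in [-7, 7] and a toy group size of 4; the function names, group size, and weight values are hypothetical and are not taken from the paper, whose exact quantization recipe is not described in this section.

```python
from typing import List, Tuple


def quantize_groupwise_w4(
    weights: List[float], group_size: int = 4
) -> Tuple[List[int], List[float]]:
    """Symmetric 4-bit quantization with one scale per group of weights.

    Illustrative assumption: codes are restricted to [-7, 7].
    """
    qmax = 7
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Guard against an all-zero group to avoid division by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(-qmax, min(qmax, q)))
    return codes, scales


def dequantize_groupwise(
    codes: List[int], scales: List[float], group_size: int = 4
) -> List[float]:
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Hypothetical weights: one group of small values, one group of large values.
weights = [0.02, -0.01, 0.03, 0.00, 1.50, -2.10, 0.70, 0.40]

codes, scales = quantize_groupwise_w4(weights, group_size=4)
restored = dequantize_groupwise(codes, scales, group_size=4)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# For comparison: a single scale shared by all eight weights.
codes_1, scales_1 = quantize_groupwise_w4(weights, group_size=len(weights))
restored_1 = dequantize_groupwise(codes_1, scales_1, group_size=len(weights))
```

With a single shared scale, the large weights in the second half set the step size, so the small weights in the first half all round to zero. With one scale per group, the small weights keep a step size matched to their own range, which is the motivation for group-wise scaling at low bit widths.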