How Does Quantization Affect Multilingual LLMs?
Kelly Marchisio, Saurabh Dash, Hongyu Chen, Dennis Aumiller, Ahmet Üstün, Sara Hooker, Sebastian Ruder
TL;DR
This work investigates how post-training quantization affects multilingual LLMs across model scales and languages, and finds that automatic metrics understate the degradation observed by human evaluators. The study evaluates two multilingual model families (Command and Aya) under multiple quantization schemes, using automatic benchmarks, LLM-/RM-as-a-Judge, and human judgments. Non-Latin-script languages and challenging tasks such as mathematical reasoning are disproportionately affected. Quantization effects vary across languages, tasks, and model sizes, with larger degradation in longer-tail languages and on more difficult prompts, although occasional performance gains appear under certain configurations, such as W8A8 or smoothing approaches. The findings argue for treating multilingual performance as a central criterion when evaluating and deploying efficient models, and for broader language coverage and more robust evaluation frameworks, so that NLP technologies can be deployed equitably worldwide.
Abstract
Quantization techniques are widely used to improve inference speed and deployment of large language models. While a wide body of work examines the impact of quantization on LLMs in English, none has evaluated its effects across languages. We conduct a thorough analysis of quantized multilingual LLMs, focusing on performance across languages and at varying scales. We use automatic benchmarks, LLM-as-a-Judge, and human evaluation, finding that (1) harmful effects of quantization are apparent in human evaluation, which automatic metrics severely underestimate: a 1.7% average drop in Japanese across automatic tasks corresponds to a 16.0% drop reported by human evaluators on realistic prompts; (2) languages are disparately affected by quantization, with non-Latin script languages affected most; and (3) challenging tasks like mathematical reasoning degrade fastest. As the ability to serve low-compute models is critical for wide global adoption of NLP technologies, our results urge consideration of multilingual performance as a key evaluation criterion for efficient models.
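For readers new to the topic, the sketch below shows what quantizing weights to 8 bits can look like and how a relative performance change might be computed. It is a generic illustration and not the paper's method: symmetric per-tensor absmax rounding is only one possible scheme, the function names (quantize_int8, dequantize, relative_change_pct) are hypothetical, and the weights and scores are made-up values chosen for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor absmax quantization to signed 8-bit integers.

    Assumption: one scale for the whole tensor; real schemes often use
    per-channel or per-group scales instead.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]


def relative_change_pct(baseline, quantized):
    """Relative change of a quantized score versus the baseline, in percent."""
    return 100.0 * (quantized - baseline) / baseline


# Hypothetical weights, for illustration only.
weights = [0.02, -0.75, 0.31, 1.27, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("max reconstruction error:", max_err, "step size:", scale)

# Hypothetical benchmark scores (not taken from the paper).
print("relative change (%):", relative_change_pct(50.0, 49.0))
```

Small weights such as -0.004 collapse to zero under this scheme, which gives some intuition for how quantization can lose information even when aggregate scores move only slightly.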
