
English K_Quantization of LLMs Does Not Disproportionately Diminish Multilingual Performance

Karl Audun Borgersen, Morten Goodwin

TL;DR

The paper investigates whether the English-centric importance matrices used in k_quantization of LLMs disproportionately diminish multilingual performance. Using Llama3.3 70B and MixEval in English and Norwegian, with importance matrices written in English, Norwegian, and Malayalam, the authors find no statistically significant difference in English or Norwegian performance across the three matrices, and English-based quantization often performs best. The study finds that translating the importance matrix does not yield robust multilingual gains, and it reports limitations related to translation biases and the narrow range of models explored. Overall, the work suggests that GGUF/k_quantization can preserve the multilingual capabilities of large open models without disproportionate costs to non-English performance, which supports practical deployment.

Abstract

For consumer usage of locally deployed LLMs, the GGUF format and k_quantization are invaluable tools for maintaining the performance of the original model while reducing it to sizes deployable on consumer-grade hardware. The number of bits dedicated to each weight of the original model is reduced based on how important that weight is thought to be during model inference. This importance is arrived at through the application of an 'importance matrix', a relatively small text document meant to be representative of the LLM's standard use cases. In the vast majority of quants available online, this document is primarily written in English. It was therefore an open question whether performance on English-language tasks was preserved at the expense of multilingual performance, and whether multilingual performance can be preserved with alternate importance matrices. This article investigates these hypotheses by quantizing Llama3.3 70B on importance matrices written in three languages (English, Norwegian, and Malayalam) and evaluating the resulting quants on the MixEval dataset in both English and Norwegian. All experiments related to k_quantization yielded non-significant results, indicating that current quantization practices do not disproportionately harm multilingual performance.

Paper Structure

This paper contains 9 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: A comparison of the first few sentences produced by an older and a newer iteration of the Llama LLM when prompted to answer in Norwegian. The Norwegian writing capabilities of open models have improved dramatically over the last few years. Note that the differences presented here are exaggerated due to the selected examples, and even more so because of the difference in model size. While the newer generations of Llama are significantly better at Norwegian, this is generally only evident after a longer conversation.
  • Figure 2: The hypothesized multilingual workflow of Zhao et al. In their view, LLMs convert multilingual queries to English, reason in English, and then convert the result back into the original language. The figure has been recreated from one in the original publication of Zhao et al.
  • Figure 3: One of the example questions from MixEval's five-shot template. Stripping the question of both its original language and its cultural context renders it incomprehensible. "ku" here refers to the Kansas Jayhawks, the basketball team of the University of Kansas. A Llama3.3 70B instance is able to answer the original question without issue but fails when the question is translated to Norwegian, due to the lack of cultural context.
  • Figure 4: Comparison of the percentage of correct answers for three different language quants on both the English and Norwegian datasets for MixEval multiple-choice questions (MCQ). The total number of correct answers out of 2000 is displayed above each bar. The p-value for each result is displayed in white horizontally along the bar. Though quantized using a Norwegian importance matrix, the Norwegian quant shows no statistically significant improvement on Norwegian MCQ when compared to the original English quantization, nor does the unrelated language of Malayalam significantly reduce performance in either language.
  • Figure 5: Comparison of the performance percentage for three different language quants on both the English and Norwegian datasets of MixEval free-form questions. Total points out of a possible 2000 are displayed above each bar. The p-value for each result is displayed in white horizontally along the bar. To improve legibility, the y-axis has been truncated to the range from 0.7 to 0.9. Though quantized using a Norwegian importance matrix, the Norwegian quant shows no statistically significant improvement on Norwegian free-form questions when compared to the original English quantization, nor does the unrelated language of Malayalam significantly reduce performance in either language.
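
As a rough illustration of how a p-value like those in Figure 4 could be obtained from counts of correct answers out of 2000, the following is a minimal sketch of a two-sided two-proportion z-test. This summary does not state which test the authors used, so the choice of test is an assumption, as are the function name `two_proportion_z_test` and the example counts, which are hypothetical and not taken from the paper. The sketch also treats the two runs as independent samples, which is a simplification, since both quants answer the same questions.

```python
import math


def two_proportion_z_test(correct_a, correct_b, n_a, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). Assumes independent samples (a simplification).
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled proportion under the null hypothesis of equal accuracy.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both runs are all-correct or all-wrong: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts, NOT results from the paper.
    z, p = two_proportion_z_test(1600, 1620, 2000, 2000)
    print(f"z = {z:.3f}, p = {p:.3f}")
```

A p-value above the chosen significance level (commonly 0.05) would mean the observed gap between two quants is consistent with chance. This count-based test fits the multiple-choice setting; the free-form scores in Figure 5 are point totals rather than correct/incorrect counts and would need a different test.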