MAPLE: Multilingual Evaluation of Parameter Efficient Finetuning of Large Language Models

Divyanshu Aggarwal, Ashutosh Sathe, Ishaan Watts, Sunayana Sitaram

TL;DR

The paper investigates Parameter Efficient Finetuning (PEFT) for multilingual LLMs by applying LoRA-based finetuning to LLaMA-2-7B and Mistral-7B on two synthetic multilingual instruction datasets, MultiAlpaca and Bactrian-X. It systematically analyzes the impact of LoRA rank and quantisation (4/8/16-bit) on six downstream tasks spanning 40 languages, focusing on cross-lingual transfer and English retention. Key findings are that higher ranks and certain quantisation levels tend to boost low-resource language performance, that smaller open-source models with PEFT can close the gap to larger proprietary models on some tasks, and that English performance can degrade under multilingual finetuning in some settings. The work also compares generating multilingual instruction data against translating it and concludes that the multilingual quality of the base model often matters more than how the instruction data was created, with Mistral-7B showing strong cross-lingual capabilities and performance competitive with GPT-4 on some benchmarks. The study highlights practical implications for deploying multilingual PEFT under compute constraints and outlines future directions: covering more PEFT techniques, mitigating the challenges of multilinguality, and developing richer multilingual instruction datasets.

Abstract

Parameter Efficient Finetuning (PEFT) has emerged as a viable solution for improving the performance of Large Language Models (LLMs) without requiring massive resources and compute. Prior work on multilingual evaluation has shown that there is a large gap between the performance of LLMs on English and other languages. Further, there is also a large gap between the performance of smaller open-source models and larger LLMs. Finetuning can be an effective way to bridge this gap and make language models more equitable. In this work, we finetune the LLaMA-2-7B and Mistral-7B models on two synthetic multilingual instruction tuning datasets to determine their effect on model performance on six downstream tasks covering forty languages in all. Additionally, we experiment with various parameters, such as rank for low-rank adaptation and values of quantisation, to determine their effects on downstream performance, and find that higher rank and higher quantisation values benefit low-resource languages. We find that PEFT of smaller open-source models sometimes bridges the gap between the performance of these models and the larger ones; however, English performance can take a hit. We also find that finetuning sometimes improves performance on low-resource languages while degrading performance on high-resource languages.
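
The two knobs studied here, LoRA rank and quantisation bit-width, trade trainable capacity against memory. The following is a minimal arithmetic sketch of that trade-off, not the authors' setup: the hidden size of 4096, the list of ranks, and the round figure of 7 billion parameters are illustrative assumptions, and the memory estimate ignores quantisation overhead such as scaling constants.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_out x d_in weight matrix.

    LoRA learns an update B @ A with B of shape (d_out, rank) and
    A of shape (rank, d_in), so it adds rank * (d_out + d_in) parameters.
    """
    return rank * (d_out + d_in)


def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate storage (in GB) for frozen base weights at a given bit-width."""
    return n_params * bits / 8 / 1e9


# Assumed values for illustration only; they are not taken from the paper.
HIDDEN = 4096           # hypothetical width of one square projection matrix
N_BASE_PARAMS = 7e9     # round figure for a "7B" model
RANKS = [8, 16, 32, 64]
BIT_WIDTHS = [4, 8, 16]

if __name__ == "__main__":
    full = HIDDEN * HIDDEN
    for r in RANKS:
        added = lora_trainable_params(HIDDEN, HIDDEN, r)
        print(f"rank {r:>3}: {added:>9,} trainable params "
              f"({100 * added / full:.2f}% of the full matrix)")
    for b in BIT_WIDTHS:
        print(f"{b:>2}-bit base weights: ~{weight_memory_gb(N_BASE_PARAMS, b):.1f} GB")
```

Doubling the rank doubles the adapter size, while halving the bit-width roughly halves the memory needed for the frozen base weights. That is why the two settings can be varied independently under a fixed compute budget.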

Paper Structure (51 sections, 19 figures, 59 tables)

Figures (19)

  • Figure 1: Comparison of the best parameter-efficient instruction-finetuned models with other off-the-shelf LLMs. Notably, the best instruction-finetuned Mistral model outperforms even GPT-4 and PaLM2 on the "MLQA" and "XQUAD" tasks.
  • Figure 2: Average model performance of LLaMA-2-7B and Mistral-7B finetuned on Bactrian-X-22, Bactrian-X-11 and MultiAlpaca across tasks on all rank-quantisation configurations.
  • Figure 3: Belebele evaluation results of LLaMA-2-7B and Mistral-7B finetuned on Bactrian-X-22, Bactrian-X-11 and MultiAlpaca across tasks on all rank-quantisation configurations.
  • Figure 4: Effect of the diversity of languages in finetuning on a downstream task (Belebele). Here Group 1 is the set of 11 languages from MultiAlpaca, Group 2 is the set of 11 languages in Bactrian-X-22 but not in MultiAlpaca, and Group 3 contains 13 languages present in neither. We find that both models, trained on either dataset, perform very similarly to each other across all 3 groups. Additional details are in Tables \ref{tab:results_belebele_llama_alpaca} to \ref{tab:results_belebele_mistral_bactrian}.
  • Figure 5: Detailed language-wise comparison of our finetuned Mistral-7B and LLaMA-2-7B models with other baselines (Ahuja et al., 2023) on Arabic, German, Greek, English, Spanish, Hindi, Romanian, Russian, Thai, Turkish and Vietnamese for XQUAD (Artetxe et al., 2020).
  • ...and 14 more figures