MAPLE: Multilingual Evaluation of Parameter Efficient Finetuning of Large Language Models
Divyanshu Aggarwal, Ashutosh Sathe, Ishaan Watts, Sunayana Sitaram
TL;DR
The paper investigates Parameter Efficient Finetuning (PEFT) for multilingual LLMs by applying LoRA-based finetuning to LLaMA-2-7B and Mistral-7B using the synthetic multilingual instruction datasets MultiAlpaca and Bactrian-X. It systematically analyzes the impact of LoRA rank and quantisation (4/8/16-bit) on six downstream tasks spanning 40 languages, focusing on cross-lingual transfer and English retention. Key findings show that higher ranks and certain quantisation levels tend to boost low-resource language performance, that smaller open-source models with PEFT can bridge the gap to larger proprietary models on some tasks, and that English performance can degrade under multilingual finetuning in some settings. The work also compares instruction data created by multilingual generation with data created by translation, and concludes that the multilingual quality of the base model often outweighs the method of instruction data creation, with Mistral-7B showing strong cross-lingual capabilities and competitive performance relative to GPT-4 on some benchmarks. The study highlights practical implications for deploying multilingual PEFT under compute constraints and outlines future directions: expanding the set of PEFT techniques, mitigating multilinguality challenges, and developing richer multilingual instruction datasets.
Abstract
Parameter Efficient Finetuning (PEFT) has emerged as a viable solution for improving the performance of Large Language Models (LLMs) without requiring massive resources and compute. Prior work on multilingual evaluation has shown that there is a large gap between the performance of LLMs on English and on other languages. There is also a large gap between the performance of smaller open-source models and larger LLMs. Finetuning can be an effective way to bridge these gaps and make language models more equitable. In this work, we finetune the LLaMA-2-7B and Mistral-7B models on two synthetic multilingual instruction tuning datasets to determine the effect of finetuning on model performance across six downstream tasks covering forty languages in all. Additionally, we experiment with various parameters, such as the rank for low-rank adaptation and the quantisation value, to determine their effects on downstream performance, and find that higher rank and higher quantisation values benefit low-resource languages. We find that PEFT of smaller open-source models sometimes bridges the gap between the performance of these models and the larger ones; however, English performance can take a hit. We also find that finetuning sometimes improves performance on low-resource languages while degrading performance on high-resource languages.
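To make the role of the rank parameter concrete, the following is a minimal sketch of how the number of trainable parameters in a low-rank adapter grows with rank. It assumes the standard LoRA formulation, in which a frozen weight matrix of shape d_out x d_in is augmented by the product of two small trainable factors of shapes d_out x r and r x d_in. The function names, the 4096 x 4096 projection size, and the list of ranks are illustrative assumptions and are not taken from the experiments described here.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Count parameters in the two LoRA factors for one weight matrix.

    Assumes factor A has shape (rank, d_in) and factor B has shape
    (d_out, rank), as in the standard LoRA formulation.
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Ratio of adapter parameters to the frozen matrix's parameters."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    # Hypothetical square projection; the size is an assumption for illustration.
    d_in = d_out = 4096
    # Illustrative ranks only, not necessarily those used in the paper.
    for rank in (8, 16, 32, 64):
        n = lora_trainable_params(d_in, d_out, rank)
        frac = trainable_fraction(d_in, d_out, rank)
        print(f"rank={rank:3d}  adapter params={n:8d}  fraction={frac:.4%}")
```

Under these assumptions the adapter size grows linearly with rank, so doubling the rank doubles the trainable parameters while the adapter remains a small fraction of the frozen matrix. Quantisation is orthogonal to this count: it changes the precision at which the frozen base weights are stored, not the number of adapter parameters.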
