Fine Tuning Methods for Low-resource Languages

Tim Bakkenes, Daniel Wang, Anton Johansson

TL;DR

This work tackles the underrepresentation of non-English languages in large language models by fine-tuning Google's Gemma 2 for Swedish using a hybrid approach that combines LoRA-based parameter-efficient fine-tuning with Retrieval-Augmented Generation (RAG). It builds two Swedish-focused datasets (a fine-tuning set and a RAG knowledge corpus) and evaluates the model on QA, summarization, and translation tasks using EM, F1, ROUGE, BLEU, METEOR, BERTScore, and COMET. The results show improvements on several tasks, especially when leveraging RAG and pretrained Swedish embeddings, while identifying overfitting and dataset limitations as key challenges. The study provides a practical blueprint for communities seeking to adapt LLMs to local languages, supporting cultural preservation and inclusive AI deployment, and discusses the trade-offs between compute costs, data quality, and model size. It also outlines future work, including larger curated datasets, reinforcement learning from human feedback, and more rigorous cross-language evaluations, to improve generalization and trust in multilingual AI systems.

Abstract

The rise of Large Language Models has not been inclusive of all cultures. The models are trained mostly on English texts and culture, which makes them underperform in other languages and cultural contexts. By developing a generalizable method for preparing culturally relevant datasets and post-training the Gemma 2 model, this project aimed to improve the performance of Gemma 2 for an underrepresented language and to show how others can do the same to unlock the power of generative AI in their own countries and preserve their cultural heritage.

Paper Structure

This paper contains 42 sections, 6 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Method Matrix [MLSpring2024]
  • Figure 2: LoRA configuration.
  • Figure 3: Comparison of model summaries before and after fine-tuning.
  • Figure 4: AdamW Algorithm [LoRA2302.06675]
  • Figure 5: Learning Rate Schedulers Comparison [LearningRateSchedulers2024]
  • ...and 5 more figures