Investigating Recent Large Language Models for Vietnamese Machine Reading Comprehension

Anh Duc Nguyen, Hieu Minh Phi, Anh Viet Ngo, Long Hai Trieu, Thai Phuong Nguyen

TL;DR

This work addresses Vietnamese machine reading comprehension (MRC) by fine-tuning two recent LLMs, Llama 3 (8B) and Gemma (7B), on the ViMMRC dataset with Quantized Low-Rank Adaptation (QLoRA). Although the fine-tuned models are much smaller than GPT-3 and GPT-3.5, they outperform both those models and traditional BERT-based baselines while remaining suitable for resource-constrained deployment. The fine-tuned models are publicly released on HuggingFace. The paper also analyzes baselines, bias mitigation, and error modes, offering insights into adapting LLMs to low-resource languages. Overall, it contributes to Vietnamese NLP by showing that targeted fine-tuning of compact LLMs can outperform larger general-purpose models on this task, and by providing ready-to-use models for downstream educational and information-retrieval tasks.

Abstract

Large Language Models (LLMs) have shown remarkable proficiency in Machine Reading Comprehension (MRC) tasks; however, their effectiveness for low-resource languages like Vietnamese remains largely unexplored. In this paper, we fine-tune and evaluate two state-of-the-art LLMs: Llama 3 (8B parameters) and Gemma (7B parameters), on ViMMRC, a Vietnamese MRC dataset. By utilizing Quantized Low-Rank Adaptation (QLoRA), we efficiently fine-tune these models and compare their performance against powerful LLM-based baselines. Although our fine-tuned models are smaller than GPT-3 and GPT-3.5, they outperform both traditional BERT-based approaches and these larger models. This demonstrates the effectiveness of our fine-tuning process, showcasing how modern LLMs can surpass the capabilities of older models like BERT while still being suitable for deployment in resource-constrained environments. Through intensive analyses, we explore various aspects of model performance, providing valuable insights into adapting LLMs for low-resource languages like Vietnamese. Our study contributes to the advancement of natural language processing in low-resource languages, and we make our fine-tuned models publicly available at: https://huggingface.co/iaiuet.
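
To give a rough sense of why QLoRA suits resource-constrained settings, the sketch below works through the arithmetic for low-rank adapters and 4-bit weight storage. It is illustrative only: the layer size (4096 x 4096), the rank (16), and the round 8-billion parameter count are assumptions chosen for the example, not settings reported in the paper. The memory estimate ignores quantization constants, activations, and optimizer state.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter pair.

    A frozen weight W of shape (d_out, d_in) is augmented with B @ A,
    where A is (rank, d_in) and B is (d_out, rank).
    """
    return rank * (d_in + d_out)


def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Hypothetical values for illustration (not taken from the paper).
D_MODEL = 4096        # assumed width of one square projection matrix
RANK = 16             # assumed LoRA rank
N_PARAMS = 8_000_000_000  # round figure for an "8B" model

adapter = lora_trainable_params(D_MODEL, D_MODEL, RANK)
full = D_MODEL * D_MODEL

print(f"adapter params per layer: {adapter}")
print(f"full params per layer:    {full}")
print(f"trainable fraction:       {adapter / full:.4%}")
print(f"16-bit weights: {weight_memory_gb(N_PARAMS, 16):.1f} GB")
print(f"4-bit weights:  {weight_memory_gb(N_PARAMS, 4):.1f} GB")
```

Under these assumptions, each adapted matrix trains under 1% of the parameters of its frozen counterpart, and storing the base weights in 4 bits takes about a quarter of the space needed at 16 bits.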

Paper Structure

This paper contains 16 sections, 1 figure, 7 tables.

Figures (1)

  • Figure 1: An example of an MRC task (English translation is in italics).