Investigating Recent Large Language Models for Vietnamese Machine Reading Comprehension

Anh Duc Nguyen; Hieu Minh Phi; Anh Viet Ngo; Long Hai Trieu; Thai Phuong Nguyen

TL;DR

This work tackles Vietnamese machine reading comprehension (MRC) by fine-tuning two recent LLMs, Llama 3 (8B) and Gemma (7B), on the ViMMRC dataset using Quantized Low-Rank Adaptation (QLoRA). The fine-tuned models outperform both traditional BERT-based baselines and the much larger GPT-3 and GPT-3.5, while remaining small enough for resource-constrained settings; they are publicly released on HuggingFace. The paper's analyses cover baselines, bias mitigation, and error modes, and offer insights into adapting LLMs for low-resource languages. Overall, the paper contributes to Vietnamese NLP by showing that targeted fine-tuning of compact LLMs can reach performance comparable to the state of the art on this task, and by providing ready-to-use models for downstream educational and information-retrieval tasks.

Abstract

Large Language Models (LLMs) have shown remarkable proficiency in Machine Reading Comprehension (MRC) tasks; however, their effectiveness for low-resource languages like Vietnamese remains largely unexplored. In this paper, we fine-tune and evaluate two state-of-the-art LLMs: Llama 3 (8B parameters) and Gemma (7B parameters), on ViMMRC, a Vietnamese MRC dataset. By utilizing Quantized Low-Rank Adaptation (QLoRA), we efficiently fine-tune these models and compare their performance against powerful LLM-based baselines. Although our fine-tuned models are smaller than GPT-3 and GPT-3.5, they outperform both traditional BERT-based approaches and these larger models. This demonstrates the effectiveness of our fine-tuning process, showcasing how modern LLMs can surpass the capabilities of older models like BERT while still being suitable for deployment in resource-constrained environments. Through intensive analyses, we explore various aspects of model performance, providing valuable insights into adapting LLMs for low-resource languages like Vietnamese. Our study contributes to the advancement of natural language processing in low-resource languages, and we make our fine-tuned models publicly available at: https://huggingface.co/iaiuet.
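To give a rough sense of why low-rank adaptation suits resource-constrained fine-tuning, the sketch below counts trainable parameters for a single weight matrix under a LoRA-style update, where the frozen weight of shape d_out x d_in is adjusted by the product of two small factors of rank r. The dimensions and rank used here are hypothetical illustration values, not settings reported in the paper, and the function names are invented for this example. QLoRA additionally keeps the frozen base weights in a quantized form, which this arithmetic does not model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the two low-rank factors:
    B with shape (d_out, rank) and A with shape (rank, d_in)."""
    return rank * (d_out + d_in)


# Hypothetical values chosen for illustration only (not taken from the paper).
d_out, d_in, rank = 4096, 4096, 16

full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
fraction = lora / full

print(f"dense matrix parameters:   {full:,}")
print(f"low-rank trainable params: {lora:,}")
print(f"trainable fraction:        {fraction:.4%}")
```

Under these assumed values, the adapter trains well under one percent of the parameters of the matrix it modifies, which is the main source of the memory savings during fine-tuning.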
