Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

Sang T. Truong, Duc Q. Nguyen, Toan Nguyen, Dong D. Le, Nhi N. Truong, Tho Quan, Sanmi Koyejo

TL;DR

This work addresses the shortage of robust Vietnamese LLMs and benchmarks in two ways. First, it fine-tunes five open Vietnamese LLMs (URA-LLaMa 7B/13B/70B, MixSUra 7B, GemSUra 7B) using QLoRA on Vietnamese Wikipedia, News-Corpus, and high-school essays. Second, it builds an open evaluation framework covering ten real-world scenarios with 31 metrics. The fine-tuned models outperform their base counterparts, and the results indicate that high-quality fine-tuning data is a key driver of performance, while larger models may introduce more bias and toxicity without careful data curation. The study also provides two novel Vietnamese reasoning datasets and an open-source evaluation toolkit, enabling reproducibility and community-driven benchmarking. Overall, the results suggest that principled fine-tuning on curated datasets can unlock strong Vietnamese capabilities in LLMs, and they highlight the importance of fairness and toxicity considerations in multilingual AI development.

Abstract

Recent advancements in large language models (LLMs) have underscored their importance in the evolution of artificial intelligence. However, despite extensive pretraining on multilingual datasets, available open-source LLMs exhibit limited effectiveness in processing Vietnamese. The challenge is exacerbated by the absence of systematic benchmark datasets and metrics tailored for Vietnamese LLM evaluation. To mitigate these issues, we have fine-tuned LLMs specifically for Vietnamese and developed a comprehensive evaluation framework encompassing 10 common tasks and 31 metrics. Our evaluation results reveal that the fine-tuned LLMs exhibit enhanced comprehension and generative capabilities in Vietnamese. Moreover, our analysis indicates that models with more parameters can introduce more biases and uncalibrated outputs, and that the key factor influencing LLM performance is the quality of the training or fine-tuning datasets. These insights underscore the significance of meticulous fine-tuning with high-quality datasets in enhancing LLM performance.
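
The abstract refers to "uncalibrated outputs" without defining how calibration is measured. As a rough illustration only, the sketch below computes expected calibration error (ECE), one common way to quantify the gap between a model's stated confidence and its actual accuracy. This section does not state which calibration metric the paper's framework uses, so the function name, the equal-width binning, and the default of 10 bins are all assumptions made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: weighted mean of |avg confidence - accuracy| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct:     booleans, True where the chosen answer was right.
    n_bins:      number of equal-width confidence bins (assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width bins; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Four predictions at 90% confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A value near 0 means confidence tracks accuracy closely; larger values indicate over- or under-confidence. In the usage example, the model claims 90% confidence but is right 75% of the time, so the score is about 0.15.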

Paper Structure (64 sections, 1 equation, 10 figures, 12 tables)

Figures (10)

  • Figure 1: Overall capacities of LLMs
  • Figure 2: Performance on zero-shot prompt
  • Figure 3: Performance with few-shot prompt
  • Figure 4: Performance with Chain-of-Thought prompt
  • Figure 5: Performance under weaker prompt
  • ...and 5 more figures