ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models

Trong-Hieu Nguyen; Anh-Cuong Le; Viet-Cuong Nguyen

TL;DR

ViLLM-Eval addresses the gap in Vietnamese-language evaluation by introducing four datasets (LAMBADA_vi, Exam Vietnamese, General Knowledge Vietnamese, Comprehension QA Vietnamese) that assess advanced knowledge and contextual reasoning in LLMs, scored by perplexity ($PPL$) and accuracy ($acc$). It evaluates a diverse set of Vietnamese-enabled models, including Llama-2 7B and SeaLLM-7B-v2, under a five-shot multiple-choice question (MCQ) framework and a perplexity-based assessment regime. The findings show that even the top models have substantial room for improvement on Vietnamese language tasks, underscoring the need for language-specific data and fine-tuning. The work provides a culturally attuned evaluation platform that supports the continued advancement of Vietnamese LLMs and informs the VLSP 2023 challenges.

Abstract

The rapid advancement of large language models (LLMs) necessitates new benchmarks that accurately assess their capabilities. To address this need for Vietnamese, this work introduces ViLLM-Eval, a comprehensive evaluation suite designed to measure the advanced knowledge and reasoning abilities of foundation models within a Vietnamese context. ViLLM-Eval consists of multiple-choice questions and next-word prediction tasks spanning various difficulty levels and diverse disciplines, ranging from the humanities to science and engineering. A thorough evaluation of the most advanced LLMs on ViLLM-Eval revealed that even the best-performing models have significant room for improvement in understanding and responding to Vietnamese language tasks. We believe ViLLM-Eval will be instrumental in identifying key strengths and weaknesses of foundation models, ultimately promoting their development and enhancing their performance for Vietnamese users. This paper provides a thorough overview of ViLLM-Eval as part of the Vietnamese Large Language Model shared task, held within the 10th International Workshop on Vietnamese Language and Speech Processing (VLSP 2023).
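
The two metrics named in the TL;DR, $PPL$ for the next-word prediction tasks and $acc$ for the multiple-choice questions, can be illustrated with a minimal sketch. It assumes the standard definitions (perplexity as the exponential of the negative mean log-probability, accuracy as the fraction of exact matches). This page does not state how the paper normalizes $PPL$ (per token, per word, or over the target word only), so the function names and input shapes below are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean(log p)).

    Assumption: one log-probability per scored token, as returned by
    whatever model is being evaluated.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of multiple-choice predictions that match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Under these definitions, a model that assigns probability 0.5 to every scored token has a perplexity of 2, and a model that answers two of three questions correctly has an accuracy of about 0.667. Lower perplexity and higher accuracy are better.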

Paper Structure (24 sections, 2 figures, 4 tables)

Introduction
Related Work
Methodology
Design Principle
LAMBADA_vi
Data sources
Data Processing
Exam Vietnamese Dataset
Data sources
Data processing
General Knowledge Vietnamese Dataset
Data Sources
Data Processing
Comprehension QA Vietnamese Dataset
Data Sources
...and 9 more sections

Figures (2)

Figure 1: Process of creating LAMBADA_vi dataset
Figure 2: Process of collecting, rechecking, and creating multiple-choice data
