LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models

Jian Gao, Richeng Xuan, Zhaolu Kang, Dingshi Liao, Wenxin Huang, Zongmou Huang, Yangdi Xu, Bowen Qin, Zheqi He, Xi Yang, Changjin Li

TL;DR

LaoBench addresses the evaluation gap for Lao, a low-resource Southeast Asian language, by providing over 17,000 items across three dimensions: Knowledge Application, K12 Foundational Education, and Bilingual Translation. It uses a hybrid data construction pipeline that combines expert curation with agent-assisted verification, and it partitions the data into an open-source subset (Lao-7k), a closed-source subset (Lao-10k), and a set of open-ended prompts (Lao-500), enabling both transparency and secure benchmarking. Experimental results show that current state-of-the-art LLMs still struggle with Lao across tasks, though chain-of-thought (CoT) prompting and larger, instruction-tuned models help. LaoBench aims to catalyze progress in Lao NLP and broader Southeast Asian language technologies by offering a rigorous, culturally grounded benchmark and an official evaluation platform.

Abstract

The rapid advancement of large language models (LLMs) has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages like Lao. To fill this gap, we introduce LaoBench, the first large-scale, high-quality, and multidimensional benchmark dataset dedicated to assessing LLMs' comprehensive language understanding and reasoning abilities in Lao. LaoBench comprises over 17,000 carefully curated samples spanning three core dimensions: knowledge application, K12 foundational education, and bilingual translation among Lao, Chinese, and English. The dataset is divided into open-source and closed-source subsets, with the closed-source portion enabling black-box evaluation on an official platform to ensure fairness and data security. Our data construction pipeline integrates expert human curation with automated agent-assisted verification, ensuring linguistic accuracy, cultural relevance, and educational value. Benchmarking multiple state-of-the-art LLMs on LaoBench reveals that current models still face significant challenges in mastering Lao across diverse tasks. We hope LaoBench will catalyze further research and development of AI technologies for underrepresented Southeast Asian languages.

Paper Structure

This paper contains 23 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Example cases from LaoBench illustrating the three core evaluation dimensions.
  • Figure 2: Overview of the LaoBench dataset construction pipeline. The process consists of three main stages: (1) Raw Data Acquisition from authoritative Lao sources; (2) Dataset Construction combining expert-driven question formulation for closed questions and automated LLM-based selection for open-ended prompts; and (3) Validation and Quality Assurance integrating human expert review and agent-assisted verification to ensure linguistic accuracy, cultural relevance, and educational value.
  • Figure 3: Distribution of LaoBench samples across the three main categories—Knowledge Application, K12 Education, and Translation—and their respective subdomains, illustrating the dataset’s comprehensive coverage.
  • Figure 4: Overall performance comparison of evaluated models on Lao-7k across the K12 Education, Translation, and Knowledge Application categories. Closed-source models generally outperform open-source models, with GPT-5-High leading in accuracy and Gemini-2.5-Pro leading in BLEU score.
  • Figure 5: Radar chart comparing performance of models with and without Chain-of-Thought (CoT) prompting on Lao-7k. Models with CoT (Thinking) consistently outperform their Non-Thinking counterparts, especially in complex reasoning subdomains such as Thinking & Philosophy and Knowledge Application.
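
The per-category comparison behind Figures 4 and 5 can be illustrated with a minimal sketch. It assumes closed questions are graded by exact match against a gold answer and that each record carries a category label and a prompting mode; the record fields, the mode names, and the toy data are all hypothetical and are not taken from the LaoBench release or its evaluation platform.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(category, mode): accuracy} for graded closed-question records.

    Each record is a dict with the hypothetical keys
    'category', 'mode', 'prediction', and 'answer'.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        key = (rec["category"], rec["mode"])
        totals[key] += 1
        # Exact-match grading: an assumption, not the paper's stated metric.
        hits[key] += int(rec["prediction"] == rec["answer"])
    return {key: hits[key] / totals[key] for key in totals}


def cot_gain(acc, category):
    """Accuracy difference between the thinking and non-thinking modes."""
    return acc[(category, "thinking")] - acc[(category, "non_thinking")]


# Toy records for illustration only; these are not LaoBench results.
records = [
    {"category": "K12 Education", "mode": "thinking", "prediction": "A", "answer": "A"},
    {"category": "K12 Education", "mode": "thinking", "prediction": "B", "answer": "B"},
    {"category": "K12 Education", "mode": "non_thinking", "prediction": "A", "answer": "A"},
    {"category": "K12 Education", "mode": "non_thinking", "prediction": "C", "answer": "B"},
]

acc = accuracy_by_group(records)
gain = cot_gain(acc, "K12 Education")
```

Grouping by (category, mode) keeps the comparison symmetric, so the same table can feed either a bar chart per category or a radar chart per subdomain. Translation items would need a separate metric such as BLEU rather than exact match.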