SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia

Chaoqun Liu, Wenxuan Zhang, Jiahao Ying, Mahani Aljunied, Anh Tuan Luu, Lidong Bing

TL;DR

SeaExam and SeaBench address the need for Southeast Asian (SEA)-specific multilingual benchmarks by building on real-world SEA content rather than on translations of English benchmarks. SeaExam draws from regional exams, while SeaBench comprises open-ended, multi-turn tasks, crafted by native speakers, that reflect daily interactions and sensitivities in SEA. Evaluations across nine LLMs show that these benchmarks align more closely with actual SEA usage and better reveal differences across languages and across models, though safety performance in multilingual contexts remains a challenge. The work highlights the importance of real-world, culturally grounded benchmarks and suggests expanding language coverage and introducing dynamic updates to sustain relevance.

Abstract

This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets, which are primarily derived from English translations, these benchmarks are constructed from real-world scenarios in SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench discern LLM performance on SEA language tasks more effectively than their translated counterparts. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.

Paper Structure

This paper contains 29 sections, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Compared with local usage queries in Vietnamese, questions in English-based translations show more American context (Hawaii). To better illustrate this discrepancy, we extracted the object in these questions and visualised their distribution. The results show that the objects in translated questions cover only a small portion of those in local usage queries.
  • Figure 2: Data Examples for the three languages in (a) SeaExam and (b) SeaBench. The correct answer for SeaExam is in bold. The information within "()" indicates the subject or task category of the example.
  • Figure 3: Cluster distance between each benchmark and Wild Queries. (a) Cluster distance of entity embeddings between each exam dataset and Wild Queries. (b) Cluster distance of sentence embeddings between each multi-turn dataset and Wild Queries. A smaller value indicates greater similarity to Wild Queries.
  • Figure 4: (a) Accuracy standard deviation across the nine models for each language on SeaExam and MMLU-SEA. (b) Score standard deviation across the nine models for each language on SeaBench and MT-bench-SEA.
  • Figure 5: (a) Accuracy standard deviation across three SEA languages for the nine models on SeaExam and MMLU-SEA. (b) Score standard deviation across three SEA languages for the nine models on SeaBench and MT-bench-SEA.
  • ...and 8 more figures
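
The two dispersion measures named in the captions of Figures 4 and 5 can be illustrated with a minimal sketch. It assumes a table of per-model, per-language accuracies and uses the population standard deviation; the captions do not say which estimator the authors used, so that choice is an assumption. The model names, language labels, and scores below are hypothetical placeholders, not results from the paper.

```python
from statistics import pstdev

# Hypothetical accuracies, NOT results from the paper: scores[model][language].
scores = {
    "model_a": {"lang_x": 0.60, "lang_y": 0.50, "lang_z": 0.70},
    "model_b": {"lang_x": 0.80, "lang_y": 0.70, "lang_z": 0.60},
    "model_c": {"lang_x": 0.70, "lang_y": 0.90, "lang_z": 0.50},
}


def std_across_models(scores):
    """Figure 4 style: for each language, the spread of scores over models.

    A larger value means the benchmark separates the models more clearly
    in that language.
    """
    languages = sorted(next(iter(scores.values())))
    return {
        lang: pstdev([per_lang[lang] for per_lang in scores.values()])
        for lang in languages
    }


def std_across_languages(scores):
    """Figure 5 style: for each model, the spread of its scores over languages.

    A larger value means the benchmark exposes a bigger gap between
    languages for that model.
    """
    return {
        model: pstdev(list(per_lang.values()))
        for model, per_lang in scores.items()
    }


if __name__ == "__main__":
    print(std_across_models(scores))
    print(std_across_languages(scores))
```

Swapping `pstdev` for `statistics.stdev` gives the sample standard deviation instead; with only a handful of models or languages the two differ noticeably, so comparisons between benchmarks should use the same estimator throughout.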