SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages

Jia Guo, Longxu Dou, Guangtao Zeng, Stanley Kok, Wei Lu, Qian Liu

TL;DR

SailCompass is a reproducible evaluation benchmark for Southeast Asian (SEA) languages, covering Indonesian, Vietnamese, and Thai with 14 datasets across generation, multiple-choice question (MCQ), and classification tasks, built within the OpenCompass framework. It examines how each task type is evaluated, applying prompt variants to MCQ tasks and calibration to classification tasks to improve robustness and faithfulness. Key findings: SEA-specialized LLMs generally outperform general LLMs, a balanced language distribution improves SEA-specialized models, and advanced prompting (including calibration and perplexity-based ranking) makes evaluation more reliable. The benchmark and code are public, enabling reproducible assessment and targeted improvement of LLMs for SEA languages.

Abstract

In this paper, we introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian (SEA) languages. SailCompass encompasses three main SEA languages and eight primary tasks comprising 14 datasets that cover three task types (generation, multiple-choice questions, and classification). To improve the robustness of the evaluation approach, we explore different prompt configurations for multiple-choice questions and leverage calibration to improve the faithfulness of classification tasks. With SailCompass, we derive the following findings: (1) SEA-specialized LLMs still outperform general LLMs, although the gap has narrowed; (2) a balanced language distribution is important for developing better SEA-specialized LLMs; (3) advanced prompting techniques (e.g., calibration, perplexity-based ranking) are necessary to better utilize LLMs. All datasets and evaluation scripts are public.
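
The two techniques named in finding (3) can be illustrated with a minimal sketch. This is not the SailCompass implementation: the function names, the toy log-probabilities, and the choice of a content-free-input style of calibration are all assumptions made for illustration, and the abstract does not specify which calibration variant the benchmark uses. Perplexity-based ranking scores each answer option by the perplexity of its text and picks the lowest; calibration rescales label probabilities by the model's prior preference for each label.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity(option_logprobs):
    """Return the option label whose text has the lowest perplexity."""
    return min(option_logprobs, key=lambda label: perplexity(option_logprobs[label]))


def calibrate(label_probs, content_free_probs):
    """Divide each label probability by the probability the model assigns to
    that label on a content-free input, then renormalize to sum to 1."""
    scaled = {k: label_probs[k] / content_free_probs[k] for k in label_probs}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}


# Hypothetical per-token log-probabilities for three MCQ options.
option_logprobs = {
    "A": [-2.0, -1.0, -3.0],
    "B": [-0.5, -0.5, -0.5],
    "C": [-1.0, -4.0],
}
best_option = rank_by_perplexity(option_logprobs)

# Hypothetical classifier outputs: the model is biased toward "positive".
label_probs = {"positive": 0.6, "negative": 0.4}
content_free_probs = {"positive": 0.8, "negative": 0.2}
calibrated = calibrate(label_probs, content_free_probs)
calibrated_label = max(calibrated, key=calibrated.get)
```

In this toy example the uncalibrated prediction would be "positive", but after dividing out the model's prior preference for that label the calibrated prediction becomes "negative", which shows how calibration can change the reported accuracy of a classification task.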

Paper Structure

This paper contains 32 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Machine translation results with chrF++ as evaluation metric.
  • Figure 2: Question Answering results with Exact Match as evaluation metric.
  • Figure 3: Analysis of prediction bias across prompt variants, with PPL-based evaluation approach.
  • Figure 4: The illustration of prompt configuration ${T_o}$. Note that the gray text is NOT used in this configuration.
  • Figure 5: The illustration of prompt configuration ${T_iT_o}$.
  • ...and 3 more figures