BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat

TL;DR

BURMESE-SAN is the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies. Its results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone.

Abstract

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages. https://leaderboard.sea-lion.ai/detailed/MY

Paper Structure (25 sections, 7 figures, 9 tables)

Figures (7)

  • Figure 1: BURMESE-SAN Benchmark (Left) and Dataset Curation Process for the benchmark (Right). BURMESE-SAN is a benchmark that holistically evaluates LLM performance across a wide range of Burmese language tasks. The evaluation is based on native Burmese text, with prompts written in formal Burmese to ensure clarity and grammatical correctness.
  • Figure 2: Left: Comparison of original models against SEA-fine-tuned variants. Right: SEA-LION models compared with their quantized versions, NVIDIA FP4 (NVFP4) and Dynamic FP8 (DynFP8).
  • Figure 3: Acceptable and Not Acceptable Grammar Errors in the Dataset.
  • Figure 4: Different Types of Spelling Errors.
  • Figure 5: Examples of variation in Burmese translations by native speakers. (a) Differences in technical term usage, transliteration, and word order. (b) Differences in particle choice and formality.
  • ...and 2 more figures