TLUE: A Tibetan Language Understanding Evaluation Benchmark

Fan Gao, Cheng Huang, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Hao Wang, Xiao Feng, Yongbin Yu

TL;DR

TLUE is the first large-scale benchmark for Tibetan language understanding, addressing a gap in LLM evaluation for low-resource languages. It combines Ti-MMLU, a 67-subdomain knowledge assessment, with Ti-SafetyBench, a 7-category safety evaluation, and relies on translation-based adaptation plus expert curation to preserve linguistic and cultural fidelity. Across a mix of open-source and proprietary LLMs, most models score below the random baseline, with STEM and safety alignment proving particularly challenging in Tibetan and with marked performance drops relative to the high-resource source benchmarks (CMMLU and SafetyBench). The work underscores the need for language-specific pretraining data, targeted fine-tuning, and culturally aware safety alignment, and positions TLUE as a foundational resource for Tibetan language AI and for low-resource language evaluation more broadly.

Abstract

Large language models have made tremendous progress in recent years, but low-resource languages, like Tibetan, remain significantly underrepresented in their evaluation. Despite Tibetan being spoken by over seven million people, it has largely been neglected in the development and assessment of large language models. To address this gap, we present a Tibetan Language Understanding Evaluation Benchmark, TLUE, the first large-scale benchmark for measuring the proficiency of LLMs in the Tibetan language. TLUE comprises two major components: a comprehensive multi-task understanding benchmark spanning 5 domains and 67 subdomains, and a safety benchmark encompassing 7 subdomains. Then, we evaluate a diverse set of state-of-the-art large language models. Experimental results demonstrate that most large language models perform below the random baseline, highlighting the considerable challenges they face in Tibetan language processing. TLUE provides a crucial foundation for advancing future research in Tibetan language understanding and highlights the importance of promoting greater inclusivity in the development of large language models.
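To make the "below the random baseline" comparison concrete, here is a minimal sketch of how such a check could be computed for multiple-choice questions. It is not the authors' evaluation code: the function names, the toy answers, and the assumption of four options per question are all illustrative, and the summary above does not state the benchmark's actual option counts.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the gold answers."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(option_counts):
    """Expected accuracy of uniform random guessing.

    option_counts[i] is the number of answer options for question i,
    so the expected score is the mean of 1/k over all questions.
    """
    if not option_counts:
        raise ValueError("option_counts must be non-empty")
    return sum(1.0 / k for k in option_counts) / len(option_counts)


# Toy data (hypothetical, not taken from TLUE).
answers = ["A", "C", "B", "D", "A", "B", "C", "D"]
predictions = ["A", "B", "D", "A", "C", "A", "B", "A"]
option_counts = [4] * len(answers)  # assumption: four options per question

acc = accuracy(predictions, answers)
baseline = random_baseline(option_counts)
below = acc < baseline

print(f"accuracy={acc:.3f} baseline={baseline:.3f} below_random={below}")
```

Averaging 1/k per question, rather than hard-coding 25%, keeps the baseline meaningful if questions have differing numbers of options.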

Paper Structure

This paper contains 52 sections, 16 figures, 18 tables.

Figures (16)

  • Figure 1: The Curation Process of TLUE
  • Figure 2: LLMs' Performance Degradation from CMMLU [li2023cmmlu] to Ti-MMLU
  • Figure 3: LLMs' Performance Degradation from SafetyBench [zhang2023safetybench] to Ti-SafetyBench
  • Figure 4: Average accuracy on TLUE across different model scales for LlaMA-3.1 [dubey2024LlaMA] and Qwen-2.5 [Qwen-2.5]
  • Figure 5: Overview of the TLUE Benchmark
  • ...and 11 more figures