Measuring Taiwanese Mandarin Language Understanding

Po-Heng Chen, Sijia Cheng, Wei-Lin Chen, Yen-Ting Lin, Yun-Nung Chen

TL;DR

The paper addresses the lack of robust evaluation benchmarks for Traditional Chinese in the Taiwanese Mandarin context by introducing TMLU, a 2,981-question MCQ benchmark spanning 37 subjects and five discipline groups, with manually curated chain-of-thought explanations. It evaluates 24 advanced LLMs using both direct-answer and chain-of-thought prompting, employing two answer-extraction methods and contamination analyses to ensure fairness. Key findings show that proprietary multilingual models generally outperform open-weight Chinese models, and Taiwanese Mandarin–tailored models lag behind Simplified-Chinese counterparts, revealing substantial localization headroom. Overall, TMLU provides a localized, robust evaluation framework and resources to spur development of Taiwan-specific LLMs and cross-model comparisons, while acknowledging generation-task limitations and the need for further contamination-focused research.

Abstract

The evaluation of large language models (LLMs) has drawn substantial attention in the field recently. This work focuses on evaluating LLMs in a Chinese context, specifically for Traditional Chinese, which has been largely underrepresented in existing benchmarks. We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capability of LLMs in the context of Taiwanese Mandarin. TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels. In addition, we curate chain-of-thought-like few-shot explanations for each subject to facilitate the evaluation of complex reasoning skills. To establish a comprehensive baseline, we conduct extensive experiments and analysis on 24 advanced LLMs. The results suggest that Chinese open-weight models demonstrate inferior performance compared to multilingual proprietary ones, and open-weight models tailored for Taiwanese Mandarin lag behind their Simplified-Chinese counterparts. The findings indicate substantial headroom for improvement and emphasize the goal of TMLU: to foster the development of localized Taiwanese-Mandarin LLMs. We release the benchmark and evaluation scripts for the community to promote future research.

Paper Structure (20 sections, 1 equation, 7 figures, 9 tables)

Figures (7)

  • Figure 1: An overview of our proposed TMLU benchmark. TMLU consists of 37 subjects across middle school, high school, and professional levels. In addition, TMLU includes a set of Taiwan-specific questions whose answers are unique to Taiwanese culture.
  • Figure 2: An example prompt for few-shot direct answer evaluation on TMLU.
  • Figure 3: The Min-k% Prob (Shi et al., 2023) of six base models on TMMLU-plus (Tam et al., 2024) and our dataset TMLU. The lower the Min-k% Prob is, the more likely the input instances of the datasets are in the model's pre-training data.
  • Figure 4: Performance difference ($\delta$) between direct answer and CoT prompting on STEM subjects. Only models exhibiting improvements ($\delta>0$) are presented.
  • Figure 5: Average accuracy of models on questions from different years. The accuracy is calculated by averaging across the number of questions. Full results are provided in the year-comparison table (tab:year_comparison).
  • ...and 2 more figures
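
The contamination analysis in Figure 3 relies on the Min-k% Prob score. As a rough illustration only, the sketch below follows the commonly cited definition: take the per-token log-probabilities a model assigns to a text, keep the k% of tokens with the lowest log-probability, and average them. The function name `min_k_prob`, the default `k`, and the example log-probabilities are all hypothetical and are not taken from the paper's evaluation scripts. How a score maps to "likely seen in pre-training" depends on the sign convention used when plotting, so read Figure 3 according to its own caption.

```python
from typing import Sequence


def min_k_prob(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of least likely tokens.

    token_logprobs: log p(token_i | preceding tokens) for one text, obtained
    from whatever model is being probed (assumed to be computed elsewhere).
    k: fraction of tokens to keep, 0 < k <= 1 (the default is an assumption).
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so that very short texts still get a score.
    n_keep = max(1, int(len(token_logprobs) * k))
    # Ascending sort puts the lowest log-probabilities first.
    lowest = sorted(token_logprobs)[:n_keep]
    return sum(lowest) / n_keep


# Hypothetical log-probabilities for a four-token text.
example_logprobs = [-0.5, -3.0, -1.0, -2.0]
score = min_k_prob(example_logprobs, k=0.5)  # averages -3.0 and -2.0
```

A dataset-level number could then be formed by averaging this score over all instances, though whether the paper aggregates in exactly that way is an assumption here.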