An Improved Traditional Chinese Evaluation Suite for Foundation Model
Zhi-Rui Tam, Ya-Ting Pai, Yen-Wei Lee, Jun-Da Chen, Wei-Min Chu, Sega Cheng, Hong-Han Shuai
TL;DR
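This paper presents TMMLU+, a large language model evaluation benchmark for Traditional Chinese. It extends TMMLU to 66 subjects and 22,690 questions spanning elementary to professional level, and includes a development set to support few-shot prompting. An extensive evaluation of 23 open-source Chinese models and several closed-source models shows that Traditional Chinese models generally lag behind Simplified Chinese models and have not yet surpassed the human baseline. The paper also systematically analyzes how prompting strategy, vocabulary design, writing system (Traditional vs. Simplified), and error patterns affect model performance, finding that the vocabulary fertility score correlates significantly with benchmark performance and that CoT prompting often lowers performance on TMMLU+. The study further compares cross-domain differences between Simplified and Traditional Chinese benchmarks and outlines directions for developing high-quality LLMs for Traditional Chinese. TMMLU+ and its source code are publicly available to encourage further research on Traditional Chinese and model improvement.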
Abstract
We present TMMLU+, a new benchmark designed for Traditional Chinese language understanding. TMMLU+ is a multiple-choice question-answering dataset with 66 subjects from elementary to professional level. It is six times larger and has a more balanced subject distribution than its predecessor, Taiwan Massive Multitask Language Understanding (TMMLU). We also benchmark closed-source models and 26 open-weight Chinese large language models (LLMs), with parameter counts ranging from 1.8B to 72B, on the proposed TMMLU+. Our findings reveal that (1) Traditional Chinese models still trail behind their Simplified Chinese counterparts, highlighting a need for more focused advancements in LLMs catering to Traditional Chinese; (2) current LLMs still fall short of human performance in average scores, indicating a potential need for future research to delve deeper into social science and humanities subjects; and (3) among all the tokenization compression metrics examined, only the fertility score demonstrates strong correlations with our benchmark results. We foresee that TMMLU+ will pinpoint areas for future model improvement, thereby narrowing the gap between machine and human linguistic capabilities and supporting researchers in developing Traditional Chinese LLMs. Our dataset, along with the benchmark source code, is accessible at huggingface.co/datasets/ikala/tmmluplus.
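To make the fertility score mentioned in finding (3) more concrete, the following is a minimal sketch of how such a metric could be computed. It assumes fertility means the average number of tokens produced per character of input text; the abstract does not state the exact normalization unit, so this is an assumption. The tokenizer here (`greedy_tokenize`) and its tiny vocabulary are hypothetical stand-ins for a real model tokenizer, not anything taken from the paper.

```python
from typing import Callable, Iterable, List, Set


def greedy_tokenize(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Toy longest-match tokenizer (hypothetical, for illustration only).

    Characters not covered by the vocabulary fall back to one token per
    UTF-8 byte, mimicking the byte fallback many subword tokenizers use.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # No vocabulary match: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average tokens per character over a corpus (assumed definition)."""
    total_tokens = 0
    total_chars = 0
    for t in texts:
        total_tokens += len(tokenize(t))
        total_chars += len(t)
    if total_chars == 0:
        raise ValueError("corpus is empty")
    return total_tokens / total_chars


if __name__ == "__main__":
    vocab = {"臺灣", "語", "言"}  # illustrative vocabulary, not from the paper
    score = fertility(["臺灣語言"], lambda t: greedy_tokenize(t, vocab))
    print(score)  # 3 tokens over 4 characters -> 0.75
```

Under this definition, a lower score means the vocabulary covers the text with fewer tokens. A vocabulary that lacks many Traditional Chinese characters would push more text through the byte fallback and raise the score, which is one plausible reason a metric like this could track benchmark performance.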
