CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
Guixian Xu, Zeli Su, Ziyin Zhang, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
TL;DR
This paper addresses data scarcity in headline generation for China's minority languages by introducing CMHG, a large-scale open dataset with 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian, plus a high-quality test set annotated by native speakers. It describes data collection from government and news sources, rigorous cleaning, and a structured annotation framework with guidelines, quality control, and incentives intended to make the benchmark reliable. The work evaluates both small fine-tuned models and few-shot large language models on the CMHG benchmark, reporting strong performance that supports the dataset's utility for future research in minority-language NLP. It also discusses limitations (resource gaps, language coverage) and outlines directions for expanding benchmarks and tasks to other underrepresented languages.
Abstract
Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges because their writing systems differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks such as headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation. Additionally, we provide a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and will contribute to the development of related benchmarks.
