CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs

Chengwei Wu, Jiapu Wang, Mingyang Gao, Xingrui Zhuo, Jipeng Guo, Runlin Lei, Haoran Luo, Tianyu Chen, Haoyi Zhou, Shirui Pan, Zechao Li

TL;DR

This work introduces CDTP, a large-scale Chinese Data-Text Pair dataset with over 7 million text samples aligned to 15 million triples across four domains, designed to address the lack of structured signals in Chinese corpora. Built atop CDTP, the CB-ECLLM benchmark evaluates Chinese LLMs on knowledge-driven tasks—Knowledge Graph Completion (KGC), Triple-to-Text (T2T) generation, and Question Answering (QA)—employing rigorous data collection, processing, and multi-task fine-tuning assessments. The authors conduct extensive experiments across eight Chinese LLMs to analyze effectiveness, SFT benefits, and robustness to out-of-distribution data, revealing clear gains from supervised fine-tuning and stronger performance for larger models in structured tasks. The work provides an open-source framework for reproducible evaluation and identifies future directions, including broader domain coverage and cross-modal extension, to advance robust knowledge-grounded Chinese language understanding.

Abstract

Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks. However, Chinese LLMs face unique challenges, primarily due to the dominance of unstructured free text and the lack of structured representations in Chinese corpora. While existing benchmarks for LLMs partially assess Chinese LLMs, they are still predominantly English-centric, fail to address the unique linguistic characteristics of Chinese, and lack the structured datasets essential for robust evaluation. To address these challenges, we present a Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) based on the newly constructed Chinese Data-Text Pair (CDTP) dataset. Specifically, CDTP comprises over 7 million aligned data-text pairs, each consisting of unstructured text coupled with one or more corresponding triples, for a total of 15 million triples spanning four critical domains. The core contributions of CDTP are threefold: (i) enriching Chinese corpora with high-quality structured information; (ii) enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii) supporting multi-task fine-tuning to assess generalization and robustness across scenarios, including Knowledge Graph Completion, Triple-to-Text generation, and Question Answering. Furthermore, we conduct extensive experiments and ablation studies to assess the effectiveness of the benchmark, the impact of Supervised Fine-Tuning (SFT), and model robustness. To support reproducible research, we offer an open-source codebase and outline potential directions for future investigations based on our insights.
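
To make the notion of a data-text pair concrete, the following is a minimal Python sketch of how one pair (a text plus one or more triples) might be represented and turned into inputs for the three tasks named in the abstract. The class name `DataTextPair`, the helper functions, the prompt wording, and the sample sentence are all illustrative assumptions; they are not taken from the CDTP release or the CB-ECLLM codebase, whose actual schema and prompts may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A triple is (head entity, relation, tail entity).
Triple = Tuple[str, str, str]


@dataclass
class DataTextPair:
    """Hypothetical record: unstructured text aligned with one or more triples."""
    text: str
    triples: List[Triple]


def kgc_example(triple: Triple) -> Dict[str, object]:
    """Knowledge Graph Completion: hide the tail entity and ask for it."""
    head, relation, tail = triple
    return {"query": (head, relation, "?"), "answer": tail}


def t2t_example(pair: DataTextPair) -> Dict[str, object]:
    """Triple-to-Text: the triples are the input, the aligned text is the target."""
    return {"input": pair.triples, "target": pair.text}


def qa_example(triple: Triple) -> Dict[str, str]:
    """Question Answering: a templated question about the tail entity
    (the template is an assumption, not the benchmark's prompt)."""
    head, relation, tail = triple
    return {"question": f"{head}的{relation}是什么？", "answer": tail}


# Invented sample for illustration only; not a record from the dataset.
pair = DataTextPair(
    text="鲁迅出生于绍兴。",
    triples=[("鲁迅", "出生地", "绍兴")],
)

kgc = kgc_example(pair.triples[0])
t2t = t2t_example(pair)
qa = qa_example(pair.triples[0])
```

In this sketch a single aligned pair yields one example per task, which is one plausible reading of how a shared dataset can support multi-task fine-tuning and evaluation.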

Paper Structure

This paper contains 30 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The example of our proposed Chinese Data-Text Pair (CDTP) dataset.
  • Figure 2: Performance comparison between KGC and QA on CDTP_HP.
  • Figure 3: Ranking of 8 base models on CDTP_HP across three tasks. Larger values indicate better performance.
  • Figure 4: Performance comparison of SFT and Base model on different datasets in T2T task.
  • Figure 5: Comparison of robustness between base models and SFT models under the out-of-distribution (OOD) data.
  • ...and 5 more figures