Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark
Feng Jiang, Weihao Liu, Xiaomin Chu, Peifeng Li, Qiaoming Zhu, Haizhou Li
TL;DR
This work addresses the shortage of Chinese paragraph-level topic-structure resources. It introduces a three-layer topic representation (title as the supertopic, subheadings as subtopics, paragraphs as basic topics) and a two-stage man-machine collaborative annotation pipeline, which together yield CPTS, a corpus of 14393 documents with high annotation quality ($\text{IAA}=94.79\%$, $\kappa=0.849$). The corpus is benchmarked on topic segmentation and outline generation with several baselines, including ChatGPT. The results show that fine-tuned models achieve strong segmentation and outline performance, that ChatGPT remains competitive on generation tasks, and that incorporating CPTS improves discourse parsing on MCDTB. Overall, CPTS provides a large, rich Chinese paragraph-level topic-structure resource that supports downstream tasks and LLM-assisted content control, and its open access enables broader research and applications.
Abstract
Topic segmentation and outline generation aim to divide a document into coherent topic sections and generate corresponding subheadings, unveiling the discourse topic structure of a document. Compared with sentence-level topic structure, paragraph-level topic structure helps readers quickly grasp and understand the overall context of a document from a higher level, benefiting many downstream tasks such as summarization, discourse parsing, and information retrieval. However, the lack of large-scale, high-quality Chinese paragraph-level topic structure corpora has restrained related research and applications. To fill this gap, we build a Chinese paragraph-level topic representation, corpus, and benchmark in this paper. First, we propose a hierarchical paragraph-level topic structure representation with three layers to guide the corpus construction. Then, we employ a two-stage man-machine collaborative annotation method to construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS), achieving high quality. We also build several strong baselines, including ChatGPT, to validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation), and we preliminarily verify its usefulness for a downstream task (discourse parsing).
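To make the three-layer representation concrete, the following is a minimal Python sketch of how a document with a title, subheadings, and paragraphs could be stored, and how targets for the two tasks could be derived from it. The class names, the helper functions, and the labeling convention (1 marks the last paragraph of a topic section) are illustrative assumptions, not the corpus's actual file format or the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subtopic:
    """Middle layer: a subheading covering a run of paragraphs (hypothetical name)."""
    subheading: str
    paragraphs: List[str] = field(default_factory=list)  # bottom layer: basic topics


@dataclass
class Document:
    """Top layer: the title acts as the supertopic (hypothetical name)."""
    title: str
    subtopics: List[Subtopic] = field(default_factory=list)


def boundary_labels(doc: Document) -> List[int]:
    """Topic segmentation target, one label per paragraph.

    Assumed convention: 1 if the paragraph ends a topic section, else 0.
    """
    labels: List[int] = []
    for sub in doc.subtopics:
        if not sub.paragraphs:
            continue  # skip empty sections rather than emit a dangling label
        labels.extend([0] * (len(sub.paragraphs) - 1) + [1])
    return labels


def outline(doc: Document) -> List[str]:
    """Outline generation target: the ordered list of subheadings."""
    return [sub.subheading for sub in doc.subtopics]


# Toy document with placeholder text (not taken from the corpus).
example = Document(
    title="Example title",
    subtopics=[
        Subtopic("Background", ["p1", "p2"]),
        Subtopic("Results", ["p3", "p4", "p5"]),
    ],
)

print(boundary_labels(example))  # [0, 1, 0, 0, 1]
print(outline(example))          # ['Background', 'Results']
```

Under this framing, topic segmentation reduces to predicting one binary label per paragraph, and outline generation to producing one subheading per predicted section.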
