Creation of the Chinese Adaptive Policy Communication Corpus
Bolun Sun, Charles Chang, Yuen Yuen Ang, Pingxu Hao, Ruotong Mu, Yuchen Xu, Zhengxin Zhang
TL;DR
CAPC-CG addresses the scarcity of open, high-quality Chinese policy texts by building a large, paragraph-level corpus of central-government directives (1949–2023) annotated with a five-color taxonomy derived from adaptive policy communication. The work combines expert gold-standard labeling with a two-round workflow, achieving inter-annotator reliability of $\kappa=0.86$ and enabling scalable modeling via fine-tuned LLMs, while offering a cost-efficient segmentation pipeline. It provides rich metadata, a relational data schema, and robust baseline results for both Level-1 and Level-2 classification, demonstrating strong potential for downstream policy analysis and NLP applications. By linking political science theory with practical NLP methods, CAPC-CG supports diachronic, cross-regime, and empirical studies of adaptive policy communication and the political economy of policy implementation.
Abstract
We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang's theory of adaptive policy communication. Spanning 1949–2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China's top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss's kappa of $\kappa=0.86$ on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
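For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Fleiss's kappa can be computed from paragraph-level labels. It is an illustration under stated assumptions, not the procedure used to obtain the reported value: the function name `fleiss_kappa`, the placeholder category names `"A"`, `"B"`, `"C"`, and the toy ratings are all hypothetical and are not drawn from the corpus or its codebook.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Compute Fleiss's kappa.

    ratings    -- list of items; each item is a list with one label per rater
                  (every item must have the same number of raters).
    categories -- list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who assigned category j to item i
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Per-item observed agreement, then its mean across items
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement by chance, from the marginal category proportions
    total = n_items * n_raters
    p_cats = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_exp = sum(p * p for p in p_cats)

    if p_exp == 1.0:
        # All ratings fall in a single category; kappa is undefined here.
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical toy data: 4 paragraphs, 3 coders, placeholder labels.
toy_ratings = [
    ["A", "A", "A"],
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["C", "C", "C"],
]
toy_kappa = fleiss_kappa(toy_ratings, ["A", "B", "C"])
print(round(toy_kappa, 3))
```

Values near 1 indicate agreement well above what the marginal label frequencies would produce by chance, which is the sense in which $\kappa=0.86$ is read as high reliability.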
