D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation
Weibo Zhou, Lingbo Li, Shangsong Liang
TL;DR
D-SCoRE is a training-free, end-to-end pipeline that generates reasoning-rich QA-CoT data from arbitrary texts through document-centric segmentation, explicit and implicit question design, and counterfactual augmentation. By coupling multi-stage quality control with reasoning-centric supervision, it produces high-quality, diverse data and enables domain adaptation on consumer hardware; models fine-tuned on its output outperform models trained on human-annotated data in most evaluated domains. The framework is data-efficient, with substantial gains from implicit reasoning content and robust benefits from heterogeneous quality control across model scales and domains. These findings suggest a scalable path to domain-specific supervised fine-tuning (SFT) for QA that reduces annotation costs while improving the multi-step reasoning capabilities of LLMs.
Abstract
The scarcity and high cost of high-quality domain-specific question-answering (QA) datasets limit supervised fine-tuning of large language models (LLMs). We introduce $\textbf{D-SCoRE}$, a training-free framework that leverages LLMs and prompt engineering to automatically generate diverse QA datasets with Chain-of-Thought (CoT) reasoning from arbitrary textual sources. By integrating $\textbf{D}$ocument-centric processing, $\textbf{S}$egmentation, $\textbf{Co}$T $\textbf{R}$easoning, and structured $\textbf{E}$xport, along with multi-dimensional controls such as semantic role transformation, question type balancing, and counterfactual augmentation, D-SCoRE produces tailored QA pairs with enhanced diversity and relevance. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. The pipeline generates over 1,100 high-quality QA pairs per GPU-hour end-to-end, and this efficiency and scalability enable rapid, high-performance domain-adaptive fine-tuning on consumer-grade hardware.
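To make the segmentation and structured-export stages more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a greedy sentence-packing segmenter and a JSON Lines record layout; the function names (`segment`, `export_record`), the `max_chars` parameter, and the field names are all hypothetical, and the LLM prompting step that would produce the question, reasoning, and answer is left out.

```python
import json
import re


def segment(text, max_chars=200):
    """Greedily pack whole sentences into segments of at most max_chars.

    Hypothetical stand-in for document-centric segmentation; a single
    sentence longer than max_chars is kept whole rather than split.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current segment.
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


def export_record(context, question, reasoning, answer, question_type):
    """Serialize one QA-CoT example as a single JSON line.

    The field names are assumptions chosen for illustration only.
    """
    record = {
        "context": context,
        "question": question,
        "question_type": question_type,  # e.g. "explicit" or "implicit"
        "cot": reasoning,
        "answer": answer,
    }
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    doc = "Water boils at 100 C at sea level. At altitude it boils sooner."
    for seg in segment(doc, max_chars=40):
        # In a real pipeline an LLM would fill these fields from the segment.
        print(export_record(seg, "<question>", "<reasoning>", "<answer>", "explicit"))
```

Keeping each example on one JSON line is a common choice for supervised fine-tuning corpora, since records can then be streamed, filtered, and deduplicated independently during quality control.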
