D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation

Weibo Zhou, Lingbo Li, Shangsong Liang

TL;DR

D-SCoRE introduces a training-free, end-to-end pipeline that generates reasoning-rich QA-CoT data from arbitrary texts through document-centric segmentation, explicit/implicit question design, and counterfactual augmentation. By coupling multi-stage quality control with reasoning-centric supervision, it achieves high data quality and diversity while enabling domain adaptation on consumer hardware, outperforming models trained on human-annotated data in many settings. The framework demonstrates strong data efficiency, with substantial gains from implicit reasoning content and robust benefits from heterogeneous quality control across model scales and domains. These findings suggest a scalable path to domain-specific QA SFT, reducing annotation costs while improving multi-step reasoning capabilities in LLMs.

Abstract

The scarcity and high cost of high-quality domain-specific question-answering (QA) datasets limit supervised fine-tuning of large language models (LLMs). We introduce D-SCoRE, a training-free framework that leverages LLMs and prompt engineering to automatically generate diverse, rich QA datasets with Chain-of-Thought (CoT) from arbitrary textual sources. By integrating Document-centric processing, Segmentation, CoT Reasoning, and structured Export, along with multi-dimensional controls such as semantic role transformation, question type balancing, and counterfactual augmentation, D-SCoRE produces tailored QA pairs with enhanced diversity and relevance. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. Its efficiency and scalability enable rapid, high-performance domain-adaptive fine-tuning on consumer-grade hardware, generating over 1,100 high-quality QA pairs per GPU-hour end-to-end.

Paper Structure

This paper contains 41 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of D-SCoRE framework: an LLM-driven three-stage process (QA generation, quality control, counterfactual augmentation) with multi-dimensional control over question types and reasoning depth, plus optional pre-/post-processing components.
  • Figure 2: Experimental setup for evaluating D-SCoRE-generated QA data via SFT and downstream performance comparison.
  • Figure 3: F1 and EM performance of Qwen3-4B and Qwen3-8B models fine-tuned on D-SCoRE data with varying implicit-to-explicit ratios (0%, 20%, ..., 100% implicit) across datasets. Blue/orange lines denote EM/F1 scores; dashed horizontal lines indicate the gold (human-annotated) and nosft (no fine-tuning) baselines.
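
The EM and F1 scores reported in Figure 3 are standard extractive-QA metrics. As a rough illustration only, the sketch below computes them in the common SQuAD style: answers are lowercased, stripped of punctuation and English articles, and compared token by token. The exact normalization used by the authors is not stated in this summary, so the `normalize`, `exact_match`, and `token_f1` helpers are assumptions rather than the paper's evaluation code.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the paper may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only answers that match after normalization, while F1 gives partial credit for overlapping tokens, which is why the two curves in a plot like Figure 3 can diverge when a model produces longer or slightly reworded answers.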