DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation

Ming Wang, Fang Wang, Minghao Hu, Li He, Haiyang Wang, Jun Zhang, Tianwei Yan, Li Li, Zhunchen Luo, Wei Luo, Xiaoying Bai, Guotong Geng

TL;DR

This paper tackles LFAG challenges by introducing DeFine, a hierarchically decomposed, fine-grained annotated dataset crafted through a four-agent data-construction pipeline. The Data Miner, Cite Retriever, Q&A Annotator, and Data Cleaner generate structured outlines, reference abstract sets, QA data, and quality controls, enabling granular generation control. The authors validate DeFine by fine-tuning Qwen2-7b-Instruct (as Qwen2-7b-Scribe) and testing three baselines (Web Retrieval, Local Retrieval, Grounded Reference), reporting improvements in topic coverage, depth, and citation reliability. Overall, the results suggest that task decomposition and retrieval-enhanced generation can improve coherence and factuality in LFAG; the authors acknowledge limitations such as language balance and evaluation scope.

Abstract

Long-form article generation (LFAG) presents challenges such as maintaining logical consistency, comprehensive topic coverage, and narrative coherence across extended articles. Existing datasets often lack both the hierarchical structure and fine-grained annotation needed to effectively decompose tasks, resulting in shallow, disorganized article generation. To address these limitations, we introduce DeFine, a Decomposed and Fine-grained annotated dataset for long-form article generation. DeFine is characterized by its hierarchical decomposition strategy and the integration of domain-specific knowledge with multi-level annotations, ensuring granular control and enhanced depth in article generation. To construct the dataset, a multi-agent collaborative pipeline is proposed, which systematically segments the generation process into four parts: Data Miner, Cite Retriever, Q&A Annotator, and Data Cleaner. To validate the effectiveness of DeFine, we designed and tested three LFAG baselines: the web retrieval, the local retrieval, and the grounded reference. We fine-tuned the Qwen2-7b-Instruct model using the DeFine training dataset. The experimental results showed significant improvements in text quality, specifically in topic coverage, depth of information, and content fidelity. Our dataset is publicly available to facilitate future research.

Paper Structure

This paper contains 53 sections, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: DeFine dataset example. In our dataset, there are three distinct types of data: outline data, abstract set data, and question-answer data. These are utilized to train the model for different functions: generating outlines based on a given article topic, extracting summaries from retrieved references, and generating long-form text based on the extracted summaries and outlines.
  • Figure 2: Overview of the DeFine dataset construction. We utilize four specialized agents to construct the dataset. First, the Data Miner extracts hierarchical outline data from articles. Next, the Cite Retriever summarizes reference content into abstract sets. The Q&A Annotator generates question-answer pairs and applies hallucination detection. Finally, the Data Cleaner ensures the dataset’s quality through rigorous cleaning, focusing on richness, relevance, and coverage.
  • Figure 3: Wikipedia pages distribution across different types.
  • Figure 4: Overview of the three baselines. Each baseline performs distinct tasks: the Web Retrieval baseline integrates real-time web retrieval with a generation model to provide up-to-date information; the Local Retrieval baseline retrieves references from a pre-built knowledge base and generates articles based on these references; the Grounded Reference baseline directly uses original article abstracts to generate content. Notably, the relationship extraction model in all baselines uses the trained BGE-M3 model, and both the generation model and the relationship extraction model are interchangeable, allowing for flexible adaptation to different research needs.
  • Figure 5: This figure illustrates the access distribution of QA data topics in the dataset. The categories ending with ch represent Chinese data, while all other categories are in English. The percentages indicate the proportion of accesses from each topic, highlighting the diversity in the dataset.
  • ...and 1 more figure