Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction

Yuxin Jiang, Yufei Wang, Chuhan Wu, Xinyi Dai, Yan Xu, Weinan Gan, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang

TL;DR

Web Reconstruction (WebR) is a fully automated framework that synthesizes high-quality instruction-tuning (IT) data from raw web documents using a dual-perspective paradigm: Web as Instruction and Web as Response. It yields two datasets, WebR-Basic and WebR-Pro; LLMs fine-tuned on them outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks, and the method shows strong compatibility, data efficiency, scalability, and domain adaptability. The approach makes minimal assumptions about web content, uses a persona-driven instruction synthesis strategy, and applies a two-branch reconstruction process to increase diversity and quality without manual labeling. The code and data are publicly available, offering a practical, cost-effective path to scalable IT data generation and to further research on domain-specific adaptation and alignment methods.

Abstract

Improving the instruction-following capabilities of LLMs depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthesis methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm, Web as Instruction and Web as Response, in which each web document is designated as either an instruction or a response to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort. The data and code are publicly available at https://github.com/YJiangcm/WebR.
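The dual-perspective idea in the abstract, where each web document is designated as either an instruction or a response, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the WebR code base: the names `ReconstructionTask` and `assign_roles`, the two prompt templates, the `instruction_ratio` parameter, and the random assignment rule are all hypothetical. The sketch only builds the prompts that would be handed to an off-the-shelf LLM; it does not call one.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class ReconstructionTask:
    role: str    # "instruction" or "response" (the designated perspective)
    prompt: str  # text that would be sent to an off-the-shelf LLM


# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
AS_INSTRUCTION = (
    "Treat the web document below as part of an instruction. "
    "Write a task that operates on it, then carry out the task.\n\n{doc}"
)
AS_RESPONSE = (
    "Treat the web document below as a draft response. "
    "Write the instruction it could answer, then refine the response.\n\n{doc}"
)


def assign_roles(
    docs: List[str], instruction_ratio: float = 0.5, seed: int = 0
) -> List[ReconstructionTask]:
    """Designate each document as an instruction or a response.

    The random split and its default ratio are assumptions of this sketch.
    """
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    tasks = []
    for doc in docs:
        if rng.random() < instruction_ratio:
            tasks.append(ReconstructionTask("instruction", AS_INSTRUCTION.format(doc=doc)))
        else:
            tasks.append(ReconstructionTask("response", AS_RESPONSE.format(doc=doc)))
    return tasks


if __name__ == "__main__":
    sample_docs = ["How tides form along a coastline.", "A recipe for sourdough bread."]
    for task in assign_roles(sample_docs):
        print(task.role, "|", task.prompt.splitlines()[-1])
```

Keeping the role on each task lets a later step route the LLM output through a role-specific post-processing branch; how WebR actually does this is described in the paper and repository, not in this sketch.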

Paper Structure

This paper contains 32 sections, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: Our proposed Web Reconstruction method surpasses previous techniques by being (1) fully automated, eliminating the need for manual intervention or seed data; (2) minimally reliant on assumptions about the structure and content of web documents; and (3) capable of generating high-quality IT data.
  • Figure 2: Overview of the proposed Web Reconstruction (WebR) framework. Leveraging an off-the-shelf LLM, WebR transforms raw web documents into high-quality instruction-response pairs. It strategically assigns each document as either an instruction or a response to trigger the process of web reconstruction.
  • Figure 3: Statistics of instruction quality and difficulty.
  • Figure 4: The impact of training data scale on the average instruction-following performance.
  • Figure 5: Lengths of instructions and responses in WebR-Basic and WebR-Pro.
  • ...and 6 more figures