Leveraging Web-Crawled Data for High-Quality Fine-Tuning

Jing Zhou; Chenglin Jiang; Wei Shen; Xiao Zhou; Xiaonan He

TL;DR

This work tackles the data bottleneck in domain-specific fine-tuning by turning noisy web-crawled math problems into high-quality training data through a model-based rewriting pipeline aligned with a high-quality seed dataset. The authors construct paired <low-quality, high-quality> examples and train a dedicated language model to rewrite crawled content. Fine-tuning on the cleaned web data together with the seed data beats training on the seed data alone by an average of 9.4% on Chinese elementary-math benchmarks. A 7B-parameter model trained this way outperforms several open-source models larger than 32B and surpasses GPT-3.5 in the reported experiments, underscoring the method’s data efficiency. The study also positions the method within a RAG-like training paradigm and suggests it could carry over to other domains that have abundant web data and a modest seed dataset, without relying on advanced LLMs such as GPT-4.

Abstract

Most large language models are fine-tuned using either expensive human-annotated data or GPT-4-generated data, neither of which guarantees performance in certain domains. We argue that although web-crawled data often contains formatting errors that cause semantic inaccuracies, it can still serve as a valuable source for high-quality supervised fine-tuning in specific domains without relying on advanced models like GPT-4. To this end, we automatically create a paired training dataset by aligning web-crawled data with a smaller set of high-quality data. A language model trained on this dataset can then convert irregularly formatted web data into high-quality data. Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average of 9.4% on Chinese math problems. Additionally, our 7B model outperforms several open-source models larger than 32B and surpasses well-known closed-source models such as GPT-3.5, highlighting the efficacy of our approach.
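The alignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the matching criteria, so everything here is an assumption made for illustration: the sample keys `question` and `answer`, the requirement that answers agree, the character-level similarity measure, and the `threshold` value are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def build_pairs(web_samples, seed_samples, threshold=0.6):
    """Pair each web-crawled sample with its closest high-quality sample.

    Each sample is a dict with the hypothetical keys "question" and "answer".
    A pair is kept only if the answers agree and the question texts are
    similar enough; both criteria are assumptions of this sketch.
    """
    pairs = []
    for web in web_samples:
        best, best_score = None, 0.0
        for seed in seed_samples:
            # Skip candidates whose final answer differs.
            if web["answer"].strip() != seed["answer"].strip():
                continue
            score = similarity(web["question"], seed["question"])
            if score > best_score:
                best, best_score = seed, score
        if best is not None and best_score >= threshold:
            pairs.append({"low_quality": web, "high_quality": best})
    return pairs


if __name__ == "__main__":
    web = [
        {"question": "Compute 32 + 42.", "answer": "25"},
        {"question": "A train travels 60 km in 2 hours. Find its speed.", "answer": "30"},
    ]
    seed = [{"question": "Compute 3^2 + 4^2.", "answer": "25"}]
    for pair in build_pairs(web, seed):
        print(pair["low_quality"]["question"], "->", pair["high_quality"]["question"])
```

Pairs produced this way would serve as <low-quality, high-quality> training examples for the rewriting model; web samples that find no match are simply left unpaired.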

Paper Structure (38 sections, 1 equation, 4 figures, 10 tables)

Introduction
Related Work
Large Language Models for Mathematical Reasoning
Is GPT4 Generated Data Enough?
Methods for Generating Synthetic Data
Methods
Settings
Training Data Sets.
A Close Look at Web-Crawled Data
Misleading Caused by Formatting Issues.
The Drawbacks of Rule-Based Methods
Feasibility of Model-based Methods
A Simple and Effective Method for Data Cleaning
Experiments
Experimental Setup
...and 23 more sections

Figures (4)

Figure 1: An example of web-crawled data. The positional information of superscripts "2" is lost, thus leading to incorrect mathematical expressions.
Figure 2: An example of a web-crawled sample with "local errors" and "global errors". The "local errors" are denoted in blue, and the "global errors" are in red.
Figure 3: An illustration of our proposed data-transformation architecture. The answer coloured in green is matched, resulting in a <web-crawled, high-quality> data pair. The text in red is originally wrong and needs to be corrected. We then use the paired data as prompts to train a re-generation language model that converts web-crawled data into high-quality data. Finally, we train a Math LLM using both the high-quality data and the cleaned web-crawled data.
Figure 4: Comparison between rule-based and model-based methods on Ape210K as the training data grows. The left panel shows the results on ChatGLM and the right panel shows the results on Qwen. The horizontal axis represents the amount of SFT data, and the vertical axis represents the accuracy on Ape210K.
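The kind of error described in the Figure 1 caption can be reproduced with a small, hypothetical example. The sketch below is not the paper's data or tooling: `lossy_extract` merely simulates an extractor that flattens superscript digits onto the baseline, and `evaluate_sum` is an illustrative helper that evaluates sums of terms with an optional superscript exponent.

```python
import re

SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
TO_PLAIN = str.maketrans(SUPERSCRIPTS, "0123456789")
TERM = re.compile(r"\s*([0-9]+)([⁰¹²³⁴⁵⁶⁷⁸⁹]*)\s*")


def lossy_extract(text: str) -> str:
    """Simulate an extractor that drops superscript positioning."""
    return text.translate(TO_PLAIN)


def evaluate_sum(text: str) -> int:
    """Evaluate a sum of terms such as '3²' (base with optional exponent)."""
    total = 0
    for term in text.split("+"):
        match = TERM.fullmatch(term)
        if match is None:
            raise ValueError(f"unsupported term: {term!r}")
        base = int(match.group(1))
        exponent = match.group(2).translate(TO_PLAIN)
        total += base ** int(exponent) if exponent else base
    return total


if __name__ == "__main__":
    original = "3² + 4²"
    crawled = lossy_extract(original)
    print(original, "=", evaluate_sum(original))
    print(crawled, "=", evaluate_sum(crawled))
```

Once the superscript position is gone, the flattened string is still a well-formed expression, so nothing in the text itself signals that it has become a different problem.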
