
Leveraging Web-Crawled Data for High-Quality Fine-Tuning

Jing Zhou, Chenglin Jiang, Wei Shen, Xiao Zhou, Xiaonan He

TL;DR

This work tackles the data bottleneck in domain-specific fine-tuning by turning noisy web-crawled math problems into high-quality training data through a model-based rewriting pipeline aligned with a small, high-quality seed dataset. By constructing paired <low-quality, high-quality> examples and training a dedicated language model to rewrite crawled content, the authors show that fine-tuning on cleaned web data plus seed data improves results on Chinese elementary-math benchmarks by an average of 9.4% over training on the high-quality seed data alone. A 7B-parameter model trained with this approach outperforms several open-source models larger than 32B and surpasses GPT-3.5 in the reported experiments, underscoring the method's data efficiency and practical impact. The study also positions the method within a RAG-like training paradigm, suggesting that abundant web data combined with a modest seed dataset could support high-quality supervised fine-tuning in other domains without relying on advanced LLMs like GPT-4.

Abstract

Most large language models are fine-tuned using either expensive human-annotated data or GPT-4-generated data, which cannot guarantee performance in certain domains. We argue that although web-crawled data often has formatting errors that cause semantic inaccuracies, it can still serve as a valuable source for high-quality supervised fine-tuning in specific domains without relying on advanced models like GPT-4. To this end, we automatically create a paired training dataset by aligning web-crawled data with a smaller set of high-quality data. By training a language model on this dataset, we can convert irregularly formatted web data into high-quality data. Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average score of 9.4% on Chinese math problems. Additionally, our 7B model outperforms several open-source models larger than 32B and surpasses well-known closed-source models such as GPT-3.5, highlighting the efficacy of our approach.

Paper Structure

This paper contains 38 sections, 1 equation, 4 figures, and 10 tables.

Figures (4)

  • Figure 1: An example of web-crawled data. The positional information of the superscript "2" is lost, leading to incorrect mathematical expressions.
  • Figure 2: An example of a web-crawled sample with "local errors" and "global errors". The "local errors" are denoted in blue, and the "global errors" are in red.
  • Figure 3: An illustration of our proposed data-transformation architecture. The answer coloured in green is matched, resulting in a <web-crawled, high-quality> data pair. The text in red is originally wrong and needs to be corrected. We then format the paired data as prompts to train a re-generation language model that converts web-crawled data into high-quality data. Finally, we train a Math LLM using both the high-quality data and the cleaned web-crawled data.
  • Figure 4: Comparison between the rule-based and model-based methods on Ape210K as the training data grows. The left panel shows the results on ChatGLM and the right panel shows the results on Qwen. The horizontal axis represents the amount of SFT data, and the vertical axis represents the accuracy on Ape210K.
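
The pairing step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the field names (`question`, `answer`), the helper names, the prompt wording, and the `min_sim` threshold are all hypothetical, and the extra question-similarity check with `difflib.SequenceMatcher` is an assumption added here because answer matching alone would pair unrelated problems that happen to share an answer.

```python
from collections import defaultdict
from difflib import SequenceMatcher
import re


def normalize_answer(answer: str) -> str:
    """Canonicalise an answer so trivially different spellings compare equal."""
    answer = re.sub(r"\s+", "", answer.lower())  # drop all whitespace
    return answer.rstrip(".。")                   # drop a trailing full stop


def build_pairs(web_samples, seed_samples, min_sim=0.5):
    """Pair each web-crawled sample with the most similar seed sample
    that has the same normalised answer (hypothetical matching rule)."""
    by_answer = defaultdict(list)
    for seed in seed_samples:
        by_answer[normalize_answer(seed["answer"])].append(seed)

    pairs = []
    for web in web_samples:
        best, best_sim = None, min_sim
        for seed in by_answer.get(normalize_answer(web["answer"]), []):
            sim = SequenceMatcher(None, web["question"], seed["question"]).ratio()
            if sim >= best_sim:
                best, best_sim = seed, sim
        if best is not None:
            pairs.append((web, best))
    return pairs


def to_training_example(web, seed):
    """Turn one <web-crawled, high-quality> pair into an input/target record
    for the re-generation model (the prompt wording is an assumption)."""
    prompt = (
        "Rewrite the following web-crawled math problem into a clean, "
        "well-formatted version.\n\n"
        f"Problem: {web['question']}\nAnswer: {web['answer'].strip()}"
    )
    target = f"Problem: {seed['question']}\nAnswer: {seed['answer']}"
    return {"input": prompt, "target": target}


# Toy data: the web sample has lost its superscript, as in Figure 1.
web_samples = [{"question": "x2 + 3 = 12, find x2", "answer": " 9 "}]
seed_samples = [
    {"question": "x^2 + 3 = 12, find x^2", "answer": "9"},
    {"question": "A farmer has 4 cows and buys 5 more. How many cows?", "answer": "9"},
]

pairs = build_pairs(web_samples, seed_samples)
examples = [to_training_example(web, seed) for web, seed in pairs]
```

Records of this shape could then be used to fine-tune the re-generation model, whose outputs on unpaired web data would be combined with the seed data for the final supervised fine-tuning run.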