Language Models as Continuous Self-Evolving Data Engineers

Peidong Wang, Ming Wang, Zhiming Ma, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song

TL;DR

Problem: data quality and availability limit post-training of LLMs. Approach: LANCE enables LLMs to autonomously generate, review, and annotate data with preference information, forming a full post-training data lifecycle without external models. Findings: LANCE achieves average improvements of 3.64 on Qwen2-7B and 1.75 on Qwen2-7B-Instruct across benchmarks, with strong gains in mathematical reasoning and cross-lingual transfer. Significance: reduces reliance on human-labeled data, lowers cost, and demonstrates a step toward continuous self-evolution and potential superintelligence.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks, but their further evolution is limited by the lack of high-quality training data. In addition, traditional training approaches rely too heavily on expert-labeled data, which sets a ceiling on the performance of LLMs. To address this issue, we propose a novel paradigm named LANCE (LANguage models as Continuous self-Evolving data engineers) that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data with preference information. Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of post-training data construction. Through iterative fine-tuning on Qwen2 series models, we validate the effectiveness of LANCE across various tasks, showing that it can maintain high-quality data generation and continuously improve model performance. Across multiple benchmark dimensions, LANCE yields an average score improvement of 3.64 for Qwen2-7B and 1.75 for Qwen2-7B-Instruct. This training paradigm with autonomous data construction not only reduces the reliance on human experts or external models but also ensures that the data aligns with human preferences, paving the way for the development of future superintelligent systems that can exceed human capabilities. Code is available at: https://github.com/Control-derek/LANCE.

Paper Structure

This paper contains 30 sections, 6 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: An illustration of our methodology. Traditional ML focuses on the setting where humans supervise models that are weaker than humans. Our methodology explores the scenario where models self-supervise, which may be a reliable path to superintelligence.
  • Figure 2: Overview of LANCE. The cycle begins at $t=0$ with a pre-annotated seed dataset $Seed_{0}$. At each time step $t$, model $M_t$ generates new instruction and preference data from $Seed_{t}$ via the full post-training data construction cycle. $M_t$ is fine-tuned on instruction data (NLL) to create $M_t^S$, which is then fine-tuned on preference data (PLR) to produce $M_t^D$. In the next iteration, $M_t^D$ becomes $M_{t+1}$, and new samples are merged into $Seed_{t}$ to form $Seed_{t+1}$.
  • Figure 3: Average scores across benchmarks for various self-evolution methods. The Self-Instruct method, which has no iterative process, sampled 50k examples for self-training. "Iter $t$" denotes the $t$-th iteration.
  • Figure 4: Visualization of the distribution of seed data and synthetic data generated by LANCE.
  • Figure 5: An example of SFT data generation based on seed data using LANCE.
  • ...and 1 more figure
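
The iteration described in the Figure 2 caption can be sketched as a short loop. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `Model`, `run_lance_loop`, `construct_data`, `sft`, `preference_tune`, and the toy stubs are all hypothetical names, the fine-tuning steps are placeholders that merely record what happened, and merging only the new instruction samples into the seed set is an assumption about what "new samples" means.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical data shapes, chosen only for illustration.
Instruction = Tuple[str, str]        # (prompt, response)
Preference = Tuple[str, str, str]    # (prompt, chosen, rejected)


@dataclass
class Model:
    """Stand-in for a checkpoint; it only tracks its training history."""
    name: str
    history: List[str] = field(default_factory=list)


def run_lance_loop(
    model: Model,
    seed: List[Instruction],
    construct_data: Callable[
        [Model, List[Instruction]],
        Tuple[List[Instruction], List[Preference]],
    ],
    sft: Callable[[Model, List[Instruction]], Model],
    preference_tune: Callable[[Model, List[Preference]], Model],
    iterations: int,
) -> Tuple[Model, List[Instruction]]:
    """Run the cycle from the caption: M_t -> M_t^S -> M_t^D -> M_{t+1}."""
    for _ in range(iterations):
        # M_t generates new instruction and preference data from Seed_t.
        new_instructions, new_preferences = construct_data(model, seed)
        # Fine-tune on instruction data to obtain M_t^S.
        model_s = sft(model, new_instructions)
        # Fine-tune M_t^S on preference data to obtain M_t^D.
        model_d = preference_tune(model_s, new_preferences)
        # Assumption: new instruction samples are merged to form Seed_{t+1}.
        seed = seed + new_instructions
        # M_t^D becomes M_{t+1}.
        model = model_d
    return model, seed


# Toy stubs so the sketch runs end to end; real versions would call an LLM.
def toy_construct(model: Model, seed: List[Instruction]):
    prompt = f"task-{len(seed)}"
    return [(prompt, "answer")], [(prompt, "good answer", "bad answer")]


def toy_sft(model: Model, data: List[Instruction]) -> Model:
    return Model(model.name, model.history + ["sft"])


def toy_preference_tune(model: Model, data: List[Preference]) -> Model:
    return Model(model.name, model.history + ["pref"])


final_model, final_seed = run_lance_loop(
    Model("M_0"),
    [("task-seed", "answer")],
    toy_construct,
    toy_sft,
    toy_preference_tune,
    iterations=2,
)
```

Passing the three stages in as callables keeps the loop structure separate from how data generation and training are actually carried out, which the caption does not specify.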