Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud

Yuanhao Yue, Chengyu Wang, Jun Huang, Peng Wang

TL;DR

This work tackles the data-intensive challenge of domain-specific LLM fine-tuning by introducing a family of data augmentation models, covering instruction expansion, instruction refinement, and instruction-response pair expansion, that are built on small LLMs to keep inference costs low. It builds an automatic data collection system that combines public and in-house seed data, and integrates the augmentation workflow into a cloud-native platform to enable practical, low-cost fine-tuning. In targeted experiments on reasoning and instruction tasks and an application study on prompt refinement for chatbots, the approach improves data diversity and model performance while reducing resource requirements. The work offers a pragmatic path to democratizing LLM customization by lowering data and compute barriers, with explicit consideration of biases and ethical safeguards.

Abstract

Specializing LLMs for domain-specific tasks has emerged as a critical step towards achieving high performance. However, constructing and annotating datasets in specific domains is often very costly. Besides using superior but expensive closed-source LLM APIs to construct datasets, some open-source models have become strong enough to handle dataset construction in many scenarios. We therefore present a family of data augmentation models designed to significantly improve the efficiency of model fine-tuning. These models, built on sufficiently small LLMs, support three key functionalities at low inference cost: instruction expansion, instruction refinement, and instruction-response pair expansion. To this end, we first construct an automatic data collection system with seed datasets drawn from both public repositories and our in-house datasets. This system leverages powerful LLMs to expand, refine, and rewrite the instructions and responses, and incorporates quality assessment techniques. We then introduce the training process of our models, which effectively distills task-solving and text synthesis abilities from teacher LLMs. Finally, we demonstrate how we integrate these functionalities into a machine learning platform to support low-cost LLM fine-tuning for users, from both the dataset preparation and the training perspectives. Experiments and an application study demonstrate the effectiveness of our approach.
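To make the three functionalities named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the augmentation model is exposed as a plain text-in, text-out callable. The prompt templates, the function names (`expand_instruction`, `refine_instruction`, `expand_pair`, `keep`), the `Instruction:`/`Response:` output format, and the toy quality filter are all hypothetical and are not taken from the paper; `echo_model` is a canned stand-in for a real small LLM.

```python
from typing import Callable, Dict, List, Set

# Hypothetical prompt templates; the paper's actual prompts are not shown here.
EXPAND_PROMPT = "Write one new instruction for the same task as this one:\n{instruction}"
REFINE_PROMPT = "Rewrite this instruction to be clearer and more detailed:\n{instruction}"
PAIR_PROMPT = (
    "Write a new instruction and its response in the style of:\n"
    "Instruction: {instruction}\nResponse: {response}"
)

Generate = Callable[[str], str]  # assumed interface: prompt in, completion out


def expand_instruction(generate: Generate, instruction: str, n: int = 2) -> List[str]:
    """Instruction expansion: ask the model for n new instructions from one seed."""
    prompt = EXPAND_PROMPT.format(instruction=instruction)
    return [generate(prompt) for _ in range(n)]


def refine_instruction(generate: Generate, instruction: str) -> str:
    """Instruction refinement: ask the model to rewrite one instruction."""
    return generate(REFINE_PROMPT.format(instruction=instruction))


def expand_pair(generate: Generate, instruction: str, response: str) -> Dict[str, str]:
    """Instruction-response pair expansion from one seed pair.

    Assumes the model answers as 'Instruction: ...' then a line 'Response: ...'.
    """
    raw = generate(PAIR_PROMPT.format(instruction=instruction, response=response))
    head, _, tail = raw.partition("\nResponse:")
    return {
        "instruction": head.replace("Instruction:", "", 1).strip(),
        "response": tail.strip(),
    }


def keep(candidates: List[str], seen: Set[str], min_words: int = 3) -> List[str]:
    """Toy quality filter: drop very short or duplicate instructions."""
    kept = []
    for text in candidates:
        key = text.strip().lower()
        if len(key.split()) >= min_words and key not in seen:
            seen.add(key)
            kept.append(text.strip())
    return kept


def echo_model(prompt: str) -> str:
    """Canned stand-in for a small augmentation LLM (illustration only)."""
    if prompt.startswith("Write a new instruction and its response"):
        return "Instruction: Compute 7 + 5.\nResponse: 12"
    return "Compute 7 + 5 step by step."
```

In practice, `echo_model` would be replaced by a call to whichever small model is deployed, and the filter by a real quality assessment step; the sketch only shows how the three operations differ in their inputs and outputs.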

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: The data collection system.
  • Figure 2: A snapshot of the model card.
  • Figure 3: The win-lose-tie rates of Qwen2-7B-Instruct-Refine for the prompt refinement task, compared with the much larger model Qwen-max.
  • Figure 4: After projection to two dimensions with t-SNE, data generated by Qwen2-7B-Instruct-Response-Exp covers a broader range of regions in the embedding space than data generated by Self-Instruct.
  • Figure 5: Distribution of the model-expanded and human-written datasets in the embedding space on the Elementary Math dataset. Datasets augmented by our models overlap substantially with the seed dataset and, consequently, with most regions of the validation set. The data generated by Qwen2-7B-Instruct-Exp is slightly smoother and more uniform than that produced by Qwen2-1.5B-Instruct-Exp.