Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud
Yuanhao Yue, Chengyu Wang, Jun Huang, Peng Wang
TL;DR
This work tackles the data-intensive challenge of domain-specific LLM fine-tuning by introducing a family of data augmentation models that are built on small LLMs to keep inference costs low. The models cover three functions: instruction expansion, instruction refinement, and instruction-response pair expansion. The work builds an automatic data collection system that combines public and in-house seed data, and integrates the augmentation workflow into a cloud-native platform to enable practical, low-cost fine-tuning. In experiments on reasoning and instruction tasks and in an application study on prompt refinement for chatbots, the approach improves data diversity and model performance while reducing resource requirements. The work offers a pragmatic path to democratizing LLM customization by lowering data and compute barriers, and it discusses biases and ethical safeguards.
Abstract
Specializing LLMs in various domain-specific tasks has emerged as a critical step towards achieving high performance. However, constructing and annotating datasets in specific domains is typically very costly. While superior but expensive closed-source LLM APIs can be used to construct datasets, some open-source models have become strong enough to handle dataset construction in many scenarios. We therefore present a family of data augmentation models designed to significantly improve the efficiency of model fine-tuning. These models, trained from sufficiently small LLMs, support three key functionalities at low inference cost: instruction expansion, instruction refinement, and instruction-response pair expansion. To this end, we first construct an automatic data collection system with seed datasets drawn from both public repositories and our in-house datasets. This system leverages powerful LLMs to expand, refine, and rewrite the instructions and responses, and incorporates quality assessment techniques. We then describe the training process of our models, which distills task-solving and text synthesis abilities from teacher LLMs. Finally, we show how we integrate these functionalities into a machine learning platform to support low-cost LLM fine-tuning for users, from both the dataset preparation and the training perspectives. Experiments and an application study demonstrate the effectiveness of our approach.
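To make the three functionalities concrete, the following is a minimal Python sketch of how instruction refinement and instruction expansion might be chained over a seed dataset. It is an illustration under stated assumptions, not the authors' implementation: the prompt templates, the `build_prompt` and `augment` helpers, and the `generate` callable (a stand-in for any small augmentation model) are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates; the actual prompts used by the authors
# are not given in this section.
TEMPLATES: Dict[str, str] = {
    "refine_instruction": (
        "Rewrite this instruction to be clearer and more detailed:\n{instruction}"
    ),
    "expand_instruction": (
        "Write {n} new instructions of the same task type, one per line:\n{instruction}"
    ),
    "expand_pair": (
        "Write {n} new instruction-response pairs in the style of this example.\n"
        "Instruction: {instruction}\nResponse: {response}"
    ),
}

# Any function that maps a prompt string to generated text (assumed interface).
Generator = Callable[[str], str]


def build_prompt(task: str, **fields) -> str:
    """Fill the template for one of the three augmentation tasks."""
    return TEMPLATES[task].format(**fields)


def augment(seeds: List[dict], generate: Generator, n: int = 2) -> List[dict]:
    """Refine each seed instruction, then expand it into up to n new ones."""
    results = []
    for seed in seeds:
        refined = generate(
            build_prompt("refine_instruction", instruction=seed["instruction"])
        ).strip()
        raw = generate(build_prompt("expand_instruction", instruction=refined, n=n))
        # One generated instruction per non-empty line; keep at most n of them.
        expanded = [line.strip() for line in raw.splitlines() if line.strip()]
        results.append(
            {"seed": seed["instruction"], "refined": refined, "expanded": expanded[:n]}
        )
    return results
```

In practice `generate` would wrap a call to a small fine-tuned model. Keeping it as a plain callable lets the same loop run against a stub during testing or against a teacher LLM during data collection.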
