Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity

Shanghaoran Quan

TL;DR

AugCon is a scalable pipeline for automatically constructing multi-granularity, context-driven SFT data for LLMs. It combines three components: a Context-Split-Tree that covers varying levels of granularity, a scorer trained with contrastive learning to rank and filter queries, and a principle-guided self-alignment step paired with a self-improving loop to yield high-fidelity responses. Auxiliary materials are discarded so that only clean query-response pairs remain. Empirical results on a Chinese DailyM scenario and four English benchmarks show consistent improvements over strong baselines in both data quality and downstream fine-tuning performance, with open-source code and data to support reproducibility. By enabling high-diversity, high-fidelity SFT data generation at scale, this work advances domain-specific LLM customization and mitigates the annotation costs and data homogeneity that affect prior methods.

Abstract

Constructing high-quality query-response pairs from a custom corpus is crucial for supervised fine-tuning (SFT) of large language models (LLMs) in many applications, such as creating domain-specific AI assistants or roleplaying agents. However, sourcing this data through human annotation is costly, and existing automated methods often fail to capture the diverse range of contextual granularity and tend to produce homogeneous data. To tackle these issues, we introduce a novel method named AugCon, capable of automatically generating context-driven SFT data across multiple levels of granularity with high diversity, quality, and fidelity. AugCon begins by generating queries using the Context-Split-Tree (CST), an innovative approach that recursively derives queries and splits the context to cover the full range of granularity. Then, we train a scorer through contrastive learning that works together with CST to rank and refine queries. Finally, a synergistic integration of self-alignment and self-improving is introduced to obtain high-fidelity responses. Extensive experiments incorporating both human and automatic evaluations are conducted on a test scenario and four widely used benchmarks in English and Chinese. The results highlight the significant advantages of AugCon over several state-of-the-art methods in producing SFT data of high diversity, quality, and fidelity. Our code, dataset, and fine-tuned model will be available at: https://github.com/quanshr/AugCon.
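
The recursive structure of the Context-Split-Tree can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that CST recursively derives queries and splits the context. Here the query generator `placeholder_query` is a hypothetical stand-in for an LLM call, the split point (the sentence boundary nearest the middle) and the stopping rule (`min_sentences`) are assumptions made purely for illustration, and all names are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CSTNode:
    """One tree node: a span of context and the query derived from it."""
    context: str
    query: str
    children: List["CSTNode"] = field(default_factory=list)


def sentences_of(text: str) -> List[str]:
    """Naive sentence splitter on '.', used only for this illustration."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def placeholder_query(context: str) -> str:
    """Hypothetical stand-in for the LLM call that derives a query."""
    first = (sentences_of(context) or [context])[0]
    return f"What is described in the passage starting with '{first[:30]}'?"


def build_cst(
    context: str,
    derive_query: Callable[[str], str] = placeholder_query,
    min_sentences: int = 2,  # assumed stopping rule, not taken from the paper
) -> CSTNode:
    """Derive a query for this context, then recurse on its two halves."""
    node = CSTNode(context=context, query=derive_query(context))
    sents = sentences_of(context)
    if len(sents) < min_sentences:
        return node  # leaf: the context is too short to split further
    mid = len(sents) // 2
    for part in (sents[:mid], sents[mid:]):
        node.children.append(
            build_cst(" ".join(part), derive_query, min_sentences)
        )
    return node


def collect_pairs(node: CSTNode) -> List[Tuple[str, str]]:
    """Flatten the tree into (context, query) pairs, coarse to fine."""
    pairs = [(node.context, node.query)]
    for child in node.children:
        pairs.extend(collect_pairs(child))
    return pairs


root = build_cst("A one. B two. C three. D four.")
pairs = collect_pairs(root)
```

Under these assumptions, a four-sentence context yields one query for the whole passage, two for its halves, and four for the single sentences, so the collected pairs span coarse to fine granularity. In a fuller pipeline, these candidate queries would then be ranked and filtered by the trained scorer.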

Paper Structure (55 sections, 1 equation, 6 figures, 11 tables, 2 algorithms)

Introduction
Our Method: AugCon
Preliminary
Recursively Deriving Queries via Context-Split-Tree
Training Scorer to Rank Queries and Filtering
Obtaining High-Fidelity Responses
Evaluations
Baselines
Human Evaluation
Metrics
Results
Automatic Evaluation
Benchmarks
Metrics
Results
...and 40 more sections

Figures (6)

Figure 1: An overview of the proposed AugCon.
Figure 2: The results of human evaluation on DailyM. Query metrics are not applicable for the base chat model and DAPT so we don't show them.
Figure 3: The schematic of the constructed CST in this case. Each node contains a context and a corresponding question, with the node size indicating different levels of granularity.
Figure 4: The results of GPT-4 judge on three levels of questions.
Figure 5: The training loss and human evaluation results during training phase.
...and 1 more figures
