DoG-Instruct: Towards Premium Instruction-Tuning Data via Text-Grounded Instruction Wrapping

Yongrui Chen; Haiyun Jiang; Xinting Huang; Shuming Shi; Guilin Qi

TL;DR

This paper addresses the challenge of obtaining high-quality instruction-response data by leveraging human-written documents as factual grounding and introducing a document-grounded instruction wrapper (DoG-Instruct). The authors design a two-stage pipeline: (i) build a meta-training set with alignment and diversity using GPT-4 to train an open-source wrapper, and (ii) apply the wrapper to a broad, multi-domain document corpus to generate DoG-Instruct data with post-processing to filter noise. Empirical results on AlpacaEval and other benchmarks show that the wrapper-trained model achieves state-of-the-art or competitive performance with far less training data, and human evaluation confirms reduced hallucination and strong fluency. The method offers a scalable path to premium instruction-tuning data and highlights the value of grounding model outputs in real documents while controlling style and content through learned transformation.
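The second stage of the pipeline (apply the wrapper to each document, then filter noisy outputs) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `Pair`, `toy_wrapper`, `is_valid`, `generate_pairs`, and the word-count threshold are all hypothetical names and rules introduced here, and the real wrapper is a trained LLM rather than a string template.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Pair:
    """One instruction-response example (hypothetical container)."""
    instruction: str
    response: str


# Assumed interface: a wrapper maps a document to a pair, or None if it declines.
Wrapper = Callable[[str], Optional[Pair]]


def is_valid(pair: Pair, min_response_words: int = 5) -> bool:
    """Placeholder noise filter. The rule here is an assumption, not the paper's."""
    if not pair.instruction.strip():
        return False
    return len(pair.response.split()) >= min_response_words


def generate_pairs(documents: Iterable[str], wrapper: Wrapper) -> List[Pair]:
    """Run the wrapper over every document and keep only pairs that pass the filter."""
    kept: List[Pair] = []
    for doc in documents:
        pair = wrapper(doc)
        if pair is not None and is_valid(pair):
            kept.append(pair)
    return kept


def toy_wrapper(doc: str) -> Optional[Pair]:
    """Stand-in for the trained wrapper: turns the first sentence into an instruction."""
    text = doc.strip()
    if not text:
        return None
    topic = text.split(".")[0]
    return Pair(instruction=f"Explain the following: {topic}.", response=text)


if __name__ == "__main__":
    docs = [
        "Photosynthesis converts light into chemical energy. It occurs in chloroplasts.",
        "",
        "Too short.",
    ]
    for pair in generate_pairs(docs, toy_wrapper):
        print(pair.instruction)
```

Keeping the wrapper behind a plain callable means the toy stand-in could be swapped for a call to a fine-tuned model without touching the filtering loop.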

Abstract

Improving the instruction-following capabilities of LLMs relies heavily on the availability of high-quality instruction-response pairs. Unfortunately, current methods for collecting such pairs suffer from either unaffordable labor costs or severe hallucinations when LLMs generate the pairs themselves. To tackle these challenges, this paper proposes a scalable solution: training LLMs to generate instruction-response pairs based on human-written documents, rather than relying solely on self-generation without context. The proposed method not only exploits the advantages of human-written documents in reducing hallucinations but also uses an LLM to wrap the expression of the documents, which bridges the gap between various document styles and the standard AI response. Experiments demonstrate that the method outperforms existing typical methods on multiple benchmarks. In particular, compared to the best-performing baseline, the LLM trained on the generated dataset exhibits a 10% relative improvement on AlpacaEval, despite using only 1/5 of the baseline's training data. Furthermore, a comprehensive manual evaluation validates the quality of the generated data. The trained wrapper is publicly available at https://github.com/Bahuia/Dog-Instruct.

Paper Structure (21 sections, 1 equation, 5 figures, 6 tables)


Introduction
Problem Formulation
Collection of DoG-Instruct Data
Corpus & Document Sampling
Instruction Wrapper Building
Data Generation via Instruction Wrapper
DoG-Instruct Statistics
Experiments
Experimental Setup
Automatic Evaluation
AlpacaEval Results
ELI5, LF-Test and Super-NI Results
Ablation Study
Human Evaluation
Data Quality
...and 6 more sections

Figures (5)

Figure 1: Differences between our proposed instruction wrapping and instruction back-translation [DBLP:journals/corr/abs-2304-08460, DBLP:journals/corr/abs-2308-06259]. Red text is not appropriate for responses. Blue text indicates text that the LLM has added, deleted, or rewritten to align more closely with the desired standardized response.
Figure 2: Overview of DoG-Instruct construction process. In stage a), a meta-training set $\Omega$ is constructed using GPT-4 and utilized to train the instruction wrapper. In stage b), the wrapper generates instruction-response pairs for each sampled document, and a post-processing strategy is employed to filter out invalid examples.
Figure 3: Instruction diversity of DoG-Instruct data. The inner circle shows common root verbs with the corresponding common noun objects in the outer circle.
Figure 4: GPT-4 automatic evaluation results on subsets of ELI5 (left), LF-Test (middle), and Super-NI (right). To account for the cost of GPT-4, each subset contains 200 examples randomly sampled from the original test sets. The win/tie/lose rates are computed by comparing the model responses with the given reference responses.
Figure 5: Human evaluation comparing DoG-Instruct with various text-grounded methods. The evaluation was carried out using the same set of human-written documents as input for all methods.
