CoachLM: Automatic Instruction Revisions Improve the Data Quality in LLM Instruction Tuning

Yilun Liu; Shimin Tao; Xiaofeng Zhao; Ming Zhu; Wenbing Ma; Junhao Zhu; Chang Su; Yutai Hou; Miao Zhang; Min Zhang; Hongxia Ma; Li Zhang; Hao Yang; Yanfei Jiang

TL;DR

This paper proposes CoachLM, a novel approach that improves the quality of instruction datasets by automatically revising their samples rather than discarding them. CoachLM is trained on samples revised by human experts and significantly increases the proportion of high-quality samples in the dataset.

Abstract

Instruction tuning is crucial for enabling Large Language Models (LLMs) to respond to human instructions. The quality of the instruction pairs used for tuning greatly affects the performance of LLMs. However, manually creating high-quality instruction datasets is costly, which has made automatic generation of instruction pairs by LLMs a popular alternative. Several approaches have been proposed to ensure the quality of LLM-generated instruction datasets. Nevertheless, existing methods either compromise dataset integrity by filtering out a large proportion of samples, or are unsuitable for industrial applications. In this paper, instead of discarding low-quality samples, we propose CoachLM, a novel approach that enhances the quality of instruction datasets through automatic revisions of the samples in the dataset. CoachLM is trained on samples revised by human experts and significantly increases the proportion of high-quality samples in the dataset from 17.7% to 78.9%. The effectiveness of CoachLM is further assessed on various real-world instruction test sets. The results show that CoachLM improves the instruction-following capabilities of the instruction-tuned LLM by an average of 29.9%, enough for it to surpass larger LLMs with nearly twice the number of parameters. Furthermore, CoachLM has been successfully deployed in a data management system for LLMs at Huawei, resulting in an efficiency improvement of up to 20% in the cleaning of 40k real-world instruction pairs. We release various assets of CoachLM, including the training data, code and test set (https://github.com/lunyiliu/CoachLM).

Paper Structure (42 sections, 2 equations, 6 figures, 11 tables)

Introduction
Methodology
Motivation
Overview of CoachLM
Profile of Involved Language Experts
Quality Evaluation Criteria for Instruction Pairs
Manual Instruction Revision with Experts
Preliminary Filtering
Expert Revision
Design of CoachLM
Coach Instruction Tuning
Quality Control of Human Input
Automatic Revision with CoachLM
CoachLM150 Test Set
Experiments and Evaluations
...and 27 more sections

Figures (6)

Figure 1: Illustration of instruction tuning LLMs on pairs of Instruction and Response.
Figure 2: Illustration of CoachLM: (a) in the training stage and (b) in the inference stage. CoachLM learns from the expert revision process in the training stage and performs revisions on instruction pairs in the inference stage. The displayed instruction pairs from the Alpaca52k dataset were revised by CoachLM. For convenience of display, core revisions are marked in red, and the line breaks in the instruction pairs have been adjusted. CoachLM rewrote the ambiguous instruction in the first sample, added explanations to the response in the second, and corrected the less appropriate response in the third.
Figure 3: Format of the instruction pairs $x_c$ used in coach instruction tuning. $x$ denotes the original instruction pair and $x_r$ represents the version revised by experts.
Figure 4: Histogram of ratings by ChatGPT on the whole Alpaca52k dataset before and after CoachLM revision.
Figure 5: Win rates of (a) Alpaca-CoachLM and (b) Alpaca-human against reference responses in the CoachLM150 test set with varying human input ratio $\alpha$, rated by GPT-4 and PandaLM. $\alpha$ is the ratio of human input used for training, with samples sorted by amount of human revision from largest to smallest. $\alpha$=0 means no human input is used in training and $\alpha$=1 means the full human input is used. The displayed win rate is the average of WR1, WR2 and QS.
...and 1 more figures
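
The human input ratio $\alpha$ in the Figure 5 caption can be illustrated with a minimal sketch. It ranks (original, revised) pairs by how much the expert changed them and keeps the top $\alpha$ fraction. The caption does not say how the amount of revision is measured, so the `difflib` similarity ratio used here is an assumption, and the function names `revision_amount` and `select_human_input` are hypothetical rather than taken from the CoachLM code.

```python
from difflib import SequenceMatcher


def revision_amount(original: str, revised: str) -> float:
    """Assumed proxy for revision size: 1 minus the difflib similarity ratio."""
    return 1.0 - SequenceMatcher(None, original, revised).ratio()


def select_human_input(pairs, alpha):
    """Keep the top-alpha fraction of (original, revised) pairs.

    Pairs are sorted from the most heavily revised to the least, so a small
    alpha keeps only the samples that experts changed the most.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    ranked = sorted(
        pairs,
        key=lambda pair: revision_amount(pair[0], pair[1]),
        reverse=True,
    )
    keep = round(alpha * len(ranked))
    return ranked[:keep]
```

With `alpha=0` the function returns an empty list (no human input), and with `alpha=1` it returns every pair, matching the two endpoints described in the caption.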
