Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation

Tianyu Zheng, Shuyue Guo, Xingwei Qu, Jiawei Guo, Xinrun Du, Qi Jia, Chenghua Lin, Wenhao Huang, Jie Fu, Ge Zhang

TL;DR

Kun is a novel approach for creating high-quality instruction-tuning datasets for large language models (LLMs) without relying on manual annotations. It adapts a self-training algorithm based on instruction back-translation and answer polishment to generate a dataset of over a million Chinese instructional data points.

Abstract

In this paper, we introduce Kun, a novel approach for creating high-quality instruction-tuning datasets for large language models (LLMs) without relying on manual annotations. Adapting a self-training algorithm based on instruction back-translation and answer polishment, Kun leverages unlabelled data from diverse sources such as Wudao, Wanjuan, and SkyPile to generate a substantial dataset of over a million Chinese instructional data points. This approach significantly deviates from traditional methods by using a self-curation process to refine and select the most effective instruction-output pairs. Our experiments with the 6B-parameter Yi model across various benchmarks demonstrate Kun's robustness and scalability. Our method's core contributions lie in its algorithmic advancement, which enhances data retention and clarity, and its innovative data generation approach that substantially reduces the reliance on costly and time-consuming manual annotations. This methodology presents a scalable and efficient solution for improving the instruction-following capabilities of LLMs, with significant implications for their application across diverse fields. The code and dataset can be found at https://github.com/Zheng0428/COIG-Kun

Paper Structure (25 sections, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Overview of Answer Polishment. Initially, the Yi base model is fine-tuned on high-quality seed instruction data to create a label model and a primary chat model. The label model then annotates a large amount of primary data, turning it into labeled data. This labeled data is filtered by rules and refined by the primary chat model, producing the final dataset. This dataset is used to further train the primary chat model, resulting in a highly efficient final chat model.
  • Figure 2: The top 10 categories in each of these three areas: Academic Disciplines, Industry Sectors, Text Type
  • Figure 2: The proportion of identical evaluations from three assessors on a single dimension. All: The proportion of consistent assessments across all three dimensions within the same item.
  • Figure 3: Length distribution of instructions and outputs based on Yi-6B model
  • Figure 4: Filter prompt we use to screen out unsuitable content for instructions.
  • ...and 7 more figures
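
The data flow described in the Figure 1 caption (label, filter, polish) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_dataset`, the three callables, and the toy stand-ins are all hypothetical, and in the real pipeline the label and chat models would be fine-tuned Yi models rather than string functions.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (instruction, output)


def build_dataset(
    raw_texts: Iterable[str],
    label_model: Callable[[str], str],
    passes_rules: Callable[[str, str], bool],
    polish: Callable[[str, str], str],
) -> List[Pair]:
    """Hypothetical sketch: label raw text, filter it, then polish the answer."""
    dataset: List[Pair] = []
    for text in raw_texts:
        # Back-translation step: predict an instruction the text could answer.
        instruction = label_model(text)
        # Rule-based filtering: drop pairs that fail simple checks.
        if not passes_rules(instruction, text):
            continue
        # Answer polishment: rewrite the raw text as a response to the instruction.
        output = polish(instruction, text)
        dataset.append((instruction, output))
    return dataset


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_label_model(text: str) -> str:
    return "Explain: " + text.strip().split(".")[0]


def toy_rules(instruction: str, text: str) -> bool:
    return len(text.strip()) >= 20  # arbitrary length threshold


def toy_polish(instruction: str, text: str) -> str:
    return text.strip()


if __name__ == "__main__":
    raw = ["  Kun builds instruction data from unlabelled text.  ", "too short"]
    for pair in build_dataset(raw, toy_label_model, toy_rules, toy_polish):
        print(pair)
```

Passing the models in as plain callables keeps the sketch independent of any particular inference library; the resulting pairs would then be used to further train the primary chat model, as the caption describes.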