Transferable text data distillation by trajectory matching

Rong Yao, Hailin Hu, Yifei Fu, Hanting Chen, Wenyi Fang, Fanyi Du, Kai Han, Yunhe Wang

TL;DR

This work tackles the escalating data demands of large language model training by introducing neighbor-aware corpus distillation (NACD), a text data distillation method that learns pseudo prompt data via trajectory matching and neighbor-based regularization. By extracting long-range expert trajectories from full-data training and distilling them into a small prompt-embedding dataset, NACD enables instruction tuning with far less data while preserving or surpassing full-data performance. The method demonstrates cross-architecture transfer (OPT to Llama) and outperforms strong data-selection baselines such as LESS on ARC-Easy and MMLU, with notable gains when only 5% of the data is used. The approach offers practical data compression benefits for LLM training and suggests avenues for extending text distillation to more NLP tasks and multimodal settings.

Abstract

As large language models (LLMs) grow in size, their training costs rise with them, creating an urgent need to reduce the amount of data used in LLM training. Compared with data selection methods, data distillation aims to synthesize a small number of data samples that achieve the training effect of the full dataset, and it offers better flexibility. Despite its successes in computer vision, the discreteness of text data has so far hindered its exploration in natural language processing (NLP). In this work, we propose a method that learns pseudo prompt data based on trajectory matching and finds its nearest neighbor ID to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To the best of our knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, the ARC-Easy and MMLU instruction tuning datasets, establish the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates good transferability across LLM architectures (i.e., OPT to Llama).
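
The nearest-neighbor step mentioned in the abstract can be illustrated with a minimal sketch: each learned pseudo embedding is snapped to the ID of the closest row of a vocabulary embedding table. The function name `nearest_token_ids`, the toy vectors, and the choice of Euclidean distance are assumptions made for illustration; the abstract does not state which distance the method uses.

```python
import math


def nearest_token_ids(pseudo_embeddings, vocab_embeddings):
    """Map each pseudo embedding to the index of its closest vocabulary row.

    Euclidean distance is an assumption made for this sketch.
    """
    ids = []
    for vec in pseudo_embeddings:
        best = min(
            range(len(vocab_embeddings)),
            key=lambda i: math.dist(vec, vocab_embeddings[i]),
        )
        ids.append(best)
    return ids


# Toy vocabulary of three 2-d token embeddings (hypothetical values).
vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Two learned pseudo embeddings that sit near tokens 1 and 2.
pseudo = [[0.9, 0.1], [0.1, 0.8]]
token_ids = nearest_token_ids(pseudo, vocab)
```

Discrete IDs, unlike continuous embeddings, are not tied to one model's embedding space, which is presumably why this projection matters for transfer between architectures.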

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: (a) Illustration of our method. Step 1: we train the LLM with LoRA on the full text data and save all intermediate parameters as expert trajectories. Step 2: given a small selection of text data $\mathcal{D}_{sel}$ (chosen by any data selection method), the synthetic prompt dataset $\mathcal{D}_{syn}$ is learned by trajectory matching, which aims to fit the learning trajectories of the full text data. Step 3: the learned prompt embedding is concatenated with the text data along the sequence dimension to perform instruction tuning. Step 4: the learned LLM is evaluated on the target data. (b) A trajectory matching example. The blue trajectory represents parameter updates during training with the full text data, while the red trajectory represents parameter updates during training with a subset of synthetic text data over N steps.
  • Figure 2: Framework of our proposed text distillation process. We concatenate the learned pseudo token embeddings with the raw text data and perform LLM parameter trajectory matching by ensuring that the student network's parameter updates are similar to those of the experts.
  • Figure 3: Samples to show the effects of distilled pseudo tokens.
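
The trajectory matching objective described in Figures 1 and 2 can be sketched as a normalized parameter distance. The form below is the one commonly used in trajectory-matching dataset distillation; whether the paper uses exactly this normalization is an assumption, and the function name and flat parameter lists are hypothetical simplifications of the LoRA parameters.

```python
def trajectory_matching_loss(student_end, expert_start, expert_target):
    """Squared distance between student and expert parameters, normalized
    by how far the expert itself moved over the matched segment.

    student_end:   student parameters after N steps on synthetic data
    expert_start:  expert checkpoint the student was initialized from
    expert_target: later expert checkpoint the student should reach
    """
    num = sum((s - e) ** 2 for s, e in zip(student_end, expert_target))
    den = sum((a - b) ** 2 for a, b in zip(expert_start, expert_target))
    if den == 0.0:
        raise ValueError("expert_start and expert_target are identical")
    return num / den


# Toy two-parameter example (hypothetical values).
expert_start = [0.0, 0.0]
expert_target = [1.0, 2.0]

perfect = trajectory_matching_loss([1.0, 2.0], expert_start, expert_target)
no_progress = trajectory_matching_loss([0.0, 0.0], expert_start, expert_target)
halfway = trajectory_matching_loss([0.5, 1.0], expert_start, expert_target)
```

A loss of 0 means the student landed exactly on the expert checkpoint, and a loss of 1 means it made no progress from its starting point. In a real setup this loss would be differentiated with respect to the pseudo token embeddings through the student's N update steps.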