On Training Data Influence of GPT Models

Yekun Chai, Qingyi Liu, Shuohuan Wang, Yu Sun, Qiwei Peng, Hua Wu

TL;DR

GPTfluence introduces a parameterized simulation of training dynamics, demonstrating robust generalization capabilities to unseen training data across both fine-tuning and instruction-tuning scenarios, spanning tasks in natural language understanding and generation.

Abstract

Amidst the rapid advancements in generative language models, the investigation of how training data shapes the performance of GPT models is still emerging. This paper presents GPTfluence, a novel approach that leverages a featurized simulation to assess the impact of training examples on the training dynamics of GPT models. Our approach not only traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, on targeted test points but also enables a comprehensive comparison with existing methods across various training scenarios in GPT models, ranging from 14 million to 2.8 billion parameters, across a range of downstream tasks. Contrary to earlier methods that struggle with generalization to new data, GPTfluence introduces a parameterized simulation of training dynamics, demonstrating robust generalization capabilities to unseen training data. This adaptability is evident across both fine-tuning and instruction-tuning scenarios, spanning tasks in natural language understanding and generation. We make our code and data publicly available at https://github.com/ernie-research/gptfluence.

Paper Structure (60 sections, 8 equations, 29 figures, 13 tables, 1 algorithm)

Figures (29)

  • Figure 1: Overview of GPTfluence. Step 1: We sample training data to create curricula for training GPT models and compute the test metrics of test examples at each training step. The training curricula and the ground-truth metrics are together referred to as GPTDynamics. Step 2: We train our featurized simulator on GPTDynamics. It takes the training examples at the current and previous steps, together with the test example, as input and predicts the ground-truth metric. Step 3: Given a new curriculum and a test example of interest, the simulator starts from the test metric at the first step and simulates the test metric at future training steps in an autoregressive manner.
  • Figure 2: Illustration of loss and metric simulation on NLU and NLG tasks with different TDA methods for instruction tuning. See the appendix for more examples.
  • Figure 3: Average loss-simulation performance of GPTfluence on five datasets as the checkpoint interval varies.
  • Figure 5: Impact of feature representations from different pre-trained encoders on loss simulation.
  • Figure 6: Comparison of loss simulation between GPTfluence and Simfluence when instruction tuning the Pythia model series, ranging from 14M to 2.8B parameters.
  • ...and 24 more figures
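
The autoregressive rollout described in Step 3 of the Figure 1 caption can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the update rule (next metric = multiplicative factor × previous metric + additive factor), the names `simulate_metric` and `toy_factors`, and the hard-coded per-example effects are all hypothetical. In a featurized simulator the factors would be predicted from encoder features of the training and test examples, and the caption notes that examples from previous steps are also taken into account; this sketch uses only the current step.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: maps the training example ids seen at one step to a
# (multiplicative factor, additive factor) pair. This is an assumed interface,
# not one taken from the paper or its code release.
StepFactors = Callable[[Sequence[int]], Tuple[float, float]]


def simulate_metric(initial_metric: float,
                    curriculum: Sequence[Sequence[int]],
                    step_factors: StepFactors) -> List[float]:
    """Roll a test metric forward through a curriculum, one step at a time."""
    trajectory = [initial_metric]
    for batch in curriculum:
        alpha, beta = step_factors(batch)
        # Assumed update rule: each prediction feeds the next step, which is
        # what makes the rollout autoregressive.
        trajectory.append(alpha * trajectory[-1] + beta)
    return trajectory


def toy_factors(batch: Sequence[int]) -> Tuple[float, float]:
    # Toy stand-in for a learned simulator: each training example id has a
    # fixed additive effect on the test loss, and the multiplicative factor
    # is held at 1.0.
    effect = {0: -0.5, 1: -0.25, 2: 0.0}
    return 1.0, sum(effect[i] for i in batch)


# Start from the test loss observed at the first step, then simulate three
# further training steps of a new curriculum.
trajectory = simulate_metric(4.0, [[0, 1], [2], [0]], toy_factors)
print(trajectory)  # [4.0, 3.25, 3.25, 2.75]
```

Keeping the factor function separate from the rollout loop means the same loop could be reused with any learned predictor, under the stated assumption about the update rule.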