On Training Data Influence of GPT Models

Yekun Chai; Qingyi Liu; Shuohuan Wang; Yu Sun; Qiwei Peng; Hua Wu

TL;DR

GPTfluence introduces a parameterized simulation of training dynamics, demonstrating robust generalization capabilities to unseen training data across both fine-tuning and instruction-tuning scenarios, spanning tasks in natural language understanding and generation.

Abstract

Amidst the rapid advancements in generative language models, the investigation of how training data shapes the performance of GPT models is still emerging. This paper presents GPTfluence, a novel approach that leverages a featurized simulation to assess the impact of training examples on the training dynamics of GPT models. Our approach not only traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, on targeted test points but also enables a comprehensive comparison with existing methods across various training scenarios in GPT models, ranging from 14 million to 2.8 billion parameters, across a range of downstream tasks. Contrary to earlier methods that struggle with generalization to new data, GPTfluence introduces a parameterized simulation of training dynamics, demonstrating robust generalization capabilities to unseen training data. This adaptability is evident across both fine-tuning and instruction-tuning scenarios, spanning tasks in natural language understanding and generation. We make our code and data publicly available at https://github.com/ernie-research/gptfluence.

Paper Structure (60 sections, 8 equations, 29 figures, 13 tables, 1 algorithm)

Introduction
Contribution
Preliminaries
Task Definition
Training Data Attribution
TracIn
GPTfluence: Featurized Simulation-based Approach
Overview
Featurized Simulation Approach
Connection to Previous Approaches
Experiments
Experimental Settings
GPTDynamics Data Collection
Experiment Setup for Simulators
Test Loss Estimation
...and 45 more sections

Figures (29)

Figure 1: Overview of GPTfluence. Step 1: We sample training data to create curricula for training GPT models and compute the test metrics of test examples at each training step. The training curricula together with the ground-truth metrics are referred to as GPTDynamics. Step 2: We train our featurized simulator on GPTDynamics; it takes the training examples at the current and previous steps, together with the test example, as input and predicts the ground-truth metric. Step 3: Given a new curriculum and a test example of interest, the simulator starts from the test metric at the first step and simulates the test metric at future training steps in an autoregressive manner.
Figure 2: Illustration of loss and metric simulation on NLU and NLG tasks with different TDA methods for instruction tuning. See the appendix for more examples.
Figure 3: Variation curves of the average performance of GPTfluence for loss simulation in five datasets when different checkpoint intervals are selected.
Figure 5: Impact of feature representation of different pre-trained encoders on loss simulation.
Figure 6: Comparison of the loss simulation between GPTfluence and Simfluence on instruction tuning Pythia model series, ranging from 14M to 2.8B.
...and 24 more figures
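
The autoregressive rollout described in Step 3 of Figure 1 can be illustrated with a minimal sketch. It assumes, purely for illustration, that each training step updates the test metric linearly, using a multiplicative and an additive factor aggregated over the training examples in that step. The names `simulate_metric` and `toy_factor_fn`, the word-overlap heuristic, and all constants are hypothetical and are not taken from the paper or its released code; in the paper, the factors come from a trained featurized simulator.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: maps (training example, test example) to a pair of
# (multiplicative factor, additive factor). In GPTfluence these would come
# from a learned, featurized model; here it is just a placeholder.
FactorFn = Callable[[str, str], Tuple[float, float]]


def simulate_metric(
    curriculum: Sequence[Sequence[str]],
    test_example: str,
    initial_metric: float,
    factor_fn: FactorFn,
) -> List[float]:
    """Roll out a test metric over a curriculum, one training step at a time.

    Assumed update rule (an illustration, not the paper's exact equation):
        metric_t = alpha_t * metric_{t-1} + beta_t
    where alpha_t and beta_t are sums of per-example factors in step t.
    """
    trajectory = [initial_metric]
    for batch in curriculum:
        factors = [factor_fn(train_example, test_example) for train_example in batch]
        alpha = sum(a for a, _ in factors)
        beta = sum(b for _, b in factors)
        # Autoregressive: the prediction feeds on the previous prediction,
        # not on a ground-truth metric.
        trajectory.append(alpha * trajectory[-1] + beta)
    return trajectory


def toy_factor_fn(train_example: str, test_example: str) -> Tuple[float, float]:
    """Toy stand-in for a learned simulator: more word overlap lowers the loss."""
    overlap = len(set(train_example.split()) & set(test_example.split()))
    # 0.5 per example so that a batch of two examples gives alpha = 1.0.
    return 0.5, -0.1 * overlap


# A new curriculum: two training steps, two training examples per step.
curriculum = [
    ["the cat sat", "dogs bark"],
    ["a cat purrs", "the sky"],
]
trajectory = simulate_metric(curriculum, "the cat", 2.0, toy_factor_fn)
print(trajectory)
```

The rollout needs only the metric at the first step and the curriculum, which is what allows a simulator of this kind to be queried for curricula it has not seen during training.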
