Interrogating LLM design under a fair learning doctrine

Johnny Tian-Zheng Wei, Maggie Wang, Ameya Godbole, Jonathan H. Choi, Robin Jia

TL;DR

This paper argues for a structural approach to copyright risk in LLMs, contending that training-design decisions shape memorization, and therefore potential infringement, in ways that an analysis of outputs alone cannot capture. It operationalizes "fair learning" through causal and correlational analyses of the open-source LLM Pythia, assessing how data curation and upweighting influence memorization, and it proposes a legal standard to guide adjudication. The work highlights methodological tools (randomized train/test splits, regression analyses, and dataset-density-based ablations) and discusses how courts could balance transparency, innovation, and copyright interests. It also advocates incorporating external technical guidelines to improve clarity and suggests a path toward rule-like guidance that remains adaptable to evolving technology.

Abstract

The current discourse on large language models (LLMs) and copyright largely takes a "behavioral" perspective, focusing on model outputs and evaluating whether they are substantially similar to training data. However, substantial similarity is difficult to define algorithmically and a narrow focus on model outputs is insufficient to address all copyright risks. In this interdisciplinary work, we take a complementary "structural" perspective and shift our focus to how LLMs are trained. We operationalize a notion of "fair learning" by measuring whether any training decision substantially affected the model's memorization. As a case study, we deconstruct Pythia, an open-source LLM, and demonstrate the use of causal and correlational analyses to make factual determinations about Pythia's training decisions. By proposing a legal standard for fair learning and connecting memorization analyses to this standard, we identify how judges may advance the goals of copyright law through adjudication. Finally, we discuss how a fair learning standard might evolve to enhance its clarity by becoming more rule-like and incorporating external technical guidelines.
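As a rough illustration of the causal comparison described above, the sketch below scores documents with a memorization metric and compares the training and held-out (test) groups, as in the Figure 1 caption. It is a minimal sketch under stated assumptions, not the authors' code: the function names, the toy log-probabilities, the normal-approximation confidence interval, and the sign convention of the Min-K% score (negated so that lower means stronger memorization) are all assumptions made for illustration.

```python
import math
import statistics


def min_k_score(token_logprobs, k=0.2):
    """Negated mean of the lowest-k fraction of token log-probabilities.

    Assumption: the score is negated so that LOWER means stronger
    memorization; the paper's exact convention may differ.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the least likely tokens
    return -sum(lowest) / n


def mean_diff_ci(train_scores, test_scores, z=1.96):
    """Difference in means (train - test) with a normal-approximation CI."""
    diff = statistics.mean(train_scores) - statistics.mean(test_scores)
    se = math.sqrt(
        statistics.variance(train_scores) / len(train_scores)
        + statistics.variance(test_scores) / len(test_scores)
    )
    return diff, diff - z * se, diff + z * se


# Hypothetical per-token log-probabilities (not real model outputs).
train_docs = [
    [-0.1, -2.0, -0.5, -3.0, -0.2],
    [-0.3, -1.5, -0.4, -2.5, -0.1],
    [-0.2, -1.8, -0.6, -2.8, -0.3],
]
test_docs = [
    [-0.2, -2.1, -0.4, -3.1, -0.3],
    [-0.4, -1.6, -0.5, -2.4, -0.2],
    [-0.1, -1.9, -0.7, -2.9, -0.2],
]

train_scores = [min_k_score(doc) for doc in train_docs]
test_scores = [min_k_score(doc) for doc in test_docs]
diff, lower, upper = mean_diff_ci(train_scores, test_scores)
print(f"train - test = {diff:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the interval comfortably contains zero, the comparison gives no evidence that including a document in training changed the metric. Because the split is randomized, a nonzero difference could be read causally; the interval construction here is only one simple choice.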

Paper Structure

This paper contains 42 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: MinK% (K=20) for the Pythia 6.9B model on the training (blue) and test (orange) sets, grouped by dataset component. MinK% scores are computed on the first 256 tokens of up to 1k documents and then averaged (lower indicates stronger memorization). Error bars denote 95% CIs. The difference in the metric between train and test represents the causal effect of upweighting a single document. Differences are grouped by the upweight value assigned by the creators of the Pile. Differences for the other metrics are presented in Figure 5. All metrics are similar across the train and test splits, indicating that upweighting a single document has little effect on memorization.
  • Figure 2: An illustration of the intuition behind simulated ablation. Under the data density hypothesis, a piece of text is effectively upweighted by other, similar text. We fit a regression that predicts the loss of an average snippet (e.g., from StackExchange) given that dataset's typical neighborhood. To simulate ablating a dataset, we predict the loss from the number of neighbors that would remain if that dataset were removed.
  • Figure 4: Loss of Pythia 6.9B on each dataset when trained on the entire Pile (blue) or when that dataset is ablated (orange). To simulate an ablation, we first fit a regression that predicts the loss on a dataset from the dataset's average neighborhood. The ablated loss is then predicted from a counterfactual neighborhood count; an illustration is given in Figure 6. While such a regression is not guaranteed to be causal, it can be useful for identifying areas for further experimentation when computational resources are limited. Notably, FreeLaw and PubMed Central are the most affected when they are ablated from the training data.
  • Figure 5: (Top) Mean cross-entropy loss, (middle) token-level accuracy, and (bottom) verbatim memorization for the Pythia 6.9B model on the training (blue) and test (orange) sets, grouped by dataset component. Error bars denote 95% CIs. These metrics are described in the paper's section on operationalization. The difference in the metric between train and test represents the causal effect of upweighting a single document. Differences are grouped by the upweight value assigned by the creators of the Pile. As with MinK%, these metrics vary little across the train and test splits, indicating that upweighting a single document has little effect on memorization.
  • Figure 6: Average model loss on a dataset plotted against its average nearest-neighbor count, for each of the 22 dataset components. The orange line is the best-fit regression line. To simulate an ablation, we use the paper's dataset-overlap table to obtain a counterfactual neighborhood count and then use the regression to calculate $\Delta Y$. The shift in the loss is parallel to the regression line. The results of each simulated ablation are presented in Figure 4.
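
The simulated ablation described in the captions for Figures 2, 4, and 6 can be sketched as a one-variable regression followed by a shift along the fitted line. The code below is a hedged sketch under assumptions, not the authors' implementation: the function names and the toy numbers are hypothetical, and whether the paper transforms the neighbor counts (for example, by taking logarithms) before fitting is not stated in the captions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def simulated_ablation_loss(observed_loss, observed_neighbors,
                            counterfactual_neighbors, slope):
    """Shift the observed loss parallel to the regression line.

    delta_y is the predicted change in loss when the neighbor count moves
    from its observed value to its counterfactual (post-ablation) value.
    """
    delta_y = slope * (counterfactual_neighbors - observed_neighbors)
    return observed_loss + delta_y


# Hypothetical per-dataset averages (NOT values from the paper):
# average nearest-neighbor count and average model loss.
neighbor_counts = [2.0, 5.0, 8.0, 12.0, 20.0]
losses = [2.9, 2.6, 2.2, 1.9, 1.3]

slope, intercept = fit_line(neighbor_counts, losses)

# Suppose ablating a dataset would cut its average neighbor count from 12 to 3.
ablated = simulated_ablation_loss(
    observed_loss=1.9,
    observed_neighbors=12.0,
    counterfactual_neighbors=3.0,
    slope=slope,
)
print(f"slope = {slope:.4f}, simulated ablated loss = {ablated:.3f}")
```

Shifting the observed point, rather than reading the prediction directly off the line, preserves each dataset's residual; this matches the caption's statement that the shift is parallel to the regression line. As the Figure 4 caption notes, such a regression is correlational and is not guaranteed to be causal.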