Interrogating LLM design under a fair learning doctrine

Johnny Tian-Zheng Wei; Maggie Wang; Ameya Godbole; Jonathan H. Choi; Robin Jia

TL;DR

This paper argues for a structural approach to copyright risk in LLMs, contending that training-design decisions shape memorization and potential infringement, and that examining model outputs alone is insufficient to address all copyright risks. It operationalizes "fair learning" through causal and correlational analyses of the open-source LLM Pythia, assessing how data curation and upweighting influence memorization, and formalizes a doctrine to guide adjudication. The work highlights methodological tools (randomized train/test splits, regression analyses, and dataset-density-based ablations) and discusses how courts could balance transparency, innovation, and copyright interests. It also advocates incorporating external technical guidelines to improve clarity and suggests a path toward rule-like guidance that remains adaptable to evolving technology.

Abstract

The current discourse on large language models (LLMs) and copyright largely takes a "behavioral" perspective, focusing on model outputs and evaluating whether they are substantially similar to training data. However, substantial similarity is difficult to define algorithmically and a narrow focus on model outputs is insufficient to address all copyright risks. In this interdisciplinary work, we take a complementary "structural" perspective and shift our focus to how LLMs are trained. We operationalize a notion of "fair learning" by measuring whether any training decision substantially affected the model's memorization. As a case study, we deconstruct Pythia, an open-source LLM, and demonstrate the use of causal and correlational analyses to make factual determinations about Pythia's training decisions. By proposing a legal standard for fair learning and connecting memorization analyses to this standard, we identify how judges may advance the goals of copyright law through adjudication. Finally, we discuss how a fair learning standard might evolve to enhance its clarity by becoming more rule-like and incorporating external technical guidelines.
