Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit

Joshua Freeman, Chloe Rippe, Edoardo Debenedetti, Maksym Andriushchenko

TL;DR

This paper investigates verbatim memorization and potential copyright infringement in frontier LLMs within the context of The New York Times v. OpenAI. It curates NYT articles, uses three prompt-injection attacks, and performs a grid search over model size and prefix length, applying five memorization metrics to quantify verbatim copying and semantic content retention. The authors find OpenAI models generally exhibit less memorization than competitors, likely due to aggressive filtering, while larger models and highly duplicated content show greater memorization; they also discuss the legal implications for fair use and contributory infringement and propose defense strategies. The work highlights the practical and policy significance of memorization mitigation as LLMs scale and informs ongoing legal and technological debates at the intersection of AI, copyright, and policy.

Abstract

Copyright infringement in frontier LLMs has received much attention recently due to the New York Times v. OpenAI lawsuit, filed in December 2023. The New York Times claims that GPT-4 has infringed its copyrights by reproducing articles for use in LLM training and by memorizing the inputs, thereby publicly displaying them in LLM outputs. Our work aims to measure the propensity of OpenAI's LLMs to exhibit verbatim memorization in their outputs relative to other LLMs, specifically focusing on news articles. We discover that both GPT and Claude models use refusal training and output filters to prevent verbatim output of the memorized articles. We apply a basic prompt template to bypass the refusal training and show that OpenAI models are currently less prone to memorization elicitation than models from Meta, Mistral, and Anthropic. We find that as models increase in size, especially beyond 100 billion parameters, they demonstrate significantly greater capacity for memorization. Our findings have practical implications for training: more attention must be placed on preventing verbatim memorization in very large models. Our findings also have legal significance: in assessing the relative memorization capacity of OpenAI's LLMs, we probe the strength of The New York Times's copyright infringement claims and OpenAI's legal defenses, while underscoring issues at the intersection of generative AI, law, and policy.

Paper Structure

This paper contains 23 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Model size vs. longest common contiguous subsequence (in characters). The amount of verbatim memorization increases significantly for larger models, especially those with more than 100 billion parameters. The error bars represent the range of $\pm1$ standard deviation taken across all samples. Note that for GPT and Claude models, we exclude articles for which a model generates a refusal message or for which an output filter blocks the generation (see Table \ref{tab:ns} for the precise numbers of excluded articles).
  • Figure 2: Our prompt-based attack #3 that involves a role play and fictional future scenario. A fake history is inserted, giving the impression to the model that it has already started outputting the article.
  • Figure 3: Prefix Size vs. LCCS (characters). Articles involved in the lawsuit show higher levels of memorization compared to arbitrary articles, which in turn show higher memorization than new, previously unseen articles. Memorization rates do not monotonically increase with larger prefix sizes. Instead, peak memorization occurs around the 150-token prefix mark.
  • Figure 4: Comparison of attacks on the system for Attack 1 and Attack 2.
  • Figure 5: List of questions asked to GPT to benchmark its average output speed. This list was obtained by prompting GPT itself.
  • ...and 3 more figures
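
The LCCS metric named in the captions of Figures 1 and 3 (longest common contiguous subsequence, measured in characters) can be illustrated with a minimal sketch. It assumes plain character-level comparison between a reference article and a model output; the function names `lccs` and `lccs_length` are hypothetical, and any normalization the authors apply (whitespace, casing, tokenization) is not stated in this overview and is not reproduced here.

```python
def lccs(reference: str, generated: str) -> str:
    """Return the longest run of characters that appears contiguously in both strings.

    Illustrative sketch only: a standard dynamic program over character pairs,
    not necessarily the implementation used in the paper.
    """
    best_len = 0   # length of the longest shared run found so far
    best_end = 0   # index in `reference` just past the end of that run
    # prev[j] = length of the shared run ending at reference[i-2] and generated[j-1]
    prev = [0] * (len(generated) + 1)
    for i in range(1, len(reference) + 1):
        cur = [0] * (len(generated) + 1)
        for j in range(1, len(generated) + 1):
            if reference[i - 1] == generated[j - 1]:
                # Extend the run that ended one character earlier in both strings.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_end = i
        prev = cur
    return reference[best_end - best_len:best_end]


def lccs_length(reference: str, generated: str) -> int:
    """LCCS in characters, the quantity plotted on the y-axis of Figures 1 and 3."""
    return len(lccs(reference, generated))


if __name__ == "__main__":
    # Toy strings, not data from the paper.
    article = "the quick brown fox"
    output = "a quick brown dog"
    print(repr(lccs(article, output)), lccs_length(article, output))
```

The dynamic program runs in time proportional to the product of the two string lengths, which is adequate for article-sized inputs; a suffix-automaton approach would scale better but is harder to read.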