We Should Separate Memorization from Copyright

Adi Haviv, Niva Elkin-Koren, Uri Hacohen, Roi Livni, Shay Moran

TL;DR

The paper argues that memorization in generative models should not be treated as equivalent to copyright infringement, and that current reconstruction attacks do not by themselves establish copying under law. It develops a legal-technical framework that distinguishes memorization as a technical property from copying as a legal concept, and emphasizes evaluating copyright risk at the output level using thin versus thick protection criteria. It offers a taxonomy of infringement-relevant signals, reinterprets existing attacks through copyright doctrine, and calls for theory-grounded, benchmarked, and mitigation-focused research. The work aims to align ML evaluation with copyright standards to improve research auditing, policy discussions, and responsible deployment of generative systems.

Abstract

The widespread use of foundation models has introduced copyright as a new risk factor. This has sparked an active, ongoing debate among data scientists and legal scholars, in which claims and results on both sides are often interpreted differently and taken to carry different implications. Our position is that much of the technical literature relies on traditional reconstruction techniques that are not designed for copyright analysis. As a result, memorization and copying have been conflated across both technical and legal communities and in multiple contexts. We argue that memorization, as commonly studied in data science, should not be equated with copying and should not be used as a proxy for copyright infringement. We distinguish technical signals that meaningfully indicate infringement risk from those that instead reflect lawful generalization or high-frequency content. Based on this analysis, we advocate for an output-level, risk-based evaluation process that aligns technical assessments with established copyright standards and provides a more principled foundation for research, auditing, and policy.

Paper Structure

This paper contains 33 sections and 2 figures.

Figures (2)

  • Figure 1: Illustration of the spectrum of copyright protection, ranging from thin to thick.
  • Figure 2: Illustration of the spectrum of copyright protection for generated images. Examples are arranged vertically from thin protection (bottom) to thick protection (top), and horizontally from expression (literal copies of the original image) to abstract idea. Except for the leftmost column, all variations were generated using Google’s Nano Banana 3 Pro model.