Training Foundation Models as Data Compression: On Information, Model Weights and Copyright Law

Giorgio Franceschelli, Claudia Cevenini, Mirco Musolesi

TL;DR

This paper argues that self-supervised training of foundation models can be viewed as data compression, with the model weights acting as a compressed representation of the training data. Applying information-theoretic concepts, notably the information bottleneck, it frames training as compressing the training data $X$ into representations from which samples can be reconstructed at inference time, while acknowledging that memorization is partly lossy. It then explores the copyright implications, proposing that weights may be treated as copies or derivative works of protected training data and outlining the legal mechanisms (e.g., sui generis database rights) and exceptions (e.g., TDM, fair use) that could affect permissions and compensation for model outputs. The work emphasizes the need for multidisciplinary analysis to resolve authorship, ownership, and licensing questions across the AI supply chain, including fine-tuning and synthetic data scenarios.

Abstract

The training process of foundation models, as for other classes of deep learning systems, is based on minimizing the reconstruction error over a training set. For this reason, these models are susceptible to the memorization and subsequent reproduction of training samples. In this paper, we introduce a training-as-compressing perspective, wherein the model's weights embody a compressed representation of the training data. From a copyright standpoint, this point of view implies that the weights can be considered a reproduction or, more likely, a derivative work of a potentially protected set of works. We investigate the technical and legal challenges that emerge from this framing of the copyright of outputs generated by foundation models, including their implications for practitioners and researchers. We demonstrate that adopting an information-centric approach to the problem presents a promising pathway for tackling these emerging complex legal issues.

Paper Structure

This paper contains 8 sections, 2 equations, and 3 figures.

Figures (3)

  • Figure 1: The training-as-compressing perspective. The training set is compressed into the weights via a training algorithm; the source data can be retrieved by giving the model the appropriate input.
  • Figure 2: A thought experiment to confirm that LLMs memorize training data: even if the sentence is semantically nonsensical, the model assigns high probability to its tokens just because it occurred in its training set.
  • Figure 3: A schematic summary of the legal framework resulting from the training-as-compressing perspective. The blue arrows connect potentially protected entities to their copies or derivative works: the foundation model is a copy or a derivative work of the training data; fine-tuning leads to a new derivative work of the foundation model and the tuning data; and an AI-generated work is a derivative work of either the foundation or the fine-tuned model. The yellow (dotted) arrows link the AI-generated work directly back to the training data, and the red (dashed) arrows link it back to both the training and tuning data; both kinds of link pass only through steps requiring specific exceptions or authorization.
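
The memorization effect described in Figure 2 can be illustrated with a toy model. The sketch below is not the authors' experiment: it assumes a simple bigram model with add-one smoothing, and the corpus, the nonsensical sentence, and the function names (`train_bigram`, `code_length_bits`) are hypothetical choices made for illustration. It compares the number of bits needed to encode a nonsensical sentence that was in the training set with the bits needed for a reordering of the same words that was not.

```python
import math
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count bigram transitions; the counts play the role of the model 'weights'."""
    counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts, vocab

def code_length_bits(sentence, counts, vocab):
    """Sum of -log2 p(token | previous token), using add-one smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    bits = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        row = counts.get(prev, Counter())
        p = (row[nxt] + 1) / (sum(row.values()) + len(vocab))
        bits -= math.log2(p)
    return bits

# Hypothetical training set: two ordinary sentences and one nonsensical one.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "colorless green ideas sleep furiously",
]
counts, vocab = train_bigram(corpus)

# Same words, but the second ordering never occurred in the training set.
memorized_bits = code_length_bits("colorless green ideas sleep furiously", counts, vocab)
shuffled_bits = code_length_bits("furiously sleep ideas green colorless", counts, vocab)

print(f"seen in training:  {memorized_bits:.2f} bits")
print(f"never seen:        {shuffled_bits:.2f} bits")
```

Under these assumptions the sentence seen during training receives a shorter code, that is, higher probability, than the reordered one, even though both are semantically nonsensical. A bigram counter is far simpler than a foundation model, so this shows only the direction of the effect, not its magnitude.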