Training Foundation Models as Data Compression: On Information, Model Weights and Copyright Law
Giorgio Franceschelli, Claudia Cevenini, Mirco Musolesi
TL;DR
This paper argues that self-supervised training of foundation models can be viewed as data compression, with the model weights acting as a compressed representation of the training data. Drawing on information-theoretic concepts, notably the information bottleneck, it frames training as compressing the training data $X$ into a representation from which samples can be reconstructed at inference time, while acknowledging that this memorization is lossy. It then explores the copyright implications, proposing that the weights may be treated as copies or, more likely, derivative works of protected training data, and outlining legal mechanisms (e.g., sui generis database rights) and exceptions (e.g., text and data mining (TDM), fair use) that could affect permissions and compensation for model outputs. The work emphasizes the need for multidisciplinary analysis to resolve authorship, ownership, and licensing questions across the AI supply chain, including fine-tuning and synthetic data scenarios.
Abstract
The training process of foundation models, as for other classes of deep learning systems, is based on minimizing the reconstruction error over a training set. For this reason, these models are susceptible to the memorization and subsequent reproduction of training samples. In this paper, we introduce a training-as-compressing perspective, wherein the model's weights embody a compressed representation of the training data. From a copyright standpoint, this point of view implies that the weights can be considered a reproduction or, more likely, a derivative work of a potentially protected set of works. We investigate the technical and legal challenges that this framing raises for the copyright of outputs generated by foundation models, including the implications for practitioners and researchers. We demonstrate that adopting an information-centric approach to the problem presents a promising pathway for tackling these emerging complex legal issues.
