LLMs and Memorization: On Quality and Specificity of Copyright Compliance

Felix B Mueller, Rebekka Görge, Anna K Bernzen, Janna C Pirk, Maximilian Poretschkin

TL;DR

Popular LLMs differ widely in copyright compliance, specificity, and appropriate refusal; OpenGPT-X, Alpaca, and Luminous produce a particularly low absolute number of potential copyright violations.

Abstract

Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act. In this work, we propose a systematic analysis to quantify the extent of potential copyright infringements in LLMs using European law as an example. Unlike previous work, we evaluate instruction-finetuned models in a realistic end-user scenario. Our analysis builds on a proposed threshold of 160 characters, which we borrow from the German Copyright Service Provider Act, and on a fuzzy text matching algorithm to identify potentially copyright-infringing textual reproductions. We analyze the specificity of countermeasures against copyright infringement by comparing model behavior on copyrighted and public domain data. We investigate what behaviors models show instead of producing protected text (such as refusal or hallucination) and provide a first legal assessment of these behaviors. We find that there are huge differences in copyright compliance, specificity, and appropriate refusal among popular LLMs. Alpaca, GPT 4, GPT 3.5, and Luminous perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing a particularly low absolute number of potential copyright violations. Code can be found at https://github.com/felixbmuller/llms-memorization-copyright.

Paper Structure (44 sections, 4 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Example of our prompting, text matching, and labelling of outputs applied to various large language models.
  • Figure 2: $\operatorname{SRR}$ for different prompt types and LLMs, separated by Copyright (left) and PublicDomain (right). We normalize $\operatorname{SRR}$ by the number of prompts of each type.
  • Figure 3: $\operatorname{CDR}$ (left) and $\operatorname{SRR}_\text{CR}$ and $\operatorname{SRR}_\text{PD}$ (right) for different model sizes of Llama 2 and Vicuna. We use the mean over five runs for models that are not part of the main comparison.
  • Figure 4: Combinations of labels occurring in the outputs for GPT 4 and Llama 2 chat. The category "other" summarizes the remaining labels and combinations of more than three categories.

Theorems & Definitions (4)

  • Definition 1: Longest Common Subsequence
  • Definition 2: Text Similarity
  • Definition 3: Fuzzy Extractable Memorization
  • Definition 4: Fuzzy Threshold Common Substring Problem
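
The definitions listed above are not reproduced on this page, so the following is only a rough Python sketch of the kind of matching the abstract describes, not the authors' algorithm. It shows the textbook longest common subsequence and an exact (non-fuzzy) common-substring check against the 160-character threshold. The function names, the use of `difflib`, and the strict "more than 160 characters" comparison are all assumptions made for illustration; the paper's fuzzy matching would additionally tolerate small deviations between output and source.

```python
from difflib import SequenceMatcher

# Threshold named in the abstract (borrowed from the German Copyright
# Service Provider Act). How the boundary case is treated is an assumption.
THRESHOLD_CHARS = 160


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (textbook dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous block shared by a and b (exact match, not fuzzy)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]


def exceeds_threshold(output: str, source: str,
                      threshold: int = THRESHOLD_CHARS) -> bool:
    """Hypothetical helper: True if the model output shares a contiguous
    span with the source that is longer than `threshold` characters."""
    return len(longest_common_substring(output, source)) > threshold
```

A call such as `exceeds_threshold(model_output, book_text)` would flag an output for closer inspection; an exact check like this one misses reproductions with small edits, which is the gap a fuzzy matcher is meant to close.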