Investigating the Feasibility of Mitigating Potential Copyright Infringement via Large Language Model Unlearning

Guangyao Dou

TL;DR

This work tackles the risk of copyright infringement in pre-trained LLMs by studying sequential unlearning, where copyrighted content is removed over time. It introduces Stable Sequential Unlearning (SSU), which combines learning stable task vectors with a random labeling loss and a gradient-based weight saliency mechanism to localize updates and minimize collateral damage to non-targeted knowledge and general-language abilities. Through experiments on Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3 using Gutenberg books, SSU generally achieves a favorable trade-off between reducing copyright risk (lower Rouge-1 and Rouge-L on forgotten and previously forgotten data) and preserving general-purpose capabilities (MMLU, MT-Bench), outperforming several baselines but not eliminating all risks. The results underscore both the potential of principled unlearning for copyright takedowns and the need for further work, including robust evaluation, certified guarantees, and complementary measures beyond unlearning to address copyright concerns in generative AI systems.

Abstract

Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities but also pose risks by learning and generating copyrighted material, leading to significant legal and ethical concerns. In a potential real-world scenario, model owners may need to handle content-removal requests that arrive at different points in time, addressing copyright infringement on an ongoing basis. One potential way of doing this is sequential unlearning, where copyrighted content is removed sequentially as new requests arise. Despite its practical relevance, sequential unlearning in the context of copyright infringement has not been rigorously explored in existing literature. To address this gap, we propose Stable Sequential Unlearning (SSU), a novel framework designed to unlearn copyrighted content from LLMs over multiple time steps. Our approach uses task vectors to identify and remove the specific weight updates in the model's parameters that correspond to copyrighted content. We improve unlearning efficacy by introducing a random labeling loss, and we help the model retain its general-purpose knowledge by adjusting only targeted parameters selected with gradient-based weight saliency. Extensive experimental results show that SSU sometimes achieves an effective trade-off between unlearning efficacy and general-purpose language abilities, outperforming existing baselines, but it is not a complete solution for unlearning copyrighted material.
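The mechanism described above (fine-tune on the content to forget, update only salient weights, then subtract the resulting task vector) can be illustrated with a toy sketch. This is not the authors' implementation: weights are plain dictionaries of floats, the function names, the threshold `gamma`, and the learning rate are hypothetical, and the mask rule (gradient magnitude at or above a threshold) is an assumption about how gradient-based saliency might be applied.

```python
# Toy sketch of task-vector negation with a gradient-based saliency mask.
# All names and numeric values are illustrative assumptions, not the paper's code.

def saliency_mask(grads, gamma):
    """Mark a weight as salient when its gradient magnitude is at least gamma."""
    return {name: 1.0 if abs(g) >= gamma else 0.0 for name, g in grads.items()}

def masked_finetune_step(theta, grads, mask, lr):
    """One gradient step on the forget data that only moves salient weights."""
    return {name: theta[name] - lr * mask[name] * grads[name] for name in theta}

def unlearn_step(theta_prev, theta_ft):
    """Subtract the task vector (theta_ft - theta_prev) from theta_prev."""
    return {
        name: theta_prev[name] - (theta_ft[name] - theta_prev[name])
        for name in theta_prev
    }

# Previously unlearned model and made-up gradients of the forgetting objective.
theta_u_prev = {"w1": 0.50, "w2": -0.20, "w3": 1.00}
grads = {"w1": 0.80, "w2": 0.01, "w3": -0.40}

mask = saliency_mask(grads, gamma=0.1)          # w2 is below the threshold
theta_ft = masked_finetune_step(theta_u_prev, grads, mask, lr=0.5)
theta_u = unlearn_step(theta_u_prev, theta_ft)  # model after this unlearning request
```

In this toy run the non-salient weight `w2` is left untouched, which is the sense in which saliency localizes the update; a real setup would take many fine-tuning steps over model tensors rather than a single step over three scalars.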

Paper Structure

This paper contains 51 sections, 30 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: An example of a GPT model generating substantially similar copyrighted content from the book Harry Potter, which is highly likely a case of copyright infringement.
  • Figure 2: Overall process of our unlearning framework. (a) At each time step $t$, an unlearning request is received to forget the dataset $D_f^t$. The unlearning algorithm involves first fine-tuning $\theta_{u}^{t-1}$ on $D_f^t$ to obtain $\theta_{ft}^{t}$, and then subtracting the task vector from the previously unlearned model $\theta_{u}^{t-1}$. (b) At each time step $t$, we compute the gradient loss and random labeling loss to obtain the objective $L_f(\theta_{u}^{t-1})$ that will be used for fine-tuning. (c) At time step $t+1$, we fine-tune $\theta_{u}^{t}$ using the objective obtained in (b), and only update the model weights that are most salient according to weight saliency mapping.
  • Figure 3: The average of Rouge-1 and Rouge-L scores and benchmark scores for LLaMA3.1: (a) books to forget $D_f$ ($\downarrow$); (b) previously unlearned books $D_{prev}$ ($\downarrow$); (c) $D_{nor}$ ($\uparrow$); and (d) averaged normalized MMLU and MT-Bench scores ($\uparrow$). The results for TV after time step 8 are omitted due to collapse. Lower Rouge scores for $D_f$ and $D_{prev}$ indicate better unlearning, while higher scores for $D_{nor}$ and benchmarks reflect better performance.
  • Figure 4: The average of Rouge-1 and Rouge-L scores and reasoning abilities for Mistral-7B-Instruct: (a) books to forget $D_f$ ($\downarrow$); (b) previously unlearned books $D_{prev}$ ($\downarrow$); (c) $D_{nor}$ ($\uparrow$); and (d) averaged normalized MMLU and MT-Bench scores ($\uparrow$). The results for TV after time step 8 are omitted due to collapse. Lower Rouge scores for $D_f$ and $D_{prev}$ indicate better unlearning, while higher scores for $D_{nor}$ and benchmarks reflect better performance.
  • Figure 5: Trade-off between general-purpose language abilities and unlearning efficacy for Llama3.1 and Mistral-7B, including all methods, except TV beyond time step 9 (Llama3.1) and time step 3 (Mistral-7B), and Gradient Difference beyond time step 4 (Mistral-7B), as they all collapsed during these time steps. General-purpose abilities are represented by the average of MMLU and MT-Bench scores, normalized. Unlearning efficacy is measured as the average of Rouge-1 and Rouge-L scores on $D_f$ and $D_{prev}$, where lower Rouge scores indicate better unlearning performance; thus, values were negated for clarity. The ideal performance is positioned in the top-right corner. The plots capture the performance of all methods at every time step greater than 1.
  • ...and 2 more figures
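
The trade-off coordinates described in the Figure 5 caption can be sketched as a small calculation. This is an illustration under stated assumptions, not the paper's evaluation code: the function name `tradeoff_point` is hypothetical, MMLU is assumed to be an accuracy in [0, 1], MT-Bench is assumed to be normalized by dividing by its maximum score of 10, and the input values are made up.

```python
# Sketch of one point in a trade-off plot: (negated average Rouge, normalized benchmarks).
# The normalization choices and all input values below are assumptions.

def tradeoff_point(mmlu, mt_bench, rouge_scores, mt_bench_max=10.0):
    """Return (unlearning_efficacy, general_ability) for one method at one time step."""
    # General-purpose ability: average of MMLU and MT-Bench after scaling both to [0, 1].
    general_ability = (mmlu + mt_bench / mt_bench_max) / 2
    # Unlearning efficacy: average Rouge over the forget sets, negated so higher is better.
    unlearning_efficacy = -sum(rouge_scores) / len(rouge_scores)
    return unlearning_efficacy, general_ability

# Rouge-1 and Rouge-L on D_f, then Rouge-1 and Rouge-L on D_prev (made-up numbers).
rouge = [0.2, 0.1, 0.3, 0.2]
efficacy, ability = tradeoff_point(mmlu=0.6, mt_bench=8.0, rouge_scores=rouge)
```

Because the Rouge average is negated, a method that forgets more and retains more capability scores higher on both values, which matches the caption's statement that ideal performance sits in the top-right corner.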