Avoiding Copyright Infringement via Large Language Model Unlearning

Guangyao Dou, Zheyuan Liu, Qing Lyu, Kaize Ding, Eric Wong

TL;DR

The paper tackles the challenge of copyright infringement in large language models by proposing Stable Sequential Unlearning (SSU), a method to forget copyrighted content across multiple time steps without retraining from scratch. SSU uses stable task vectors, random labeling loss, and a gradient-based weight saliency map to limit updates to the most relevant parameters, enabling effective unlearning while preserving general knowledge and language abilities. Through experiments on Llama-3.1-8B-Instruct and Mistral-7B-Instruct, SSU outperforms baselines (including NPO and Gradient Difference) in the trade-off between reducing copyright leakage (Rouge-based metrics) and maintaining MMLU/MT-Bench performance, though some unintended knowledge loss and re-emergence remain challenges. The work highlights the practical viability of sequential copyright takedown in production LLMs and discusses robustness, limitations, and avenues for future improvement, such as combining unlearning with generation-time safeguards and data-tracing tools.

Abstract

Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities but also pose risks by learning and generating copyrighted material, leading to significant legal and ethical concerns. In real-world scenarios, model owners need to continuously address copyright infringement as new requests for content removal emerge at different time points. This leads to the need for sequential unlearning, where copyrighted content is removed sequentially as new requests arise. Despite its practical relevance, sequential unlearning in the context of copyright infringement has not been rigorously explored in existing literature. To address this gap, we propose Stable Sequential Unlearning (SSU), a novel framework designed to unlearn copyrighted content from LLMs over multiple time steps. Our approach works by identifying and removing specific weight updates in the model's parameters that correspond to copyrighted content. We improve unlearning efficacy by introducing random labeling loss and ensuring the model retains its general-purpose knowledge by adjusting targeted parameters. Experimental results show that SSU achieves an effective trade-off between unlearning efficacy and general-purpose language abilities, outperforming existing baselines.
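The core update described above (fine-tune on the content to forget, then remove the resulting weight update from the previously unlearned model, touching only salient weights) can be illustrated with a toy sketch. This is not the authors' implementation: parameters are flat Python lists rather than model tensors, the saliency mask is assumed to be a hard threshold on the absolute gradient of the forgetting loss, and the names `saliency_mask`, `unlearn_step`, and `threshold` are hypothetical.

```python
# Toy sketch of one sequential unlearning step with task-vector negation.
# Assumptions (not taken from the paper's code): flat lists stand in for
# model weights, and saliency is a hard threshold on |gradient|.

def saliency_mask(grads, threshold):
    """Return 1.0 for weights whose gradient magnitude meets the threshold, else 0.0."""
    return [1.0 if abs(g) >= threshold else 0.0 for g in grads]


def unlearn_step(theta_prev, theta_ft, grads, threshold):
    """Subtract the saliency-masked task vector from the previously unlearned weights.

    theta_prev: weights of the previously unlearned model (time step t-1)
    theta_ft:   weights after fine-tuning theta_prev on the forget set
    grads:      gradient of the forgetting objective at theta_prev
    """
    mask = saliency_mask(grads, threshold)
    # Task vector: the weight update that fine-tuning on the forget set produced,
    # kept only where the weight is considered salient.
    task_vector = [m * (ft - prev) for m, ft, prev in zip(mask, theta_ft, theta_prev)]
    # Negation: move the weights in the opposite direction of that update.
    return [prev - tv for prev, tv in zip(theta_prev, task_vector)]


theta_prev = [1.0, 2.0, 3.0]
theta_ft = [1.5, 2.5, 2.0]   # hypothetical result of fine-tuning on the forget set
grads = [0.9, 0.01, -0.7]    # hypothetical gradients of the forgetting loss
theta_new = unlearn_step(theta_prev, theta_ft, grads, threshold=0.5)
print(theta_new)  # [0.5, 2.0, 4.0]
```

In this example the second weight has a small gradient, so it is masked out and left unchanged, while the first and third weights move away from their fine-tuned values. Applying `unlearn_step` again with a new forget set would correspond to the next time step.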

Paper Structure (61 sections, 30 equations, 11 figures, 17 tables)

This paper contains 61 sections, 30 equations, 11 figures, 17 tables.

Figures (11)

  • Figure 1: Continuations of a passage from Sherlock Holmes under different copyright takedown methods. The original continuation serves as the ground truth. The vanilla model, prompting, and MemFree decoding exhibit high risk of copyright infringement. In contrast, SSU produces a continuation that is transformative to avoid copyright infringement.
  • Figure 2: Overall process of our unlearning framework. (a) At each time step $t$, an unlearning request is received to forget the dataset $D_f^t$. The unlearning algorithm involves first fine-tuning $\theta_{u}^{t-1}$ on $D_f^t$ to obtain $\theta_{ft}^{t}$, and then subtracting the task vector from the previously unlearned model $\theta_{u}^{t-1}$. (b) At each time step $t$, we compute the gradient loss and random labeling loss to obtain the objective $L_f(\theta_{u}^{t-1})$ that will be used for fine-tuning. (c) At time step $t+1$, we fine-tune $\theta_{u}^{t}$ using the objective we obtained in (b), and only update the model weights that are most salient using weight saliency mapping.
  • Figure 3: The averaged Rouge-1 and Rouge-L scores and benchmark scores for Llama3.1, omitting baseline methods that either consistently have low unlearning efficacy or collapse easily (collapse details in Appendix \ref{sec:appendix-full_experiment_numbers}): (a) books to forget $D_f$ ($\downarrow$); (b) previously unlearned books $D_{prev}$ ($\downarrow$); (c) $D_{nor}$ ($\uparrow$); and (d) averaged normalized MMLU and MT-Bench scores ($\uparrow$). Lower Rouge scores on $D_f$ and $D_{prev}$ indicate better unlearning, while higher scores for $D_{nor}$ and benchmarks reflect better performance. Results for all methods are in Figure \ref{fig:main_book_forget_all_appendix} in Appendix \ref{sec:appendix-full-mistral-llama-results}.
  • Figure 4: Trade-off analysis between general-purpose language abilities and unlearning efficacy for Llama3.1 and Mistral-7B, considering only methods at time steps greater than 1. For improved visualization, we exclude Prompting (a) and GA, which consistently exhibit low unlearning efficacy or collapse during the process. We also exclude TV beyond time step 9 (Llama3.1) and time step 3 (Mistral-7B), as well as Gradient Difference beyond time step 4 (Mistral-7B), due to collapse at these stages (see Appendix \ref{sec:appendix-full_experiment_numbers} for collapse details). General-purpose abilities are calculated using normalized averages of MMLU and MT-Bench scores, while unlearning efficacy is represented by the average of Rouge-1 and Rouge-L scores on $D_f$ (targeted data) and $D_{prev}$ (previously unlearned data). Lower Rouge scores indicate better unlearning performance; hence, these values are negated for clarity. The ideal method balances both metrics and is positioned in the top-right corner. Full results for all methods are in Figure \ref{fig:trade_off_appendix}.
  • Figure 5: Ablation study of SSU for Llama3.1-8B-Instruct. The orange line represents unlearning without the weight saliency map, while the purple line shows the effect of removing the random labeling loss.
  • ...and 6 more figures
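
The captions above report Rouge-1 and Rouge-L between a model continuation and the original text, with lower scores on the forgotten books indicating better unlearning. As a rough illustration of what these metrics measure, here is a simplified sketch. It assumes lowercase whitespace tokenization and F1 scoring, with no stemming; the paper's exact Rouge configuration is not stated in this section, and the function names are hypothetical.

```python
# Simplified Rouge-1 and Rouge-L (F1) between a candidate and a reference text.
# Assumptions: whitespace tokenization, lowercasing, no stemming. This is an
# illustration, not the evaluation code used in the paper.
from collections import Counter


def _f1(overlap, cand_len, ref_len):
    """Combine precision and recall of an overlap count into an F1 score."""
    if overlap == 0 or cand_len == 0 or ref_len == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate, reference):
    """Unigram overlap F1, counting each shared token at most as often as it appears in both."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return _f1(overlap, len(cand), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """Longest-common-subsequence F1, which rewards tokens that appear in the same order."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return _f1(lcs_length(cand, ref), len(cand), len(ref))


# A verbatim continuation scores 1.0 on both metrics; reordered text keeps a
# high Rouge-1 but a lower Rouge-L.
print(rouge_1("a b c d", "a c b d"))
print(rouge_l("a b c d", "a c b d"))
```

A continuation that reproduces the original passage word for word scores close to 1.0, while a transformative continuation of the kind shown in Figure 1 shares fewer tokens and shorter in-order subsequences with the original, so it scores lower.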