SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs

Zhenliang Zhang, Xinyu Hu, Xiaojun Wan

TL;DR

This work reframes copyright infringement mitigation in LLMs as intrinsic semantic-space control and introduces SCOPE, a two-stage, inference-time method that identifies a copyright-sensitive subspace in the latent space of a sparse autoencoder (SAE) and clamps its activations during decoding. By moving away from surface-level filters, SCOPE substantially reduces regurgitation of copyrighted content while preserving general utility, as validated on the NewsQA, BookSum, and MMLU benchmarks. The core contributions are formulating a subspace hypothesis, defining the Copyright Alignment Score to empirically identify a compact subspace, and demonstrating both causal control (via reverse intervention) and interpretability of the copyright-related semantics. The approach offers a lightweight, filter-free mechanism for copyright protection with practical implications for deployment, though it relies on open models with accessible intermediate representations and assumes a linear subspace. Overall, SCOPE shows that targeted, semantic-level interventions at decoding time can reconcile copyright risk with task performance in LLMs.

Abstract

Large language models sometimes inadvertently reproduce copyrighted passages, exposing downstream applications to legal risk. Most existing inference-time defences focus on surface-level token matching and rely on external blocklists or filters, which add deployment complexity and may overlook semantically paraphrased leakage. In this work, we reframe copyright infringement mitigation as intrinsic semantic-space control and introduce SCOPE, an inference-time method that requires no parameter updates or auxiliary filters. Specifically, a sparse autoencoder (SAE) projects hidden states into a high-dimensional, near-monosemantic space; building on this representation, we identify a copyright-sensitive subspace and clamp its activations during decoding. Experiments on widely recognized benchmarks show that SCOPE mitigates copyright infringement without degrading general utility. Further interpretability analyses confirm that the isolated subspace captures high-level semantics.
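The encode, clamp, decode step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy SAE weights are random, the indices in `subspace_idx` are hypothetical, clamping to zero is an assumed clamp value, and adding back the SAE reconstruction error is a common convention assumed here rather than something the abstract states. The `scale` parameter is also an assumption: 0 clamps the subspace, 1 leaves the hidden state unchanged, and values above 1 amplify the subspace.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """ReLU encoder: map hidden states to sparse, non-negative latents."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    """Linear decoder: map SAE latents back to hidden-state space."""
    return z @ W_dec + b_dec

def intervene(h, W_enc, b_enc, W_dec, b_dec, subspace_idx, scale=0.0):
    """Rescale the selected SAE latents and return the edited hidden state.

    scale=0.0 clamps the subspace to zero (assumed clamp value);
    scale=1.0 is a no-op; scale>1.0 amplifies the subspace.
    """
    z = sae_encode(h, W_enc, b_enc)
    z_edit = z.copy()
    z_edit[..., subspace_idx] *= scale
    # Keep the part of h that the SAE fails to reconstruct, so that only
    # the selected features change (an assumed convention).
    residual = h - sae_decode(z, W_dec, b_dec)
    return sae_decode(z_edit, W_dec, b_dec) + residual

# Toy setup with random weights; all sizes and indices are hypothetical.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)
h = rng.normal(size=(4, d_model))      # four token positions
subspace_idx = np.array([1, 5, 7])     # hypothetical sensitive features

h_clamped = intervene(h, W_enc, b_enc, W_dec, b_dec, subspace_idx, scale=0.0)
h_same = intervene(h, W_enc, b_enc, W_dec, b_dec, subspace_idx, scale=1.0)
```

In a real deployment the edited hidden state would replace the original one at the chosen layer during decoding, for example through a forward hook; that wiring is model-specific and omitted here.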

Paper Structure

This paper contains 72 sections, 16 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Visualization of semantic separation and subspace discrimination. (a) SAE sparse space enables better separation of copyright-sensitive dimensions. (b) In the full space, activations for both corpora overlap. (c) In the estimated subspace $\hat{\mathcal{S}}$, activations become clearly separable.
  • Figure 2: Analysis of the subspace dimension $n$. The vertical dashed line marks the chosen setting $n=1000$, which balances maximal risk mitigation with no loss in general utility.
  • Figure 3: Impact of the reverse intervention. As we amplify the features in the copyrighted subspace $\hat{\mathcal{S}}$ with an increasing factor $\alpha$, the mitigation win rate progressively drops from 8.7% to 4.1%, indicating that the LLM becomes more prone to reproducing copyrighted content. This provides causal evidence that the subspace $\hat{\mathcal{S}}$ is directly responsible for generating copyrighted content.
  • Figure 4: Activation frequency distributions for (a) dense LLM features, which cluster tightly in the upper‐right with no clear separation of copyright‐sensitive dimensions, and (b) sparse SAE features, where the copyrighted subspace (orange) forms a distinct cluster in the upper‐left region.
  • Figure 5: Activation profiles for six representative SAE features. In each subfigure, the top negative logits (red) identify tokens most inhibited by the feature, while the top positive logits (blue) highlight tokens most strongly activated. Panels (a)–(f) correspond to features #755, #10184, #11848, #15445, #7089, and #12993, respectively.
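The captions for Figures 2 and 4 describe choosing a compact set of $n$ SAE features whose activation statistics separate copyrighted text from general text. The Copyright Alignment Score itself is not defined in this excerpt, so the sketch below substitutes a simple stand-in: the gap in activation frequency between the two corpora. The function names, the threshold, and the synthetic data are all hypothetical and only illustrate top-$n$ selection.

```python
import numpy as np

def activation_frequency(latents, threshold=0.0):
    """Fraction of samples on which each SAE feature exceeds the threshold."""
    return (latents > threshold).mean(axis=0)

def select_subspace(z_copyright, z_general, n):
    """Return indices of the n features with the largest frequency gap.

    This frequency gap is a stand-in score, not the paper's
    Copyright Alignment Score.
    """
    gap = activation_frequency(z_copyright) - activation_frequency(z_general)
    # argsort is ascending, so reverse it to put the largest gap first.
    return np.argsort(gap)[::-1][:n]

# Synthetic latents: features 3 and 9 fire on every "copyrighted" sample.
rng = np.random.default_rng(0)
d_sae = 16
z_general = np.maximum(rng.normal(size=(200, d_sae)) - 1.0, 0.0)
z_copyright = np.maximum(rng.normal(size=(200, d_sae)) - 1.0, 0.0)
z_copyright[:, [3, 9]] += 1.0

top2 = select_subspace(z_copyright, z_general, n=2)
print(sorted(top2.tolist()))
```

With this synthetic data the two planted features are recovered; on real SAE latents the choice of score and of $n$ (the caption for Figure 2 reports $n=1000$) would determine the trade-off between mitigation and utility.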