GSS: Gated Subspace Steering for Selective Memorization Mitigation in LLMs

Xuanqi Zhang, Haoyang Shang, Xiaoxiao Li

TL;DR

The paper tackles memorization in large language models by showing it is a sparse, token-level phenomenon and proposing Gated Subspace Steering (GSS), an inference-time framework that decouples memorization detection (probe) from targeted correction (steer). The optimal probe-steer pair is derived via a principled whitening-and-SVD approach (optimal subspace steering), enabling a low-rank intervention that activates only when memorization signals exceed a threshold. Empirical results across TinyMem, Pythia, GSM8K, and UltraChat demonstrate state-of-the-art memorization reduction with minimal impact on generalization performance and negligible inference-time overhead, while also offering theoretical insights into the geometry of memorization in activation spaces. The method provides a practical, scalable defense for privacy and robustness in deployment, without requiring retraining or global parameter updates.

Abstract

Large language models (LLMs) can memorize and reproduce training sequences verbatim -- a tendency that undermines both generalization and privacy. Existing mitigation methods apply interventions uniformly, degrading performance on the majority of tokens that generalize normally. We show empirically that memorization is sparse, intermittent, and token-conditioned, suggesting that effective mitigation requires context-aware intervention rather than static parameter modification. To this end, we propose a novel and effective selective memorization mitigation method -- Gated Subspace Steering (GSS), which decomposes intervention into a probe (detecting memorization-relevant activations) and a steer (applying targeted correction only when the probe exceeds a threshold). The optimal probe-steer pair emerges from a principled optimization framework based on optimal subspace steering. Experiments on four benchmarks show GSS matches or exceeds state-of-the-art memorization reduction while requiring $100$--$1000\times$ less compute than optimization-based alternatives. Furthermore, we provide new theoretical insights into the geometry of memorization in neural representations.
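The probe-and-steer decomposition described above can be illustrated with a minimal NumPy sketch. The exact update rule is not given in this abstract, so the code assumes a simple hard gate: for each probe direction $u_k$, the readout $\langle h, u_k \rangle$ is compared with a threshold, and only when it exceeds the threshold is the matching steer direction $v_k$ subtracted, scaled by the readout. The function name `gated_subspace_steer`, the strength parameter `alpha`, and the array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def gated_subspace_steer(h, U, V, eps, alpha=1.0):
    """Hypothetical gated steering step (assumed form, not the paper's exact rule).

    h   : (T, d) hidden states, one row per token
    U   : (d, k) probe directions (columns u_k)
    V   : (d, k) steer directions (columns v_k)
    eps : gate threshold on the probe readout
    """
    scores = h @ U                      # (T, k): probe readout <h_t, u_k>
    gate = scores > eps                 # (T, k): True only where the probe fires
    # Tokens whose readout stays below eps get a zero correction,
    # so their hidden states pass through unchanged.
    correction = (gate * scores) @ V.T  # (T, d)
    return h - alpha * correction, gate
```

Because the gate is evaluated per token, a sequence in which only a few tokens trigger the probe is modified only at those positions, which matches the selective behaviour the abstract describes.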

Paper Structure (75 sections, 2 theorems, 74 equations, 12 figures, 8 tables, 1 algorithm)


Key Result

Theorem 4.1

Let $\Sigma_{\text{gen}} = L L^\top$ be the Cholesky decomposition of the generalization covariance matrix. Consider the direction-wise optimization problem [...]. Under the transformation $\tilde{u} = L^\top u$ and $\mathbf{M}_{op}=L^{-1}\mathbf{M}$, this problem reduces to [...]. An optimal solution is given by [...], where $\tilde{u}_1, \tilde{v}_1$ are the leading left and right singular vectors of the whitened memori[...]
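The display equations of Theorem 4.1 are not reproduced in this summary, so the following NumPy sketch rests on an assumed reading: the direction-wise problem is taken to be maximizing $u^\top \mathbf{M} v$ subject to $u^\top \Sigma_{\text{gen}} u = 1$ and $\|v\| = 1$. Under that assumption, the stated substitution $\tilde{u} = L^\top u$, $\mathbf{M}_{op} = L^{-1}\mathbf{M}$ turns the objective into $\tilde{u}^\top \mathbf{M}_{op} v$ with unit-norm $\tilde{u}$, which the leading singular pair of $\mathbf{M}_{op}$ maximizes. The function name `optimal_probe_steer` and the `jitter` regularizer are hypothetical additions for illustration.

```python
import numpy as np


def optimal_probe_steer(M, sigma_gen, jitter=1e-6):
    """Whitening-and-SVD sketch under the assumed objective
    max u^T M v  s.t.  u^T sigma_gen u = 1, ||v|| = 1.

    M         : (d, d) memorization matrix (shape is an assumption)
    sigma_gen : (d, d) symmetric positive (semi)definite covariance
    jitter    : small diagonal term so the Cholesky factor exists
    """
    d = sigma_gen.shape[0]
    # Cholesky factor: sigma_gen (+ jitter * I) = L @ L.T, L lower triangular.
    L = np.linalg.cholesky(sigma_gen + jitter * np.eye(d))
    # Whitened matrix M_op = L^{-1} M, computed by solving L X = M.
    M_op = np.linalg.solve(L, M)
    U_t, S, Vt = np.linalg.svd(M_op, full_matrices=False)
    u_tilde = U_t[:, 0]   # leading left singular vector (whitened probe)
    v = Vt[0]             # leading right singular vector (steer)
    # Undo the change of variables: u_tilde = L^T u, so u = L^{-T} u_tilde.
    u = np.linalg.solve(L.T, u_tilde)
    return u, v, S[0]
```

With this construction, the returned probe satisfies the covariance constraint up to the jitter term, and $u^\top \mathbf{M} v$ equals the top singular value of the whitened matrix.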

Figures (12)

  • Figure 1: Token-level memorization statistics. (a) Histogram of consecutive tokens with memorization signal ($\omega_t>0$). (b) Right-skewed heavy-tailed distribution, indicating memorization is driven by a small fraction of high-magnitude tokens.
  • Figure 2: (Top) Gated Subspace Steering (GSS) Overview. (a) From memorization signals $\omega_t$, we derive decoupled Probe ($u_k$) and Steer ($v_k$) directions from the Memorization Matrix ($\mathbf{M}$) and Generalization Manifold ($\Sigma_{\text{gen}}$) via Generalized SVD. (b) During inference, the gating mechanism $\mathcal{G}$ computes the signal strength $\langle h, u_k \rangle$; the steering vector is applied only when it exceeds a safety threshold $\varepsilon$.
  • Figure 3: Geometric Visualization. A comparison of intervention in the activation space. (a) Naive Steering applies a constant subtraction to all tokens, resulting in an "Unconstrained Shift" that inadvertently degrades generalized representations (shifting blue points). (b) Gated Subspace Steering establishes a "Safe Subspace", where the intervention activates for memorized tokens (orange) that violate the gate threshold, effectively mitigating memorization while preserving the generalization manifold.
  • Figure 4: Pareto Frontier Analysis. The plots visualize the trade-off between memorization reduction (x-axis, normalized by baseline) and downstream utility measured in log-likelihood (y-axis, normalized by baseline). The ideal method occupies the top-right corner. (Top: GSM8K) On reasoning tasks, our method dominates the frontier. We observe utility recovery where mild steering improves performance ($>1.0$). (Bottom: UltraChat) Our method consistently outperforms Editing baselines and maintains a Pareto frontier in the high-utility regime.
  • Figure 5: Ablation study on $\epsilon$ in the TinyMem math and Pythia 2.8B models.
  • ...and 7 more figures

Theorems & Definitions (5)

  • Definition 3.1: Token-level Memorization Signal
  • Theorem 4.1: Optimal Probe--Steer Direction
  • Definition 5.1: $(n,k)$-Memorization
  • Theorem A.1: Optimal Memorization Suppression Subspace
  • proof