Discovering Universal Activation Directions for PII Leakage in Language Models

Leo Marchyok, Zachary Coalson, Sungho Keum, Sooel Son, Sanghyun Hong

TL;DR

UniLeak is a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts.

Abstract

Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability, with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage: the superposition of a latent signal in the model's representations, enabling both risk amplification and mitigation.
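
The "linear addition at inference time" described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scaling coefficient `alpha`, and the choice to normalize the direction to unit length are all assumptions made for illustration, and a toy 4-dimensional vector stands in for a real residual-stream activation.

```python
import math


def normalize(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def steer(hidden, direction, alpha):
    """Add alpha * unit(direction) to a single residual-stream vector.

    `alpha` (steering strength) and the unit normalization are
    illustrative assumptions, not values taken from the paper.
    """
    unit = normalize(direction)
    return [h + alpha * u for h, u in zip(hidden, unit)]


# Toy example; real residual streams have thousands of dimensions.
hidden = [0.5, -1.0, 2.0, 0.0]      # hypothetical activation at one token position
direction = [3.0, 0.0, 4.0, 0.0]    # hypothetical steering direction (L2 norm 5)
steered = steer(hidden, direction, alpha=2.0)
```

Under these assumptions the same vector is added regardless of the prompt, which is what makes a direction "universal" in the sense used above: the component of the activation along the direction grows by exactly `alpha`, while the orthogonal components are unchanged.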

Paper Structure (25 sections, 4 equations, 14 figures, 4 tables, 1 algorithm)

Figures (14)

  • Figure 1: Mechanistic origin of PII leakage. Prior work frames PII leakage as a prompt-dependent behavior (left). UniLeak reveals a universal activation direction that amplifies PII generation across prompts (center). This representation-level vulnerability admits both adversarial exploitation via minimal internal modification and mitigation via projection-based suppression (right).
  • Figure 2: Overlap between PII (phone numbers) extracted by UniLeak and baseline attacks for Phi-2B trained on Enron.
  • Figure 3: Logit Lens analysis of UniLeak on Phi-2B. Using 10k prefixes that cause the base model to generate a personal name, we report the average probability of generating the first PII token at each layer, measured with Logit Lens [LogitLens].
  • Figure 4: Direct logit attribution of UniLeak on Phi-2B. Using 10k prefixes that cause the base model to generate a personal name, we report the direct logit attribution [elhage2021mathematical] for the outputs of the last ten self-attention and MLP layers.
  • Figure 5: Contextual similarity between generated PII prefixes and training-data prefixes in Phi-2B.
  • ...and 9 more figures
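
The "projection-based suppression" mentioned in the Figure 1 caption can be sketched as a standard orthogonal projection that removes an activation's component along a given direction. The paper's exact mitigation procedure is not given in this excerpt, so the function below is an assumption about its general shape, and the vectors are toy values.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def suppress(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`.

    Computes h - (<h, v> / <v, v>) * v, the orthogonal projection of h
    onto the subspace perpendicular to v.
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        raise ValueError("direction must be non-zero")
    coeff = dot(hidden, direction) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy example with hypothetical values.
hidden = [1.7, -1.0, 3.6, 0.0]      # activation with a component along the direction
direction = [3.0, 0.0, 4.0, 0.0]    # hypothetical PII-related direction
cleaned = suppress(hidden, direction)
```

After suppression the activation has zero component along the direction, so applying `suppress` a second time leaves it unchanged; components orthogonal to the direction are untouched.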

Theorems & Definitions (1)

  • Definition 2.1