
Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations

Haoran Wang, Li Xiong, Kai Shu

Abstract

Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which independently intervenes along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. Our results demonstrate that contextual privacy failures arise from misalignment between representation and behavior rather than missing awareness, and that leveraging the compositional structure of CI enables more reliable contextual privacy control, shedding light on how contextual privacy understanding in LLMs can be improved.
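The core probing claim above is that each CI parameter is linearly decodable from hidden activations. The following minimal sketch illustrates that setup: it trains one linear probe per CI parameter and checks pairwise alignment of the probe directions. The shapes, layer choice, and random stand-in data are illustrative assumptions, not the authors' pipeline; real residual-stream activations would replace `X`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 512, 4096                                 # prompts x hidden size (illustrative)
X = rng.normal(size=(n, d)).astype(np.float32)   # stand-in for one layer's activations

# One binary label per CI parameter (stand-ins for annotated scenarios).
labels = {
    "info_type":    rng.integers(0, 2, size=n),
    "recipient":    rng.integers(0, 2, size=n),
    "transmission": rng.integers(0, 2, size=n),
}

directions = {}
for name, y in labels.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, probe.decision_function(X_te))
    w = probe.coef_[0]
    directions[name] = w / np.linalg.norm(w)     # unit probe direction
    print(f"{name}: AUROC = {auc:.3f}")

# Near-orthogonal probe directions would be one signal of the functional
# independence the abstract describes; on random stand-in data the AUROCs
# hover near 0.5 by construction.
for a in directions:
    for b in directions:
        if a < b:
            print(a, b, "cos =", float(directions[a] @ directions[b]))
```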

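CI-parametric steering, as described above, intervenes independently along each CI direction rather than along a single monolithic "privacy" vector. Below is a minimal PyTorch sketch of one way such an intervention could be applied via a forward hook; the layer index, attribute path, direction tensors, and strength values are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def make_ci_steering_hook(directions, alphas):
    """Return a forward hook that shifts the residual stream along each CI direction.

    directions: dict mapping CI parameter name -> unit direction tensor (hidden_size,)
    alphas:     dict mapping CI parameter name -> steering strength (float)
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        for name, v in directions.items():
            # Independent additive shift per CI dimension (compositional control).
            hidden = hidden + alphas[name] * v.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Illustrative usage with a Hugging Face decoder model (names are assumptions):
# vecs = {"info_type": v_info, "recipient": v_rec, "transmission": v_tp}
# strengths = {k: 0.5 for k in vecs}          # e.g. alpha = 0.5, as in Figure 4
# handle = model.model.layers[15].register_forward_hook(
#     make_ci_steering_hook(vecs, strengths))
# ...generate with the steered model...
# handle.remove()
```

Because each CI parameter gets its own direction and strength, individual dimensions can be ablated or re-weighted separately, which is what makes the structured control more predictable than a single steering vector.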


Figures (7)

  • Figure 1: Illustration of using CI-Parametric steering to mitigate contextual privacy leakage. In this scenario, the LLM must determine whether Nancy (the sender) is permitted to share Steve’s secret (the data subject) with Bob (the recipient).
  • Figure 2: Multi-dimensional privacy on CONFAIDE Tier 2. Left: PCA requires $k{=}3$ components to achieve the best results; Right: layer-wise AUROC of probe transfer vs. PCA (1st PC). The probe improves monotonically across layers, while PCA rises only after layer 15 (a minimal sketch of this comparison follows the list).
  • Figure 3: Overview of CI-parametric steering.
  • Figure 4: CI-parameter ablation on CONFAIDE ($\alpha{=}0.5$).
  • Figure 5: Leakage as a function of $\alpha$. CI-parametric steering is less sensitive to $\alpha$ on both synthetic (left) and CONFAIDE (right) datasets, while monolithic steering is highly sensitive.
  • ...and 2 more figures
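The probe-vs-PCA comparison in Figure 2 (right) can be sketched in miniature as follows. This assumes `X` holds one layer's activations and `y` the binary CI labels; `probe_vs_pca_auroc` is a helper name introduced here for illustration, not from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_vs_pca_auroc(X, y):
    """AUROC of a supervised probe vs. projection onto the unsupervised 1st PC.

    X: (n_prompts, hidden_size) activations from one layer
    y: binary CI labels for the same prompts
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    probe_auc = roc_auc_score(y_te, probe.decision_function(X_te))
    pc1 = PCA(n_components=1).fit(X_tr).components_[0]
    scores = X_te @ pc1
    # The sign of a principal component is arbitrary; score both orientations.
    pca_auc = max(roc_auc_score(y_te, scores), roc_auc_score(y_te, -scores))
    return probe_auc, pca_auc

# Running this per layer and plotting both curves would reproduce the shape of
# Figure 2 (right): probe AUROC climbing steadily, PCA lagging in early layers.
```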