Differentially Private Steering for Large Language Model Alignment

Anmol Goel, Yaxi Hu, Iryna Gurevych, Amartya Sanyal

TL;DR

This work tackles privacy in LLM alignment by studying activation editing under formal differential privacy guarantees. It introduces PSA, a simple, training-free method that adds calibrated noise to private steering vectors computed from positive and negative demonstrations, ensuring $(|\mathcal{S}|\varepsilon,|\mathcal{S}|\delta)$-DP while preserving alignment, generation quality, and general reasoning. Across seven alignment benchmarks and multiple open-source LLMs, PSA achieves DP guarantees with minimal utility loss compared to non-private steering and often outperforms zero-shot baselines. An accompanying Membership Inference Attack demonstrates empirical privacy improvements, and scaling to larger models further strengthens the privacy-utility tradeoff, underscoring the practical viability of privacy-preserving LLM steering.
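One way to read the $(|\mathcal{S}|\varepsilon,|\mathcal{S}|\delta)$ guarantee is as basic sequential composition. This rests on an assumption not stated in this summary: that $\mathcal{S}$ is the set of steered layers and that the noisy steering vector released for each layer $\ell$, written $M_\ell$ here (a hypothetical symbol), is individually $(\varepsilon,\delta)$-DP.

```latex
\[
M_\ell \text{ is } (\varepsilon,\delta)\text{-DP for every } \ell \in \mathcal{S}
\;\Longrightarrow\;
(M_\ell)_{\ell \in \mathcal{S}} \text{ is }
\Bigl(\textstyle\sum_{\ell \in \mathcal{S}} \varepsilon,\ \sum_{\ell \in \mathcal{S}} \delta\Bigr)\text{-DP}
= \bigl(|\mathcal{S}|\varepsilon,\ |\mathcal{S}|\delta\bigr)\text{-DP}.
\]
```

Under the same assumption, text generated with these vectors is a function of the noisy vectors and the query alone, so by the post-processing property of DP it inherits the same guarantee with respect to the demonstrations.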

Abstract

Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (Llama, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing empirical privacy in LLM steering via activation editing. Our experiments support the theoretical guarantees by showing improved empirical privacy for our PSA algorithm compared to several existing non-private techniques.
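The idea of a noisy steering vector can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' PSA implementation. It assumes that per-demonstration activation differences are norm-clipped, averaged, and perturbed with Gaussian noise. The names `private_steering_vector`, `steer`, `clip_norm`, `noise_multiplier`, and `strength` are hypothetical, and the noise scale is not calibrated to any particular $(\varepsilon,\delta)$.

```python
import numpy as np


def private_steering_vector(pos_acts, neg_acts, clip_norm=1.0,
                            noise_multiplier=0.5, rng=None):
    """Illustrative noisy mean-difference steering vector for one layer.

    pos_acts, neg_acts: arrays of shape (n, d) holding layer activations
    for paired positive and negative demonstrations.
    """
    rng = np.random.default_rng() if rng is None else rng
    diffs = np.asarray(pos_acts, dtype=float) - np.asarray(neg_acts, dtype=float)

    # Clip each per-pair difference to L2 norm <= clip_norm so that a single
    # demonstration pair has bounded influence on the mean.
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    clipped = diffs * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    n, d = clipped.shape
    mean_vec = clipped.mean(axis=0)

    # Assumption: Gaussian noise whose scale shrinks with n. A real deployment
    # would derive sigma from the target (epsilon, delta) and the sensitivity.
    sigma = noise_multiplier * clip_norm / n
    return mean_vec + rng.normal(0.0, sigma, size=d)


def steer(hidden, vector, strength=1.0):
    """Add the (noisy) steering vector to every position's hidden state."""
    return hidden + strength * vector


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(size=(16, 8)) + 1.0   # stand-in "positive" activations
    neg = rng.normal(size=(16, 8)) - 1.0   # stand-in "negative" activations
    vec = private_steering_vector(pos, neg, rng=rng)
    hidden = rng.normal(size=(5, 8))       # stand-in hidden states for 5 tokens
    print(steer(hidden, vec).shape)
```

In this sketch, a larger `noise_multiplier` or a smaller `clip_norm` trades steering fidelity for stronger protection of individual demonstration pairs. The paper's ablation over noise levels and clipping factors explores the same kind of knobs.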

Paper Structure

This paper contains 42 sections, 5 equations, 5 figures, 21 tables, 3 algorithms.

Figures (5)

  • Figure 1: An overview of Private Steering for LLM Alignment (PSA). (Left) We first generate differentially private steering vectors with positive and negative demonstrations by adding calibrated noise to the steering vectors. (Right) The private steering vectors are then added to the activations of the LLM layers during inference which ensures the generated texts for any query are differentially private with respect to the paired demonstrations.
  • Figure 2: Generating private steering vectors
  • Figure 3: Results of PCA, Mean Steering and PSA with Llama, Mistral, Gemma and Qwen on the seven benchmark alignment datasets. The dotted line represents the zero-shot performance. The Y-axis represents the accuracy in choosing the correct behavioral option (higher is better).
  • Figure 4: Scaling behavior of PSA on Qwen2.5 series of LLMs for the Refusal dataset. We observe that PSA has a higher utility degradation in smaller LLMs.
  • Figure 5: Ablation results on the three largest datasets used in this study. We observe consistent utility degradation with increasing noise levels and clipping factors.

Theorems & Definitions (1)

  • Definition 1