PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing

Anthony Hughes, Vasisht Duddu, N. Asokan, Nikolaos Aletras, Ning Ma

TL;DR

This work tackles the privacy risk of PII leakage in language models by proposing PATCH, a circuit-aware post-training defense that targets and edits the internal PII leakage pathways identified through mechanistic interpretability. By generating clean/corrupted prompts and applying Edge Attribution Patching with Integrated Gradients (EAP-IG), PATCH isolates critical attention heads and edges, computes the leakage edges shared across PII types, and patches them via zero or mean ablation. The approach yields strong privacy improvements with favorable utility trade-offs compared to scrubbing, DP, and prior editing methods, with recall reductions of up to 65% and, when combined with DP, residual leakage recall as low as 0.01%. These findings suggest that targeted circuit interventions can complement formal privacy guarantees and offer practical, model-sensitive strategies for mitigating PII leakage in diverse transformer architectures.
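The last two steps named above (finding the edges shared across PII types, then ablating them) can be illustrated with a toy sketch. It is not the authors' implementation: the edge names, the scores, the `keep_fraction` selection rule, and the use of a set intersection to define "shared" edges are all assumptions made for illustration, and the attribution scores are taken as given rather than computed with EAP-IG.

```python
import numpy as np

def top_edges(scores, keep_fraction):
    """Return the edges with the largest absolute attribution scores.

    `scores` maps an edge (source, target) to a precomputed score.
    Selecting a fixed fraction is an assumption of this sketch.
    """
    k = max(1, int(round(len(scores) * keep_fraction)))
    ranked = sorted(scores, key=lambda e: abs(scores[e]), reverse=True)
    return set(ranked[:k])

def shared_edges(scores_by_type, keep_fraction):
    """Intersect the top edges of every PII type (assumed notion of 'shared')."""
    per_type = [top_edges(s, keep_fraction) for s in scores_by_type.values()]
    return set.intersection(*per_type)

def patch_edges(activations, edges, mode="zero", reference=None):
    """Replace the activation carried by each selected edge.

    mode="zero": overwrite with zeros.
    mode="mean": overwrite with the mean over reference samples (axis 0).
    """
    patched = dict(activations)
    for e in edges:
        if mode == "zero":
            patched[e] = np.zeros_like(activations[e])
        elif mode == "mean":
            if reference is None:
                raise ValueError("mean ablation needs reference activations")
            patched[e] = reference[e].mean(axis=0)
        else:
            raise ValueError(f"unknown ablation mode: {mode}")
    return patched

# Hypothetical attribution scores for two PII types over the same four edges.
scores_by_type = {
    "name": {("a0.h1", "mlp1"): 0.9, ("a0.h2", "mlp1"): 0.1,
             ("a1.h0", "logits"): -0.8, ("mlp0", "a1.h0"): 0.05},
    "email": {("a0.h1", "mlp1"): 0.2, ("a0.h2", "mlp1"): 0.6,
              ("a1.h0", "logits"): -0.9, ("mlp0", "a1.h0"): 0.01},
}
shared = shared_edges(scores_by_type, keep_fraction=0.5)

edge = ("a1.h0", "logits")
activations = {edge: np.array([1.0, -2.0])}
reference = {edge: np.array([[1.0, 2.0], [3.0, 4.0]])}

zeroed = patch_edges(activations, shared, mode="zero")
meaned = patch_edges(activations, shared, mode="mean", reference=reference)
```

In this toy setup only the edge `("a1.h0", "logits")` ranks in the top half for both PII types, so it is the only edge that gets patched. Zero ablation removes its contribution entirely, while mean ablation substitutes an average activation, which may disturb downstream computation less.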

Abstract

Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference. Existing defense mechanisms such as differential privacy (DP) reduce this leakage, but incur large drops in utility. Based on a comprehensive study using circuit discovery to identify the computational circuits responsible for PII leakage in LMs, we hypothesize that specific PII leakage circuits in LMs should be responsible for this behavior. Therefore, we propose PATCH (Privacy-Aware Targeted Circuit PatcHing), a novel approach that first identifies and subsequently directly edits PII circuits to reduce leakage. PATCH achieves a better privacy-utility trade-off than existing defenses, e.g., reducing recall of PII leakage from LMs by up to 65%. Finally, PATCH can be combined with DP to reduce recall of residual leakage of an LM to as low as 0.01%. Our analysis shows that PII leakage circuits persist even after the application of existing defense mechanisms. In contrast, PATCH can effectively mitigate their impact.
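The abstract reports leakage as recall, and the figures also report precision. As a rough illustration of how such metrics are commonly computed for PII extraction, the sketch below uses the set-based definitions: precision is the share of extracted strings that are real training PII, and recall is the share of training PII that was extracted. The exact definitions and matching rules used in the paper are not given here, so the function name `pii_precision_recall`, the exact-string matching, and the sample data are assumptions.

```python
def pii_precision_recall(extracted, training_pii):
    """Set-based precision and recall of extracted PII (assumed definitions).

    extracted:    PII strings an attacker obtained from model generations.
    training_pii: PII strings that actually occur in the training data.
    """
    extracted, training_pii = set(extracted), set(training_pii)
    hits = extracted & training_pii  # exact-match leaks only
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(training_pii) if training_pii else 0.0
    return precision, recall

# Hypothetical data: two of the three extracted strings are real leaks.
extracted = {"alice@example.com", "bob@example.com", "+1-555-0100"}
training_pii = {"alice@example.com", "carol@example.com",
                "+1-555-0100", "dave@example.com"}

precision, recall = pii_precision_recall(extracted, training_pii)
```

Under these definitions, a defense lowers recall when fewer of the training-set PII strings can be recovered from the model.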

Paper Structure

This paper contains 49 sections, 2 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Mitigating PII leakage through circuit analysis of LMs fine-tuned on PII-containing documents (A) with and without privacy defenses. (B) We use Patch to discover PII leaking circuits. Finally, (C) Patch then edits the discovered circuits, reducing PII leaks.
  • Figure 2: Comparison of Patch with other defenses: lower perplexity, precision and recall scores are preferred (closer to lower left corner). Patch consistently performs better than prior defenses across all models.
  • Figure 3: Results for PII leakage and faithfulness: we present the precision (left), and recall (middle) of the PII extracted, and finally, the faithfulness of the discovered PII circuits (right).
  • Figure 4: Results of Patch across varying hyperparameters: we compare Patch-Baseline with Patch-DP($\epsilon=8$) with alternating ablation strategies, zero and mean, and edge thresholds, $95$ and $99$.
  • Figure 5: Influential PII Circuit Components: average EAP-IG scores for each attention head in each layer, across all identified PII circuits.
  • ...and 2 more figures