Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions

Jinxin Liu; Zao Yang

TL;DR

This work tackles privacy leakage tracing in large language models by applying Influence Functions (IFs) and shows that tokens with large gradient norms can mislead tracing. It introduces Heuristically Adjusted IF (HAIF), which downweights high-gradient tokens so that tracing more accurately identifies the training data actually responsible for a leak. Two ground-truth datasets, PII-E and PII-CR, enable concrete evaluation of extraction and reasoning leakage, with HAIF significantly outperforming state-of-the-art IFs across GPT-2 and QWen-1.5 models and showing robustness on real-world CLUECorpus2020 data. The findings suggest HAIF as a practical, scalable tool for auditing and mitigating training-data privacy leaks in diverse LM settings.

Abstract

The responses generated by Large Language Models (LLMs) can include sensitive information from individuals and organizations, leading to potential privacy leakage. This work implements Influence Functions (IFs) to trace privacy leakage back to the training data, thereby mitigating privacy concerns of Language Models (LMs). However, we notice that current IFs struggle to accurately estimate the influence of tokens with large gradient norms, potentially overestimating their influence. When tracing the most influential samples, this leads to frequently tracing back to samples that contain tokens with large gradient norms, which overshadow the actual most influential samples even if the influences of those samples are well estimated. To address this issue, we propose Heuristically Adjusted IF (HAIF), which reduces the weight of tokens with large gradient norms, thereby significantly improving the accuracy of tracing the most influential samples. To establish easily obtained ground truth for tracing privacy leakage, we construct two datasets, PII-E and PII-CR, representing two distinct scenarios: one with identical text in the model outputs and pre-training data, and the other where models leverage their reasoning abilities to generate text divergent from pre-training data. HAIF significantly improves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E dataset and 3.21% to 45.93% on the PII-CR dataset, compared to the best SOTA IFs against various GPT-2 and QWen-1.5 models. HAIF also outperforms SOTA IFs on the real-world pretraining dataset CLUECorpus2020, demonstrating strong robustness regardless of prompt and response lengths.
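The effect described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the paper's HAIF formula: it assumes a simple first-order influence score (the dot product between a test gradient and each token gradient) and an inverse-gradient-norm token weight, both chosen here only for illustration. All names (`sample_influence`, `trace`, `leak_source`, `noisy`) and all gradient values are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(u, u))


def sample_influence(test_grad, token_grads, adjust=False, eps=1e-12):
    """Sum per-token influence scores for one training sample.

    Plain score: dot(test_grad, g) for every token gradient g.
    Adjusted score (illustrative assumption, not the paper's formula):
    each token's contribution is divided by that token's gradient norm,
    so tokens with large gradient norms are down-weighted.
    """
    total = 0.0
    for g in token_grads:
        weight = 1.0 / (norm(g) + eps) if adjust else 1.0
        total += weight * dot(test_grad, g)
    return total


def trace(test_grad, samples, adjust=False):
    """Return the name of the training sample with the highest score."""
    scores = {
        name: sample_influence(test_grad, grads, adjust)
        for name, grads in samples.items()
    }
    return max(scores, key=scores.get)


# Hypothetical gradients: "leak_source" is well aligned with the test
# gradient, while "noisy" contains one token with a very large gradient norm.
test_grad = [1.0, 0.0]
samples = {
    "leak_source": [[0.9, 0.1], [0.8, 0.0]],
    "noisy": [[3.0, 40.0], [0.0, 0.1]],
}

plain_top = trace(test_grad, samples)
adjusted_top = trace(test_grad, samples, adjust=True)
print(plain_top, adjusted_top)
```

In this toy setup the unadjusted score ranks the sample with the large-norm token first, while the down-weighted score ranks the well-aligned sample first, which mirrors the failure mode and the remedy that the abstract describes.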

Paper Structure (30 sections, 9 theorems, 50 equations, 4 figures, 6 tables, 1 algorithm)

Introduction
Related Work and Preliminaries
Hessian-based Influence Functions
Training Trajectory-based Influence Functions
Methodology
Problem Statement
Conditions for Using IFs in Deep Learning
Adjusted Influence Functions
Experiments
Datasets
PII-E
PII-CR
Privacy Learning Abilities of LM
Dataset Validation
Privacy Tracing Accuracy
...and 15 more sections

Key Result

Lemma 1

Assume that the model is trained with mini-batch SGD, where $l$ is $C^2(\theta) \cap C(\epsilon_{k,j})$, and the parameters satisfy the update rule eq:param update. The influence of down-weighting $z_{k,j}$ on the parameters is given by: where $H_{0,t}=\sum_{i=1}^n B_{z_i,t} \frac{\partial^2 L}{\partial \theta^2}$ and $\theta_{0,t}$ are the Hessian matrix and model parameters at step $t$ without altering the weight of $z_{k,

Figures (4)

Figure 1: Comparison of Parameter Estimation Errors and Gradient Norms. We trained a Logistic Regression model using an SGD optimizer on the 10-class MNIST dataset. For HIF, we calculate the perturbed inverse of the Hessian matrix; for TTIF, batch information is also taken into account.
Figure 2: Tracing accuracy on CLUECorpus2020 dataset with different token offsets and lengths.
Figure 3: PII Prediction Accuracy of GPT2 and QWen1.5. The red dashed line represents the random-guess performance of the QWen1.5-0.5B model, trained exclusively with instruction data.
Figure 4: Comparison of Token Influences and Token Gradient Norms

Theorems & Definitions (13)

Lemma 1
Theorem 1
Corollary 1
Corollary 2
Lemma
proof
Theorem
proof
Corollary
proof
...and 3 more
