Provable Differentially Private Computation of the Cross-Attention Mechanism

Yekun Ke; Yingyu Liang; Zhenmei Shi; Zhao Song; Jiahao Zhang

TL;DR

The paper addresses privacy leakage in cross-attention for large generative models (LGMs) by proving differential privacy (DP) guarantees for the cross-attention mechanism. It recasts cross-attention as a weighted distance problem and builds a family of DP data structures (DPTree) that answer private Softmax queries via polynomial kernel approximation, using $\widetilde{O}(ndr^2)$ memory, $\widetilde{O}(ndr^2)$ initialization time, and $\widetilde{O}(dr^2)$ query time per token, with explicit additive and relative error bounds. The structure remains robust to adaptive queries and extends to Softmax and high-dimensional settings through DPTreeSoftmax and DPTreeHighDim. To the authors' knowledge, this is the first provable DP mechanism for cross-attention. It provides a principled privacy foundation for system prompts, RAG data, and other external context in LGMs, including retrieval pipelines and diffusion-based applications.

Abstract

Cross-attention has emerged as a cornerstone module in modern artificial intelligence, underpinning critical applications such as retrieval-augmented generation (RAG), system prompting, and guided stable diffusion. However, there is a rising concern about securing the privacy of cross-attention, as the underlying key and value matrices frequently encode sensitive data or private user information. In this work, we introduce a novel data structure designed to enforce differential privacy (DP) for cross-attention mechanisms, accompanied by provable theoretical guarantees. Specifically, letting $n$ denote the input sequence length, $d$ the feature dimension, $R$ the maximum magnitude of query and key matrices, $R_w$ the maximum magnitude of the value matrix, and $r, s, ε_s$ the parameters for polynomial kernel methods, our proposed structure achieves $\widetilde{O}(ndr^2)$ space and initialization complexity, with a query time of $\widetilde{O}(d r^2)$ per token. Moreover, we demonstrate that our mechanism satisfies $(ε, δ)$-DP, incurring an additive error of $\widetilde{O}((1-ε_s)^{-1} n^{-1} ε^{-1} R^{2s} R_w r^2)$ and a relative error of $2ε_s/(1-ε_s)$ with respect to the ground truth. Crucially, our framework maintains robustness against adaptive queries, ensuring security even in adversarial settings. To the best of our knowledge, this constitutes the first approach providing provable differential privacy for cross-attention, establishing a foundation for future privacy-preserving algorithms in large generative models (LGMs).
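For orientation, the non-private quantity that the abstract's error bounds are measured against can be sketched as follows. This is a minimal NumPy illustration of softmax cross-attention, not the paper's Algorithm: the function name `cross_attention`, the `1/d` scaling of the scores, the number of queries `m`, and the random test matrices are all assumptions made for the example. Here `K` and `V` play the role of the sensitive context (system prompt, RAG documents) and `Q` the user queries.

```python
import numpy as np

def cross_attention(Q, K, V):
    """Non-private softmax cross-attention: each query row attends over all key rows."""
    d = Q.shape[1]
    scores = Q @ K.T / d                         # 1/d scaling is an assumed convention
    scores -= scores.max(axis=1, keepdims=True)  # numerical stabilisation; cancels after normalisation
    A = np.exp(scores)                           # unnormalised attention weights, shape (m, n)
    D_inv = 1.0 / A.sum(axis=1, keepdims=True)   # inverse of the row sums
    return D_inv * (A @ V)                       # shape (m, d)

# Hypothetical sizes and data: m queries, n context tokens, feature dimension d.
rng = np.random.default_rng(0)
m, n, d = 4, 16, 8
Q = rng.uniform(-1, 1, size=(m, d))  # user queries
K = rng.uniform(-1, 1, size=(n, d))  # sensitive keys
V = rng.uniform(-1, 1, size=(n, d))  # sensitive values
out = cross_attention(Q, K, V)
print(out.shape)
```

Each output row is a convex combination of the rows of `V`, which is why bounds on the magnitude of the value matrix ($R_w$ in the abstract) enter the additive error.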

Paper Structure (43 sections, 33 theorems, 73 equations, 1 figure, 7 algorithms)

Introduction
Related Works
Differential Privacy Guarantee Analysis.
Differential Privacy in Data Structure and Attention.
Cross-Attention in System Prompt, RAG, Stable Diffusion and More.
Preliminary
Main Results: Cross-Attention
Key Data Structure: DPTree
Technique Overview
DPTree, DPTreeDistance, and DPTreeHighDim
Softmax Activation
Adaptive Query Data Structure
Discussion
Conclusion
Roadmap.
...and 28 more sections

Key Result

Theorem 1.2

Let $Q,K,V, \mathrm{Attn}$ be defined in Definition def:cross. Let $p_f$ be the probability of failure parameter. Let $r,s,\epsilon_s$ be the parameters of the polynomial kernel methods (Lemma lem:exp_inner_prod:formal). Then, our Algorithm alg:DP_cross_attn requires $\widetilde{O}(ndr^2)$ total mem

Figures (1)

Figure 1: The visualization of how to compute the weighted $\ell_1$ distance for rounded dataset $X \in [0,1]^{10}$. The number above each $x_i$ is $w_i$. See Algorithm \ref{alg:preprocessing_one_d} for details. Suppose $y=0$. Then $\sum_{i = 1}^n w_i |y - x_i| = 0.1 \cdot 2.2 + 0.3 \cdot 3.1 + 0.3 \cdot (-2) + 0.3 \cdot (-3) + 0.4 \cdot 2 + 0.6 \cdot 6 + 0.7 \cdot 0.5 + 0.9 \cdot (-1) + 0.9 \cdot 1 = 4.4$. See more details in Lemma \ref{lem:weighted_l1}.
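The arithmetic in the caption can be checked with a short, non-private sketch. It uses the nine $(x_i, w_i)$ pairs that appear in the caption's sum. The function names `weighted_l1` and `weighted_l1_prefix` are hypothetical, and the prefix-sum variant is only an illustration of why precomputed partial sums make such queries cheap; it is not the paper's Algorithm and adds no noise.

```python
import numpy as np

# The (x_i, w_i) pairs read off the sum in the Figure 1 caption, sorted by x_i.
x = np.array([0.1, 0.3, 0.3, 0.3, 0.4, 0.6, 0.7, 0.9, 0.9])
w = np.array([2.2, 3.1, -2.0, -3.0, 2.0, 6.0, 0.5, -1.0, 1.0])

def weighted_l1(y, x, w):
    """Direct evaluation of sum_i w_i * |y - x_i|."""
    return float(np.sum(w * np.abs(y - x)))

def weighted_l1_prefix(y, x, w):
    """Same quantity from prefix sums of w_i and w_i * x_i (x must be sorted ascending)."""
    cw = np.concatenate(([0.0], np.cumsum(w)))        # cw[k]  = sum of the first k weights
    cwx = np.concatenate(([0.0], np.cumsum(w * x)))   # cwx[k] = sum of the first k values w_i * x_i
    k = int(np.searchsorted(x, y, side="right"))      # number of points with x_i <= y
    left = y * cw[k] - cwx[k]                         # points at or below y contribute w_i * (y - x_i)
    right = (cwx[-1] - cwx[k]) - y * (cw[-1] - cw[k]) # points above y contribute w_i * (x_i - y)
    return float(left + right)

print(round(weighted_l1(0.0, x, w), 6))
print(round(weighted_l1_prefix(0.0, x, w), 6))
```

With $y = 0$ both functions reproduce the caption's value of 4.4 up to floating-point rounding. Once the two prefix arrays are built, each query needs only a binary search and a constant number of lookups.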

Theorems & Definitions (70)

Definition 1.1: Softmax cross-attention, vsp+17
Theorem 1.2: Main result; Informal version of Theorem \ref{thm:cross_attention}
Definition 3.1: Neighboring dataset
Definition 3.2: Sensitivity
Definition 3.3: $(\epsilon, \delta)$-DP
Definition 3.4: Truncated Laplace distribution, gdgk20
Lemma 3.6: Laplace mechanism, dr14, gdgk20, see Lemma 2.2 in aimn23
Theorem 4.1: Softmax cross-attention, informal version of Theorem \ref{thm:cross_attention:formal}
Remark 4.2
Definition 5.1: Weighted Softmax query (without normalization)
...and 60 more
