Differentially Private Multimodal In-Context Learning

Ivoline C. Ngong; Zarreen Reza; Joseph P. Near

TL;DR

Differentially Private Multimodal Task Vectors (DP-MTV) is the first framework to enable many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy, aggregating hundreds of demonstrations into compact task vectors in activation space.

Abstract

Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.

Paper Structure (47 sections, 3 theorems, 5 equations, 4 figures, 2 tables, 2 algorithms)

Introduction
Contributions.
Preliminaries
Differential Privacy
Multimodal Task Vectors
Privacy in In-Context Learning
Differentially Private Multimodal Task Vectors
Problem Setup
Threat Model.
Overview
Construction Phase
Private Mean Activations
Attention Head Selection
Public-Data Variant.
Private-Only Variant.
...and 32 more sections

Key Result

Theorem 3.1

Disjoint partitioning ensures each record affects exactly one chunk, and per-layer clipping bounds each chunk's contribution. The $\ell_2$-sensitivity is $\Delta_2 = \sqrt{|\mathcal{S}|} \cdot C / m$, which is $\sqrt{H}$ times smaller than per-head clipping, reducing noise by a factor of $\sqrt{H}$ ...
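
The construction behind Theorem 3.1 (clip each chunk's activations per layer, average over the $m$ disjoint chunks, add Gaussian noise once) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: $|\mathcal{S}|$ is taken to be the number of layers being clipped, the function names are hypothetical, and the noise scale uses the classical Gaussian-mechanism calibration (valid for $\varepsilon < 1$), which may differ from the calibration used in the paper.

```python
import math
import random

def clip_to_norm(vec, C):
    """Rescale vec so that its L2 norm is at most C."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]

def l2_sensitivity(num_layers, C, m):
    """Sensitivity as stated in Theorem 3.1: sqrt(|S|) * C / m.
    Assumption: |S| is the number of clipped layers."""
    return math.sqrt(num_layers) * C / m

def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism noise scale (assumes 0 < epsilon < 1).
    Assumption: the paper's own calibration is not shown and may differ."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def private_mean_activations(chunk_acts, C, sigma, rng):
    """chunk_acts[i][l] is the activation vector of chunk i at layer l.
    Returns the noisy per-layer mean of the clipped chunk activations."""
    m = len(chunk_acts)
    num_layers = len(chunk_acts[0])
    noisy_mean = []
    for l in range(num_layers):
        # Per-layer clipping bounds each chunk's contribution at this layer.
        clipped = [clip_to_norm(chunk[l], C) for chunk in chunk_acts]
        dim = len(clipped[0])
        mean = [sum(v[d] for v in clipped) / m for d in range(dim)]
        # Noise is added once to the aggregate, not per query.
        noisy_mean.append([x + rng.gauss(0.0, sigma) for x in mean])
    return noisy_mean

# Toy usage: 10 chunks, 4 layers, 8-dimensional activations (synthetic data).
rng = random.Random(0)
chunk_acts = [[[rng.gauss(0, 3) for _ in range(8)] for _ in range(4)]
              for _ in range(10)]
delta2 = l2_sensitivity(num_layers=4, C=1.0, m=10)
sigma = gaussian_sigma(delta2, epsilon=0.5, delta=1e-5)
task_vector = private_mean_activations(chunk_acts, C=1.0, sigma=sigma, rng=rng)
```

Because the sensitivity shrinks as $1/m$, splitting the private data into more chunks lowers the noise needed for a given privacy budget, at the cost of fewer demonstrations per chunk.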

Figures (4)

Figure 1: DP-MTV approach. Construction (offline): Partition data into disjoint chunks, extract and clip activations, compute the mean, add Gaussian noise (steps 1-2), then select heads via public data or a private mechanism (step 3). Inference (online): Replace activations at selected heads with private task vectors. Post-processing enables unlimited queries at no additional privacy cost.
Figure 2: Model comparison at $\varepsilon=1.0$. Brackets show the baseline gap (MTV $-$ Zero-shot); larger gaps predict better DP-MTV performance. (a) VizWiz has the largest gaps and strongest DP-MTV results. (b) On 2-way classification, DP-MTV often matches or exceeds MTV.
Figure 3: Privacy-utility tradeoffs for Qwen-VL across privacy budgets. Dashed lines: non-private MTV; dotted lines: zero-shot. (a) On VQA, DP-MTV performs best on datasets where MTV provides the largest gains over zero-shot (e.g., VizWiz); performance varies by architecture (Figure 2). (b) On classification, DP-MTV matches or exceeds MTV at practical privacy budgets.
Figure 4: Effect of clipping threshold $C$ on VizWiz accuracy (Qwen-VL, $\varepsilon=1.0$). $C=1.0$ balances signal preservation with noise calibration. Performance degrades gradually at higher thresholds.

Theorems & Definitions (7)

Definition 2.1: $(\varepsilon, \delta)$-Differential Privacy
Theorem 3.1: Private Mean Activations
Theorem 3.2: Public-Data Variant
Theorem 3.3: Private-Only Variant
Proof of Theorem 3.1
Proof of Theorem 3.2
Proof of Theorem 3.3
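
For reference, the standard form of the guarantee named in Definition 2.1 is given below; the paper's exact wording and notion of neighboring datasets are not reproduced on this page, so the notation ($\mathcal{M}$ for the mechanism, $D, D'$ for datasets differing in one record, $O$ for a set of outputs) is an assumption. A randomized mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential privacy if, for all neighboring $D, D'$ and all measurable $O$,

$$
\Pr[\mathcal{M}(D) \in O] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in O] + \delta .
$$

Any function applied to the output of $\mathcal{M}$ without further access to the private data satisfies the same guarantee (post-processing), which is why reusing the noisy task vectors across inference queries consumes no additional privacy budget.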
