Differentially Private Multimodal In-Context Learning

Ivoline C. Ngong, Zarreen Reza, Joseph P. Near

TL;DR

Differentially Private Multimodal Task Vectors (DP-MTV) is the first framework to enable many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy. It aggregates hundreds of demonstrations into compact task vectors in activation space.

Abstract

Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.
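The construction described in the abstract (disjoint chunks, per-layer clipping, one noise addition on the aggregate) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `clip_l2`, `dp_task_vectors`, and `extract_fn` are hypothetical, the model forward pass is abstracted into an assumed callable that maps a chunk to one activation vector per layer, and the noise scale `sigma` is assumed to be calibrated by the caller.

```python
import math
import random


def clip_l2(vec, c):
    """Rescale vec so its L2 norm is at most c (unchanged if already within c)."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def dp_task_vectors(records, m, extract_fn, c, sigma, rng):
    """Hypothetical sketch: one noisy mean activation vector per layer.

    records    -- private demonstrations (any objects)
    m          -- number of disjoint chunks
    extract_fn -- assumed callable: chunk -> {layer_name: activation vector}
    c          -- per-layer clipping threshold
    sigma      -- Gaussian noise std, calibrated by the caller
    rng        -- random.Random instance used for the noise
    """
    if not 0 < m <= len(records):
        raise ValueError("need 1 <= m <= len(records)")
    # Disjoint partition: record i goes to chunk i % m, so it affects one chunk only.
    chunks = [records[i::m] for i in range(m)]
    sums = {}
    for chunk in chunks:
        for layer, vec in extract_fn(chunk).items():
            # Per-layer clipping bounds this chunk's contribution.
            clipped = clip_l2(vec, c)
            acc = sums.setdefault(layer, [0.0] * len(clipped))
            for j, x in enumerate(clipped):
                acc[j] += x
    # Average over the m chunks, then add Gaussian noise once to the aggregate.
    return {
        layer: [s / m + rng.gauss(0.0, sigma) for s in acc]
        for layer, acc in sums.items()
    }
```

Because noise is added only once, the returned vectors can be reused for any number of inference queries; under differential privacy, further processing of a private output incurs no extra privacy cost.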

Paper Structure

This paper contains 47 sections, 3 theorems, 5 equations, 4 figures, 2 tables, 2 algorithms.

Key Result

Theorem 3.1

Disjoint partitioning ensures each record affects exactly one chunk, and per-layer clipping bounds each chunk's contribution. The $\ell_2$-sensitivity is $\Delta_2 = \sqrt{|\mathcal{S}|} \cdot C / m$, which is $\sqrt{H}$ times smaller than per-head clipping, reducing noise by a factor of $\sqrt{H}$ ...
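The sensitivity comparison in Theorem 3.1 can be checked numerically with a short sketch. Several things here are assumptions rather than statements from the paper: $|\mathcal{S}|$ is read as the number of clipped layers, $H$ as the number of heads per layer, and $m$ as the number of chunks; the function names and example values are hypothetical. The noise calibration shown is the classical Gaussian mechanism bound, whose analysis holds for $\varepsilon < 1$, and the paper may calibrate its noise differently.

```python
import math


def sensitivity_per_layer(num_layers, c, m):
    """L2 sensitivity stated in Theorem 3.1: sqrt(|S|) * C / m."""
    return math.sqrt(num_layers) * c / m


def sensitivity_per_head(num_layers, heads_per_layer, c, m):
    """Per-head clipping at the same threshold C: sqrt(|S| * H) * C / m.

    Assumes each of the H heads in every layer is clipped to norm C separately.
    """
    return math.sqrt(num_layers * heads_per_layer) * c / m


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian mechanism noise std (analysis valid for epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


# Hypothetical configuration, chosen only for illustration.
layers, heads, c, m = 32, 32, 1.0, 20
per_layer = sensitivity_per_layer(layers, c, m)
per_head = sensitivity_per_head(layers, heads, c, m)
ratio = per_head / per_layer  # equals sqrt(heads) up to floating-point error
sigma = gaussian_sigma(per_layer, epsilon=0.5, delta=1e-5)
```

Since the Gaussian noise scale is proportional to the sensitivity, the $\sqrt{H}$ reduction in sensitivity carries over directly to the noise standard deviation.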

Figures (4)

  • Figure 1: DP-MTV approach. Construction (offline): Partition data into disjoint chunks, extract and clip activations, compute the mean, add Gaussian noise (steps 1-2), then select heads via public data or a private mechanism (step 3). Inference (online): Replace activations at selected heads with private task vectors. Post-processing enables unlimited queries at no additional privacy cost.
  • Figure 2: Model comparison at $\varepsilon=1.0$. Brackets show the baseline gap (MTV $-$ Zero-shot); larger gaps predict better DP-MTV performance. (a) VizWiz has the largest gaps and strongest DP-MTV results. (b) On 2-way classification, DP-MTV often matches or exceeds MTV.
  • Figure 3: Privacy-utility tradeoffs for Qwen-VL across privacy budgets. Dashed lines: non-private MTV; dotted lines: zero-shot. (a) On VQA, DP-MTV performs best on datasets where MTV provides the largest gains over zero-shot (e.g., VizWiz); performance varies by architecture (Figure 2). (b) On classification, DP-MTV matches or exceeds MTV at practical privacy budgets.
  • Figure 4: Effect of clipping threshold $C$ on VizWiz accuracy (Qwen-VL, $\varepsilon=1.0$). $C=1.0$ balances signal preservation with noise calibration. Performance degrades gradually at higher thresholds.
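The inference step in Figure 1 (replacing activations at selected heads with the private task vectors) can be illustrated with a toy sketch. It assumes activations are held in a plain dictionary keyed by `(layer, head)`; a real VLM would instead patch attention-head outputs during the forward pass, and the function name `apply_task_vectors` is hypothetical.

```python
def apply_task_vectors(activations, task_vectors, selected_heads):
    """Return a copy of activations with the selected heads overwritten.

    activations    -- dict mapping (layer, head) -> activation vector for the query
    task_vectors   -- dict mapping (layer, head) -> noisy private task vector
    selected_heads -- iterable of (layer, head) keys chosen during construction
    """
    patched = dict(activations)  # leave the caller's dictionary untouched
    for key in selected_heads:
        patched[key] = list(task_vectors[key])  # replace, do not add
    return patched


# Toy example: two heads in one layer, only head 1 is selected.
query_acts = {(0, 0): [0.1, 0.2], (0, 1): [0.3, 0.4]}
private_vecs = {(0, 1): [0.9, -0.5]}
patched = apply_task_vectors(query_acts, private_vecs, [(0, 1)])
```

The patching reads only the already-noised task vectors and never touches the private demonstrations, which is why repeated queries add no privacy cost.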

Theorems & Definitions (7)

  • Definition 2.1: $(\varepsilon, \delta)$-Differential Privacy
  • Theorem 3.1: Private Mean Activations
  • Theorem 3.2: Public-Data Variant
  • Theorem 3.3: Private-Only Variant
  • Proof of Theorem 3.1
  • Proof of Theorem 3.2
  • Proof of Theorem 3.3
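
For reference, the standard form of $(\varepsilon, \delta)$-differential privacy, which Definition 2.1 presumably states, is given below. The symbols $\mathcal{M}$ (the randomized mechanism), $D, D'$ (adjacent datasets), and $O$ (a set of outputs) are notation assumed here; the paper's own notation and its exact notion of adjacency may differ.

```latex
% A randomized mechanism \mathcal{M} is (\varepsilon, \delta)-differentially private
% if, for all adjacent datasets D, D' and all measurable output sets O:
\Pr[\mathcal{M}(D) \in O] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in O] + \delta
```

Smaller $\varepsilon$ and $\delta$ mean the output distribution changes less when a single record changes, which is the sense in which $\varepsilon = 1.0$ is treated as a meaningful privacy constraint in the results above.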