DP-TabICL: In-Context Learning with Differentially Private Tabular Data

Alycia N. Carey, Karuna Bhaila, Kennedy Edemacu, Xintao Wu

TL;DR

The paper tackles privacy risks in in-context learning over tabular data by introducing two DP-based frameworks, LDP-TabICL (local DP via randomized response with reconstruction) and GDP-TabICL (global DP via private averages with subsampling). It demonstrates that DP-protected tabular demonstrations can be serialized into prompts and used to query untrusted LLMs with competitive accuracy, especially under stringent privacy, across eight real-world datasets. Key contributions include formal DP guarantees for both local and global settings, a simple serialization pipeline, and extensive empirical analysis comparing against strong baselines and model sizes ($\text{Llama-2-13B}$ vs $\text{Llama-2-7B}$). The findings support the viability of privacy-preserving ICL for tabular data and offer practical guidance on when to apply LDP vs GDP approaches, with implications for privacy-aware data sharing and prompt-based learning in sensitive domains.

Abstract

In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks by conditioning on demonstrations of question-answer pairs and it has been shown to have comparable performance to costly model retraining and fine-tuning. Recently, ICL has been extended to allow tabular data to be used as demonstration examples by serializing individual records into natural language formats. However, it has been shown that LLMs can leak information contained in prompts, and since tabular data often contain sensitive information, understanding how to protect the underlying tabular data used in ICL is a critical area of research. This work serves as an initial investigation into how to use differential privacy (DP) -- the long-established gold standard for data privacy and anonymization -- to protect tabular data used in ICL. Specifically, we investigate the application of DP mechanisms for private tabular ICL via data privatization prior to serialization and prompting. We formulate two private ICL frameworks with provable privacy guarantees in both the local (LDP-TabICL) and global (GDP-TabICL) DP scenarios via injecting noise into individual records or group statistics, respectively. We evaluate our DP-based frameworks on eight real-world tabular datasets and across multiple ICL and DP settings. Our evaluations show that DP-based ICL can protect the privacy of the underlying tabular data while achieving comparable performance to non-LLM baselines, especially under high privacy regimes.
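
As a rough illustration of the local setting, where noise is injected into individual records before they are serialized, the sketch below applies generalized (k-ary) randomized response to each categorical feature of a record. This is a textbook variant and is not claimed to be the exact mechanism or reconstruction step used in LDP-TabICL. The function names, the per-feature budget split, and the example feature domains are all assumptions made for illustration.

```python
import math
import random


def randomized_response(value, domain, epsilon, rng=random):
    """Generalized (k-ary) randomized response for one categorical value.

    Keeps the true value with probability e^eps / (e^eps + k - 1) and
    otherwise reports one of the other k - 1 values uniformly at random.
    """
    k = len(domain)
    if k < 2:
        raise ValueError("domain must contain at least two values")
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_keep:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)


def privatize_record(record, domains, epsilon_per_feature, rng=random):
    """Perturb every feature of a record independently (hypothetical helper).

    Assumption: each feature gets its own budget, so by sequential
    composition the whole record costs len(record) * epsilon_per_feature.
    """
    return {
        name: randomized_response(value, domains[name], epsilon_per_feature, rng)
        for name, value in record.items()
    }


if __name__ == "__main__":
    rng = random.Random(0)
    domains = {"sex": ["female", "male"], "education": ["HS", "BSc", "MSc", "PhD"]}
    record = {"sex": "female", "education": "MSc"}
    print(privatize_record(record, domains, epsilon_per_feature=1.0, rng=rng))
```

Smaller values of `epsilon_per_feature` push the keep probability toward `1/k`, which is why accuracy degrades in high privacy regimes unless the downstream step compensates for the injected noise.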

Paper Structure

This paper contains 38 sections, 6 theorems, 3 equations, 2 figures, 13 tables, 2 algorithms.

Key Result

Theorem 1

Let $\mathcal{M}_1:\mathbb{N}^{|\mathcal{X}|}\to\mathcal{R}_1$ be an $\epsilon_1$-DP algorithm, and let $\mathcal{M}_2:\mathbb{N}^{|\mathcal{X}|}\to\mathcal{R}_2$ be an $\epsilon_2$-DP algorithm. Then their combination, defined to be $\mathcal{M}_{1,2}:\mathbb{N}^{|\mathcal{X}|}\to\mathcal{R}_1\times\mathcal{R}_2$ by the mapping $\mathcal{M}_{1,2}(x)=(\mathcal{M}_1(x),\mathcal{M}_2(x))$, is $(\epsilon_1+\epsilon_2)$-DP.
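
One way to see why the privacy budgets add under sequential composition (Theorem 1) is the short argument below. It assumes discrete output spaces and independent randomness in the two mechanisms. The symbols $x, y$ (neighboring databases) and $r_1, r_2$ (outputs) are introduced here for illustration only.

$$
\frac{\Pr[\mathcal{M}_{1,2}(x)=(r_1,r_2)]}{\Pr[\mathcal{M}_{1,2}(y)=(r_1,r_2)]}
=\frac{\Pr[\mathcal{M}_1(x)=r_1]}{\Pr[\mathcal{M}_1(y)=r_1]}\cdot\frac{\Pr[\mathcal{M}_2(x)=r_2]}{\Pr[\mathcal{M}_2(y)=r_2]}
\le e^{\epsilon_1}\,e^{\epsilon_2}
=e^{\epsilon_1+\epsilon_2}.
$$

For example, releasing two statistics of the same dataset with $\epsilon_1=0.5$ and $\epsilon_2=1.5$ consumes a total budget of $\epsilon_1+\epsilon_2=2$.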

Figures (2)

  • Figure 1: Overview of our two proposed methods. In Phase I, DP-protected demonstration examples are generated using either LDP-TabICL (top left) or GDP-TabICL (bottom left). In Phase II, the generated DP-protected demonstration examples are combined with a non-private query from a user to construct the prompt. In Phase III, the constructed prompt is sent to an untrusted LLM and the answer is returned to the user. More details are presented in the paper's method section.
  • Figure 2: Accuracy of GDP-TabICL when the size of $\mathcal{S}$ is altered for the Adult (top) and Heart (bottom) datasets with $\epsilon=5$.
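
Phase II of Figure 1 (serializing demonstrations and attaching the user's query) can be sketched as follows. The sentence template `The <feature> is <value>.`, the `Answer:` prompt layout, and the function names are assumptions chosen for illustration, not the paper's confirmed format. The demonstration records are assumed to have already been privatized in Phase I.

```python
def serialize_record(record):
    """Turn one tabular record (a dict) into a natural-language string.

    Assumed template: "The <feature> is <value>." for each feature.
    """
    return " ".join(f"The {name} is {value}." for name, value in record.items())


def build_prompt(demonstrations, query_record, question):
    """Assemble an ICL prompt from (record, label) pairs plus one query.

    demonstrations: list of (dict, label) pairs, assumed DP-protected.
    query_record:   the user's non-private record, left unanswered.
    """
    blocks = [
        f"{serialize_record(record)}\n{question}\nAnswer: {label}"
        for record, label in demonstrations
    ]
    blocks.append(f"{serialize_record(query_record)}\n{question}\nAnswer:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demos = [
        ({"age": 39, "workclass": "Private"}, "no"),
        ({"age": 52, "workclass": "Self-employed"}, "yes"),
    ]
    query = {"age": 45, "workclass": "Private"}
    print(build_prompt(demos, query, "Does this person earn more than 50K?"))
```

The resulting string is what would be sent to the untrusted LLM in Phase III. Only the demonstration part carries a DP guarantee, since the query comes from the user and is not privatized.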

Theorems & Definitions (9)

  • Definition 1: $(\epsilon,\delta)$-DP [dwork2014algorithmic]
  • Definition 2: Laplace Mechanism [dwork2014algorithmic]
  • Theorem 1: Sequential Composition [dwork2014algorithmic]
  • Theorem 2: Parallel Composition
  • Proposition 1: Post-processing [dwork2014algorithmic]
  • Definition 3: Differential Privacy under Sampling [li2012sampling]
  • Theorem 3: Amplification Effect of Sampling [li2012sampling]
  • Theorem 4
  • Theorem 5
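
Two of the listed tools, the Laplace mechanism (Definition 2) and the amplification effect of sampling (Theorem 3), can be illustrated with a minimal sketch. It assumes bounded numeric features, a publicly known record count, replace-one neighboring datasets, and sampling in which each record is included independently with probability `beta`. The amplification formula used is the commonly cited form $\ln(1+\beta(e^{\epsilon}-1))$. Function names and parameters are hypothetical and do not reproduce the GDP-TabICL algorithm.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Laplace-mechanism mean of values clipped to [lower, upper].

    Assumption: the number of records n is public and neighbors differ by
    replacing one record, so the mean has sensitivity (upper - lower) / n.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


def amplified_epsilon(epsilon, beta):
    """Effective epsilon when an epsilon-DP algorithm runs on a beta-sample."""
    return math.log(1.0 + beta * (math.exp(epsilon) - 1.0))


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23, 35, 41, 58, 62, 29, 47]
    print(private_mean(ages, lower=18, upper=90, epsilon=1.0, rng=rng))
    print(amplified_epsilon(epsilon=1.0, beta=0.1))
```

Because `amplified_epsilon` is smaller than `epsilon` whenever `beta < 1`, subsampling lets a mechanism add less noise for the same overall guarantee, at the cost of computing statistics from fewer records.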