Watermarking Generative Tabular Data

Hengzhi He, Peiyu Yu, Junpeng Ren, Ying Nian Wu, Guang Cheng

TL;DR

This work introduces a simple, binning-based watermarking scheme for generative tabular data that embeds watermarks into selected green-list intervals while preserving data fidelity under a $1/m$ distortion bound. Detection is grounded in a principled hypothesis-testing framework that yields a chi-square statistic with asymptotic independence across columns, enabling robust watermark detection even in high-dimensional settings. The paper proves fidelity guarantees, analyzes robustness against additive-noise attacks, and demonstrates strong empirical performance on synthetic and real-world datasets using multiple tabular generators, with negligible impact on downstream utility. Overall, the approach provides a theoretically solid and practically effective method to watermark tabular data for security and traceability in AI-generated datasets.

Abstract

In this paper, we introduce a simple yet effective tabular data watermarking mechanism with statistical guarantees. We show theoretically that the proposed watermark can be effectively detected while faithfully preserving data fidelity, and that it demonstrates appealing robustness against additive-noise attacks. The general idea is to achieve watermarking through a strategic embedding based on simple data binning. Specifically, the scheme divides each feature's value range into finely segmented intervals and embeds watermarks into selected "green list" intervals. To detect the watermarks, we develop a principled statistical hypothesis-testing framework with minimal assumptions: it remains valid as long as the underlying data distribution has a continuous density function. The watermarking efficacy is demonstrated through rigorous theoretical analysis and empirical validation, highlighting its utility in enhancing the security of synthetic and real-world datasets.
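To make the detection idea concrete, here is a minimal sketch of a chi-square style test, written under assumptions that this page does not confirm. It assumes that, without a watermark and with a continuous density, each fractional part lands in a green bin with probability close to 1/2, so each column's green count can be standardized and the squared scores summed across columns. The function names, the shared green list across columns, and the stand-in "marked" table are all hypothetical; the paper's exact statistic may differ.

```python
import math
import random


def green_counts(table, green, m):
    """Count, per column, how many fractional parts fall in a green bin."""
    width = 1.0 / (2 * m)
    counts = [0] * len(table[0])
    for row in table:
        for j, x in enumerate(row):
            frac = x - math.floor(x)
            b = min(int(frac / width), 2 * m - 1)  # bin index of the fractional part
            counts[j] += b in green
    return counts


def chi_square_statistic(table, green, m):
    """Sum of squared z-scores of the per-column green counts.

    Assumption: under the no-watermark hypothesis each count is roughly
    Binomial(n, 1/2), so the sum is compared with a chi-square law whose
    degrees of freedom equal the number of columns.
    """
    n = len(table)
    stat = 0.0
    for t in green_counts(table, green, m):
        z = (2 * t - n) / math.sqrt(n)
        stat += z * z
    return stat


rng = random.Random(0)
m, n, p = 100, 500, 4
green = {2 * k + rng.randint(0, 1) for k in range(m)}  # one bin per consecutive pair
green_list = sorted(green)
width = 1.0 / (2 * m)

# Unwatermarked Gaussian table.
clean = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
# Stand-in for a watermarked table: every fractional part is drawn inside a green bin.
marked = [
    [math.floor(x) + (rng.choice(green_list) + rng.random()) * width for x in row]
    for row in clean
]

stat_clean = chi_square_statistic(clean, green, m)
stat_marked = chi_square_statistic(marked, green, m)
```

In this sketch a table is flagged as watermarked when the statistic exceeds a chosen upper quantile of the chi-square distribution with `p` degrees of freedom; the watermarked stand-in yields a far larger statistic than the clean table.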

Paper Structure (48 sections, 9 theorems, 61 equations, 5 figures, 11 tables, 1 algorithm)


Key Result

Theorem 1

Let $\mathbf{X}$ be an $n \times p$ dataframe, and let $\mathbf{X}_w$ denote its watermarked version. Conditioned on $\mathbf{X}$, it holds with probability one that $\|\mathbf{X}_w - \mathbf{X}\|_{\infty} \le \frac{1}{m}$, that is, no element changes by more than $1/m$, where $m$ is the number of "green list" intervals, a parameter controlling the granularity of the watermarking process.
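The $1/m$ distortion bound quoted in the TL;DR can be motivated directly from the construction described in the Figure 2 caption. The following is a short sketch rather than the paper's proof; the symbols $x$, $x'$, $f$, $f'$, and $w$ are introduced here for illustration. Let $w = 1/(2m)$ be the width of each of the $2m$ bins, let $x$ be an element with fractional part $f$, and let $x'$ be its watermarked value with fractional part $f'$. If $f$ already lies in a green bin, then $x' = x$. Otherwise $f$ lies in a red bin whose pair partner is green and adjacent, so the nearest green bin is adjacent and $f'$ is drawn from it. Since the integer part is unchanged,

$$
|x' - x| = |f' - f| \le 2w = 2 \cdot \frac{1}{2m} = \frac{1}{m}.
$$

Taking the maximum over all $n \times p$ elements gives the same bound for the whole table.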

Figures (5)

  • Figure 1: Overview of the tabular data watermarking scheme.
  • Figure 2: Illustration of our proposed watermarking scheme for tabular data. Specifically, our scheme consists of three major steps: i) dividing the continuous interval $[0, 1]$ into $2m$ equal parts, forming $m$ pairs of consecutive intervals; ii) randomly selecting one interval from each pair, resulting in the set of $m$ "green list" intervals; and iii) sampling a new fractional part for the input element from the nearest "green list" interval if the original fractional part falls outside of the green list.
  • Figure 3: KDE plots for the Gaussian data with and without our proposed watermark; "wm" is shorthand for our watermark, and figures and tables henceforth follow this format.
  • Figure 5: Detection rates of the proposed watermark applied to tabular data with different numbers of rows and columns. In the single-column and multi-column attack subfigures, we plot the detection rates of the watermark after adding noise with different variance levels (in $\log_{10}$ scale); prp is the proportion of the elements in a table being modified. Rates are computed over $1000$ independent samples; error bars are over 3 runs. Zoom in for more details.
  • Figure A1: Histogram of spiky column data distributions. Examples generated by TabDDPM.
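The three steps listed in the Figure 2 caption can be sketched in plain Python as follows. This is an illustrative reading of the caption, not the authors' implementation: the function names, the use of a single green list for every column, the uniform draw inside the target bin, and the tie-breaking rule are all assumptions.

```python
import math
import random


def make_green_list(m, rng):
    """Steps i-ii: split [0, 1) into 2m equal bins, pair them, keep one bin per pair."""
    return {2 * k + rng.randint(0, 1) for k in range(m)}


def watermark_value(x, green, m, rng):
    """Step iii: move the fractional part of x into the nearest green bin if needed."""
    width = 1.0 / (2 * m)
    whole = math.floor(x)
    frac = x - whole
    b = min(int(frac / width), 2 * m - 1)  # index of the bin holding frac
    if b in green:
        return x  # already inside a green bin, leave untouched
    # b is red, so its pair partner (an adjacent bin) is green; the other
    # neighbour may be green too, in which case the closer one is used.
    options = []
    if b - 1 >= 0 and (b - 1) in green:
        options.append((frac - b * width, b - 1))        # gap to the lower bin
    if b + 1 < 2 * m and (b + 1) in green:
        options.append(((b + 1) * width - frac, b + 1))  # gap to the upper bin
    _, target = min(options)
    # Assumption: the new fractional part is uniform inside the target bin.
    return whole + (target + rng.random()) * width


def watermark_table(table, m, seed=0):
    """Apply the same green list to every element of a list-of-rows table."""
    rng = random.Random(seed)
    green = make_green_list(m, rng)
    marked = [[watermark_value(x, green, m, rng) for x in row] for row in table]
    return marked, green


data_rng = random.Random(1)
table = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
marked, green = watermark_table(table, m=50)

# Largest element-wise change introduced by the watermark.
max_shift = max(abs(a - b) for r, s in zip(table, marked) for a, b in zip(r, s))
```

Because a red bin's pair partner is always green, the resampled fractional part never moves further than two bin widths, so `max_shift` stays within the $1/m$ bound (up to floating-point rounding).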

Theorems & Definitions (21)

  • Theorem 1: Fidelity
  • Corollary 1.1
  • Remark 1
  • Lemma 1: Prelim. for detection
  • Remark 2
  • Theorem 2: Asymptotic independence
  • Remark 3
  • Theorem 3
  • Theorem 4: Attack success rate
  • proof
  • ...and 11 more