Watermarking Generative Tabular Data

Hengzhi He; Peiyu Yu; Junpeng Ren; Ying Nian Wu; Guang Cheng

TL;DR

This work introduces a simple, binning-based watermarking scheme for generative tabular data that embeds watermarks into selected green-list intervals while preserving data fidelity under a $1/m$ distortion bound. Detection is grounded in a principled hypothesis-testing framework that yields a chi-square statistic with asymptotic independence across columns, enabling robust watermark detection even in high-dimensional settings. The paper proves fidelity guarantees, analyzes robustness against additive-noise attacks, and demonstrates strong empirical performance on synthetic and real-world datasets using multiple tabular generators, with negligible impact on downstream utility. Overall, the approach provides a theoretically solid and practically effective method to watermark tabular data for security and traceability in AI-generated datasets.

Abstract

In this paper, we introduce a simple yet effective tabular data watermarking mechanism with statistical guarantees. We show theoretically that the proposed watermark can be effectively detected while preserving data fidelity, and that it is robust against additive noise attacks. The general idea is to achieve watermarking through a strategic embedding based on simple data binning. Specifically, the mechanism divides the feature's value range into finely segmented intervals and embeds watermarks into selected "green list" intervals. To detect the watermarks, we develop a principled statistical hypothesis-testing framework with minimal assumptions: it remains valid as long as the underlying data distribution has a continuous density function. The watermarking efficacy is demonstrated through rigorous theoretical analysis and empirical validation, highlighting its utility in enhancing the security of synthetic and real-world datasets.
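The embed-and-detect idea in the abstract can be illustrated with a minimal sketch. It follows the three steps given in the Figure 2 caption (split $[0, 1)$ into $2m$ bins, pick one bin per consecutive pair, resample a fractional part from the nearest green bin), but it is not the authors' implementation. The function names (`make_green_list`, `bin_index`, `embed_value`, `detection_statistic`) are hypothetical, and the exact form of the test statistic is an assumption: each column's green-bin count is standardized against the $n/2$ expected without a watermark, and the squared z-scores are summed over columns, to be compared with a chi-square quantile with $p$ degrees of freedom.

```python
import math
import random


def make_green_list(m, rng):
    """Split [0, 1) into 2m equal bins, pair consecutive bins, pick one per pair."""
    return {2 * k + rng.randrange(2) for k in range(m)}


def bin_index(frac, m):
    """Index (0 .. 2m-1) of the bin that holds a fractional part."""
    return min(int(frac * 2 * m), 2 * m - 1)


def embed_value(x, green, m, rng):
    """Resample the fractional part of x from the nearest green bin if needed."""
    whole = math.floor(x)
    frac = x - whole
    b = bin_index(frac, m)
    if b in green:
        return x  # already in a green bin: leave the entry untouched
    width = 1.0 / (2 * m)
    # The pair partner of a red bin is green and adjacent, so at least one
    # neighbouring bin is always green.
    candidates = []
    if b - 1 in green:
        candidates.append((frac - b * width, b - 1))        # gap to left neighbour
    if b + 1 in green:
        candidates.append(((b + 1) * width - frac, b + 1))  # gap to right neighbour
    _, target = min(candidates)
    return whole + (target + rng.random()) * width


def detection_statistic(columns, greens, m):
    """Assumed statistic: sum over columns of squared z-scores of green counts."""
    stat = 0.0
    for col, green in zip(columns, greens):
        n = len(col)
        hits = sum(bin_index(x - math.floor(x), m) in green for x in col)
        z = (hits - n / 2) / math.sqrt(n / 4)  # no watermark: roughly Binomial(n, 1/2)
        stat += z * z
    return stat


if __name__ == "__main__":
    rng = random.Random(0)
    m, n, p = 10, 500, 3
    greens = [make_green_list(m, rng) for _ in range(p)]
    table = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]  # p columns
    marked = [[embed_value(x, g, m, rng) for x in col]
              for col, g in zip(table, greens)]
    print("statistic without watermark:", detection_statistic(table, greens, m))
    print("statistic with watermark:   ", detection_statistic(marked, greens, m))
```

Under these assumptions, a table would be flagged as watermarked when the statistic exceeds the chosen chi-square critical value (about 7.81 at the 0.05 level for $p = 3$ columns). The detector needs only the green lists, not the original data.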

Paper Structure (48 sections, 9 theorems, 61 equations, 5 figures, 11 tables, 1 algorithm)

Introduction
Watermarking Tabular Data
Problem Statement
Watermarking Tabular Data with Data Binning
An illustrative example
Tabular watermark with marginal data distortion
Detection of the Tabular Data Watermark
Robustness of the Tabular Data Watermark
Experiments
Synthetic Dataset Examples
Fidelity
Detection rate (True Positive Rate)
Robustness
Results on Generative Tabular Data
Datasets & tab. generators
...and 33 more sections

Key Result

Theorem 1

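Let $\mathbf{X}$ be an $n \times p$ dataframe, and let $\mathbf{X}_w$ denote its watermarked version. Conditioned on $\mathbf{X}$, it holds with probability one that every entry of $\mathbf{X}_w$ differs from the corresponding entry of $\mathbf{X}$ by at most $1/m$, i.e. $\max_{i,j} |(\mathbf{X}_w)_{ij} - \mathbf{X}_{ij}| \le 1/m$, where $m$ is the number of "green list" intervals, a parameter controlling the granularity of the watermarking process.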
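One way to see where a $1/m$ bound comes from is sketched below. It assumes the scheme as described in the Figure 2 caption and is not necessarily the authors' proof; the symbols $u$, $u'$, $w$, and $G$ are introduced here for illustration. Let $u \in [0, 1)$ be the fractional part of an entry and $w = 1/(2m)$ the width of each interval. If $u$ already lies in a green interval, the entry is unchanged. Otherwise $u$ lies in the unselected interval of some pair, whose selected partner is adjacent to it, so the nearest green interval $G$ satisfies $\operatorname{dist}(u, G) < w$. The new fractional part $u'$ is drawn from $G$, which has length $w$, so

$$|u' - u| \le \operatorname{dist}(u, G) + w < 2w = \frac{1}{m}.$$

The integer part of the entry is not modified, so the same bound applies to the entry itself, and it shrinks as the number of green intervals $m$ grows.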

Figures (5)

Figure 1: Overview of the tabular data watermarking scheme.
Figure 2: Illustration of our proposed watermarking scheme for tabular data. Specifically, our scheme consists of three major steps: i) dividing the continuous interval $[0, 1]$ into $2m$ equal parts, forming $m$ pairs of consecutive intervals; ii) randomly selecting one interval from each pair, resulting in the set of $m$ "green list" intervals; and iii) sampling a new fractional part for the input element from the nearest "green list" interval if the original fractional part does not already fall in a "green list" interval.
Figure 3: KDE plots for the Gaussian data with and without our proposed watermark; "wm" is shorthand for our watermark, and figures and tables henceforth follow this format.
Figure 5: Detection rates of the proposed watermark applied to tabular data with different numbers of rows and columns. In the single-column and multi-column attack subfigures, we plot the detection rates of the watermark after adding noise with different variance levels (in $\log_{10}$ scale). "prp" is the proportion of the elements in a table being modified. Rates are computed over $1000$ independent samples; error bars are over 3 runs. Zoom in for more details.
Figure A1: Histogram of spiky column data distributions. Examples generated by TabDDPM.

Theorems & Definitions (21)

Theorem 1: Fidelity
Corollary 1.1
Remark 1
Lemma 1: Prelim. for detection
Remark 2
Theorem 2: Asymptotic independence
Remark 3
Theorem 3
Theorem 4: Attack success rate
proof
...and 11 more
