
Adaptive and Robust Watermark for Generative Tabular Data

Dung Daniel Ngo, Archan Ray, Akshay Seshadri, Daniel Scott, Saheed Obitayo, Niraj Kumar, Vamsi K. Potluru, Marco Pistoia, Manuela Veloso

TL;DR

The paper tackles watermarking of generative tabular data with the PAIR algorithm, a simple scheme that partitions the feature space into (key, value) column pairs and labels intervals of the value columns to embed the watermark. The accompanying theory covers four properties: fidelity, with bounds on the entry-wise distance $\|\mathbf{X}-\mathbf{X}_w\|_\infty$ and on the Wasserstein distance; detection, via per-column z-tests with Bonferroni correction; robustness, with guarantees against additive noise, truncation, and feature selection; and hardness of decoding, via sample-complexity limits based on VC dimension. Empirically, the method achieves fidelity and downstream utility comparable to or better than prior work, remains detectable, and withstands attacks, including a spoofing scenario in which it resists simple fractional replacement attacks. The work advances the theoretical understanding of tabular data watermarking and offers practical protection for synthetic data in real-world applications; extensions to broader data types and generation-time approaches are left for future work.
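To make the detection step concrete, the following is a minimal sketch of a per-column one-sided z-test with a Bonferroni-corrected threshold. It is not the paper's exact procedure: the null green fraction `gamma = 0.5`, the rule "flag the table if any column passes", and the function names are all assumptions made for illustration.

```python
import math


def one_sided_p_value(green_count, num_rows, gamma=0.5):
    """P-value of a one-sided z-test on the number of 'green' cells in one column.

    Assumption: without a watermark, each cell is green independently with
    probability `gamma`, so the count is approximately normal.
    """
    mean = gamma * num_rows
    std = math.sqrt(num_rows * gamma * (1.0 - gamma))
    z = (green_count - mean) / std
    # Upper-tail probability of a standard normal: P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def detect(green_counts, num_rows, alpha=0.05, gamma=0.5):
    """Flag the table as watermarked if any column passes the corrected test.

    Bonferroni correction: each of the len(green_counts) tests is run at
    level alpha / len(green_counts), which keeps the overall false-positive
    rate at most alpha.
    """
    threshold = alpha / len(green_counts)
    p_values = [one_sided_p_value(c, num_rows, gamma) for c in green_counts]
    return any(p < threshold for p in p_values), p_values
```

Here `green_counts[j]` is the number of rows whose cell in the j-th value column falls in a green interval. A table whose value columns are almost entirely green yields very small p-values, while counts close to `gamma * num_rows` do not.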

Abstract

In recent years, watermarking generative tabular data has become a prominent framework to protect against the misuse of synthetic data. However, while most prior work on watermarking methods for tabular data demonstrates a wide variety of desirable properties (e.g., high fidelity, detectability, robustness), the findings often emphasize empirical guarantees against common oblivious and adversarial attacks. In this paper, we study a flexible and robust watermarking algorithm for generative tabular data. Specifically, we demonstrate theoretical guarantees on the performance of the algorithm on metrics like fidelity, detectability, robustness, and hardness of decoding. The proof techniques introduced in this work may be of independent interest and may find applicability in other areas of machine learning. Finally, we validate our theoretical findings on synthetic and real-world tabular datasets.

Paper Structure

This paper contains 68 sections, 16 theorems, 71 equations, 5 figures, 8 tables, 3 algorithms.

Key Result

Theorem 4.1

Let $\mathbf{X} \in [0,1]^{m \times 2n}$ be a tabular dataset, and let $\mathbf{X}_w$ be its watermarked version produced by Algorithm \ref{alg:tabular}. For any $\delta \in (0,1)$, with probability at least $1 - \delta$, the entry-wise $\ell_\infty$-distance between $\mathbf{X}$ and $\mathbf{X}_w$ is bounded above by:

Figures (5)

  • Figure 1: Illustrative example of Algorithm \ref{alg:tabular} on a tabular dataset with $3$ rows and $4$ columns. This structure corresponds to $2$ pairs of $(key, value)$ columns. In the first row, the element $V_1$ is already in a 'green' interval. Meanwhile, the other elements, $V_2$ and $V_3$, have to be moved from a 'red' interval to a nearby 'green' interval.
  • Figure 2: (a, b) KDE plots of the Gaussian data before and after watermarking, respectively. (c) Smaller bin sizes result in higher fidelity (lower RMSE). (d) $p$-values remain below $0.05$ across datasets. (e) Smaller bin sizes are more susceptible to zero-mean Gaussian noise with standard deviation $\sigma$. (f) At $60\%$ of data processed by Algorithm \ref{alg:fractional}, WGTD can be spoofed while Algorithm \ref{alg:tabular} remains robust.
  • Figure 3: Robustness under PAIR. Each experiment is repeated 5 times, and the mean and standard deviation are reported. Across all datasets, feature importance pairing retains more column pairs than random pairing.
  • Figure 4: Neural network architecture to embed the problem of learning the query function. The activation function is applied at every node, except the nodes in the input and the output layers.
  • Figure 5: Neural network architecture to embed the problem of learning the query function for a dataset watermarked according to Algorithm \ref{alg:tabular}. The activation function is applied at every node, except the nodes in the input and the output layers.
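
The embedding step illustrated in Figure 1 can be sketched roughly as follows. This is a hedged illustration, not the paper's algorithm: the way the key cell seeds the green intervals (SHA-256 of a rounded key plus a hypothetical `secret`), the choice of half the bins as green, and moving a red value to the centre of the nearest green bin are all assumptions made here.

```python
import hashlib
import random


def green_bins(key_value, num_bins, secret="demo-secret", digits=3):
    """Pick the 'green' bins for one row from its key cell (assumed rule)."""
    token = f"{secret}:{round(key_value, digits)}".encode()
    seed = int.from_bytes(hashlib.sha256(token).digest()[:8], "big")
    rng = random.Random(seed)
    # Assumption: half of the equal-width bins on [0, 1] are green.
    return set(rng.sample(range(num_bins), num_bins // 2))


def bin_index(value, num_bins):
    """Index of the equal-width bin on [0, 1] that contains `value`."""
    return min(int(value * num_bins), num_bins - 1)


def embed_value(key_value, value, num_bins=20, secret="demo-secret"):
    """Leave a green value untouched; move a red value to the nearest green bin."""
    green = green_bins(key_value, num_bins, secret)
    idx = bin_index(value, num_bins)
    if idx in green:
        return value
    nearest = min(green, key=lambda g: (abs(g - idx), g))
    # Assumption: the moved value lands at the centre of that green bin.
    return (nearest + 0.5) / num_bins


def embed_table(rows, pairs, num_bins=20, secret="demo-secret"):
    """Watermark every (key, value) column pair; key columns are not modified."""
    out = [list(row) for row in rows]
    for row in out:
        for key_col, value_col in pairs:
            row[value_col] = embed_value(row[key_col], row[value_col], num_bins, secret)
    return out


# A 3-row, 4-column table with 2 (key, value) pairs, as in Figure 1.
table = [[0.12, 0.40, 0.77, 0.05],
         [0.55, 0.93, 0.31, 0.62],
         [0.80, 0.18, 0.09, 0.49]]
watermarked = embed_table(table, pairs=[(0, 1), (2, 3)])
```

In this sketch a larger `num_bins` means narrower intervals, so a red value typically moves a shorter distance, which mirrors the bin-size and fidelity trade-off described for Figure 2(c).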

Theorems & Definitions (33)

  • Theorem 4.1: Fidelity
  • Corollary 4.2: Wasserstein distance
  • Lemma 4.3
  • Theorem \ref{thm:robustness} (Informal)
  • Theorem 4.4
  • Lemma \ref{lem:robustness-prob-bound} (Informal)
  • Theorem \ref{thm:query-complexity-alg-tab} (Informal)
  • proof
  • Remark E.1: Downstream processing
  • proof
  • ...and 23 more