Adaptive and Robust Watermark for Generative Tabular Data
Dung Daniel Ngo, Archan Ray, Akshay Seshadri, Daniel Scott, Saheed Obitayo, Niraj Kumar, Vamsi K. Potluru, Marco Pistoia, Manuela Veloso
TL;DR
The paper tackles watermarking of generative tabular data by proposing PAIR, a simple pairwise algorithm that partitions the feature space into (key, value) column pairs and labels intervals of the value columns to embed the watermark. It provides a cohesive theoretical framework covering four properties: fidelity, with bounds on $\|\mathbf{X}-\mathbf{X}_w\|_\infty$ and on the Wasserstein distance; detection, via per-column z-tests with Bonferroni correction; robustness, with guarantees against additive noise, truncation, and feature selection; and hardness of decoding, with limits derived from VC-dimension-based sample complexity. Empirically, the method achieves fidelity and downstream utility comparable to or better than prior work, while demonstrating strong detectability and resilience to attacks, including a spoofing scenario in which it resists simple fractional replacement attacks. The work advances the theoretical understanding of tabular data watermarking and offers practical protection for synthetic data in real-world applications; extensions to broader data types and generation-time approaches are left for future work.
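To make the detection step concrete, the following is a minimal sketch of a per-column z-test with Bonferroni correction. It is not the paper's implementation: the names `green_mask`, `column_z_score`, and `detect` are hypothetical, and it assumes values normalized to $[0, 1)$, equal-width intervals with exactly half labeled "green" by a seeded shuffle, and a null hypothesis under which each value lands in a green interval with probability 1/2. How the key column drives the labeling in PAIR is not reproduced here.

```python
import math
import random


def green_mask(num_bins, seed):
    """Label equal-width intervals on [0, 1) so that exactly half are green (True).

    Assumption: a seeded shuffle stands in for whatever keyed labeling the paper uses.
    """
    labels = [True] * (num_bins // 2) + [False] * (num_bins - num_bins // 2)
    random.Random(seed).shuffle(labels)
    return labels


def column_z_score(values, mask):
    """One-proportion z-score for the number of values that fall in green intervals."""
    n = len(values)
    num_bins = len(mask)
    # Map each value in [0, 1) to its interval index and count green hits.
    green = sum(mask[min(int(v * num_bins), num_bins - 1)] for v in values)
    # Under the assumed null, green ~ Binomial(n, 1/2): mean n/2, variance n/4.
    return (green - 0.5 * n) / math.sqrt(0.25 * n)


def detect(columns, masks, alpha=0.05):
    """Flag the table if any column's one-sided p-value is below alpha / m (Bonferroni)."""
    m = len(columns)
    p_values = [
        0.5 * math.erfc(column_z_score(col, mask) / math.sqrt(2))  # upper-tail normal p-value
        for col, mask in zip(columns, masks)
    ]
    return any(p < alpha / m for p in p_values), p_values
```

Dividing the significance level by the number of tested columns keeps the overall false-positive rate at or below `alpha` regardless of how the columns depend on one another, which is the usual reason for pairing per-column tests with a Bonferroni correction.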
Abstract
In recent years, watermarking generative tabular data has become a prominent framework to protect against the misuse of synthetic data. However, while most prior work on watermarking methods for tabular data demonstrates a wide variety of desirable properties (e.g., high fidelity, detectability, robustness), its findings often emphasize empirical guarantees against common oblivious and adversarial attacks. In this paper, we study a flexible and robust watermarking algorithm for generative tabular data. Specifically, we demonstrate theoretical guarantees on the performance of the algorithm on metrics like fidelity, detectability, robustness, and hardness of decoding. The proof techniques introduced in this work may be of independent interest and may apply to other areas of machine learning. Finally, we validate our theoretical findings on synthetic and real-world tabular datasets.
