TAB-DRW: A DFT-based Robust Watermark for Generative Tabular Data

Yizhou Zhao, Xiang Li, Peter Song, Qi Long, Weijie Su

TL;DR

This work addresses the challenge of tracing provenance in high-fidelity synthetic tabular data by introducing TAB-DRW, a post-editing watermarking method that embeds signals in the frequency domain after Yeo-Johnson transformation and standardization. It introduces a novel rank-based pseudorandom bit generation scheme enabling efficient, row-wise retrieval without storing per-row bits, and supports both hard and soft imaginary-part modifications of the DFT to balance fidelity and detectability. The paper provides theoretical distortion and robustness guarantees and validates the approach on five real datasets, demonstrating strong detectability, robustness to post-processing and adaptive attacks, and high data utility. A privacy-enhanced variant enables multi-key deployments with low false-positive risk, highlighting TAB-DRW’s practicality for provenance of generative tabular data in real-world settings.

Abstract

The rise of generative AI has enabled the production of high-fidelity synthetic tabular data across fields such as healthcare, finance, and public policy, raising growing concerns about data provenance and misuse. Watermarking offers a promising way to address these concerns by making synthetic data traceable, but existing methods face several limitations: they are computationally expensive due to reliance on large diffusion models, struggle with mixed discrete-continuous data, or lack robustness to post-modifications. To address these limitations, we propose TAB-DRW, an efficient and robust post-editing watermarking scheme for generative tabular data. TAB-DRW embeds watermark signals in the frequency domain: it normalizes heterogeneous features via the Yeo-Johnson transformation and standardization, applies the discrete Fourier transform (DFT), and adjusts the imaginary parts of adaptively selected entries according to precomputed pseudorandom bits. To further enhance robustness and efficiency, we introduce a novel rank-based pseudorandom bit generation method that enables row-wise retrieval without incurring storage overhead. Experiments on five benchmark tabular datasets show that TAB-DRW achieves strong detectability and robustness against common post-processing attacks, while preserving high data fidelity and fully supporting mixed-type features.
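The embedding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the table has already been normalized (the Yeo-Johnson transformation and standardization are omitted), it uses a keyed NumPy generator as a stand-in for the paper's rank-based bit generation, it fixes the frequency set `S` by hand instead of selecting it adaptively, and it shows only a "hard" sign alignment. The function names `embed_signs` and `match_rate` are hypothetical.

```python
import numpy as np

def _bits(key, n_rows, n_freqs):
    # Stand-in for the paper's rank-based bits: a keyed PRNG (assumption).
    return np.random.default_rng(key).integers(0, 2, size=(n_rows, n_freqs))

def embed_signs(Z, key, S):
    """Force sign(Im F[i, k]) to follow a pseudorandom bit for each k in S.

    Z is an (n, p) real array assumed to be already normalized.
    S holds frequency indices in 1..floor((p - 1) / 2).
    """
    n, p = Z.shape
    F = np.fft.fft(Z, axis=1)                 # row-wise DFT
    signs = 2 * _bits(key, n, len(S)) - 1     # map {0, 1} to {-1, +1}
    for j, k in enumerate(S):
        imag = signs[:, j] * np.abs(F[:, k].imag)
        F[:, k] = F[:, k].real + 1j * imag
        F[:, p - k] = np.conj(F[:, k])        # keep conjugate symmetry
    return np.fft.ifft(F, axis=1).real        # back to a real-valued table

def match_rate(Z, key, S):
    """Fraction of selected imaginary parts whose sign agrees with the bits."""
    F = np.fft.fft(Z, axis=1)
    signs = 2 * _bits(key, Z.shape[0], len(S)) - 1
    return float(np.mean(np.sign(F[:, S].imag) == signs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.standard_normal((200, 8))         # toy normalized table, p = 8
    S = [1, 2, 3]                             # floor((8 - 1) / 2) = 3
    Z_wm = embed_signs(Z, key=42, S=S)
    print("watermarked match rate:  ", match_rate(Z_wm, 42, S))
    print("unwatermarked match rate:", match_rate(Z, 42, S))
```

In this toy setting the match rate is close to 1 for the watermarked table and close to 0.5 for the unmodified one, which is the kind of gap a detector would test for. Mirroring each modified coefficient at index `p - k` is what keeps the inverse DFT real-valued.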

Paper Structure

This paper contains 73 sections, 8 theorems, 61 equations, 9 figures, 22 tables, 3 algorithms.

Key Result

Proposition 1

Let $S \subseteq \{1, \dots, m\}$ with $m = \lfloor \frac{p-1}{2} \rfloor$ denote the set of frequency coordinates whose imaginary signs are modified by our watermarking method. Let $\Delta x_{i,j} = x_{i,j}^{\mathrm{wm}} - x_{i,j}$ denote the entry-wise difference. Then where $\boldsymbol{\beta}_j = (\beta_S(0,j), \dots, \beta_S(p-1,j))^\top$, and $\beta_S(n,j) = \sum_{k \in S} \sin\left(\frac{2

Figures (9)

  • Figure 1: Our proposed watermarking scheme, TAB-DRW, embeds watermarks into tabular data by modifying the imaginary components of the frequency-domain representation to align with pseudorandom bits. Detection evaluates the degree of alignment: strong alignment indicates watermarked data, while weak alignment suggests non-watermarked data.
  • Figure 2: In the proposed pseudorandom bit generation scheme, the bit sequence for each row is generated by mapping a row-wise rank statistic to a leaf node in a binary tree.
  • Figure 3: Trade-off between average Z-score on 1K-rows tables and data fidelity under varying $(\gamma, \delta)$.
  • Figure 4: TPR@0.1%FPR versus row count under three representative attacks. Dashed lines show the bootstrap mean estimate (500 resamples), and shaded regions indicate the 90% confidence interval.
  • Figure 5: Visualization of the gender-flipping case study on the Adult dataset. Each subfigure corresponds to a synthetic 5K-row table pair (watermarked vs. unwatermarked). "Flip (misclustered)" denotes samples whose original "gender" label conflicts with their cluster label and subsequently flips after watermarking. "Flip (aligned)" denotes samples whose original "gender" label matches the cluster label but still flips after watermarking. In subfigure (a), 26 out of 5K samples exhibit a gender flip (46.2% misclustered); their mean distance to the cluster boundary is 0.31, compared with 0.78 for the remaining samples. In subfigure (b), 27 out of 5K samples exhibit a gender flip (48.1% misclustered); their mean distance to the cluster boundary is 0.26, compared with 0.75 for the remaining samples. In subfigure (c), 33 out of 5K samples exhibit a gender flip (48.5% misclustered); their mean distance to the cluster boundary is 0.29, compared with 0.77 for the remaining samples.
  • ...and 4 more figures

Theorems & Definitions (22)

  • Definition 1: YJT
  • Definition 2: DFT and IDFT
  • Remark 1: Related work
  • Remark 2: Column selection for watermarking
  • Proposition 1: Entry-wise differences
  • Theorem 1: Column-wise differences
  • Theorem 2: Robustness
  • Remark 3
  • Proof: Proof of Proposition \ref{prop:perturbation}
  • Lemma 1: Gaussian tail bound
  • ...and 12 more