TabularMark: Watermarking Tabular Datasets for Machine Learning

Yihao Zheng; Haocheng Xia; Junyuan Pang; Jinfei Liu; Kui Ren; Lingyang Chu; Yang Cao; Li Xiong

TL;DR

TabularMark introduces a hypothesis-testing–based watermarking scheme for tabular datasets that preserves ML utility while enabling reliable ownership verification. It embeds a watermark by partitioning a perturbation range into green and red domains and perturbing a small set of key cells, with detection via a one-proportion z-test that controls false positives. The approach is complemented by a matching mechanism using MSBs to counter primary-key replacement and a theoretical analysis showing robustness against common attacks. Empirical results across real and synthetic datasets demonstrate strong detectability, minimal impact on downstream ML tasks, and robust resistance to alteration, insertion, deletion, and other attacks, outperforming comparable schemes in non-intrusiveness and maintaining model performance. The work provides practical guidelines for parameter choices and introduces an optimization to further reduce data distortion while preserving watermark detectability.
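The embedding step described above can be illustrated with a minimal sketch. It assumes the perturbation range is $[-p, p]$, split into $k$ equal unit domains of which half are randomly labeled green, and that key cells and partitions are derived from a secret seed. The function names, the per-cell seeding scheme, and the use of a single numerical column are assumptions made for illustration, not the authors' implementation.

```python
import random

def green_domains(p, k, rng):
    """Split [-p, p] into k equal unit domains and randomly label half of them green.
    (Assumption: equal-width domains and a 50/50 green/red split.)"""
    width = 2 * p / k
    units = [(-p + i * width, -p + (i + 1) * width) for i in range(k)]
    green_idx = set(rng.sample(range(k), k // 2))
    return [u for i, u in enumerate(units) if i in green_idx]

def embed(values, n_w, p, k, seed):
    """Perturb n_w randomly selected key cells with noise drawn from a green domain.
    Returns the watermarked copy and the sorted key-cell indices."""
    rng = random.Random(seed)
    watermarked = list(values)
    key_cells = sorted(rng.sample(range(len(values)), n_w))
    for idx in key_cells:
        # Hypothetical per-cell seeding so a detector could rebuild the same partition.
        cell_rng = random.Random(f"{seed}:{idx}")
        lo, hi = cell_rng.choice(green_domains(p, k, cell_rng))
        watermarked[idx] = values[idx] + cell_rng.uniform(lo, hi)
    return watermarked, key_cells

column = [float(i) for i in range(50)]
marked, keys = embed(column, n_w=10, p=1.0, k=4, seed=7)
```

Only the key cells change, and each change is bounded by $p$, which is the property that keeps distortion small in this sketch.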

Abstract

Watermarking is broadly utilized to protect ownership of shared data while preserving data utility. However, existing watermarking methods for tabular datasets fall short on the desired properties (detectability, non-intrusiveness, and robustness) and only preserve data utility from the perspective of data statistics, ignoring the performance of downstream ML models trained on the datasets. Can we watermark tabular datasets without significantly compromising their utility for training ML models while preventing attackers from training usable ML models on attacked datasets? In this paper, we propose a hypothesis testing-based watermarking scheme, TabularMark. Data noise partitioning is utilized for data perturbation during embedding, which is adaptable for numerical and categorical attributes while preserving the data utility. For detection, a custom-threshold one proportion z-test is employed, which can reliably determine the presence of the watermark. Experiments on real-world and synthetic datasets demonstrate the superiority of TabularMark in detectability, non-intrusiveness, and robustness.
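The detection test mentioned in the abstract can be sketched as follows. This assumes a null proportion of 0.5 (an unwatermarked cell is equally likely to land in a green or red domain) and a right-tailed decision against a custom threshold $\alpha$. The function names and the example counts are hypothetical.

```python
import math

def one_proportion_z(n_green, n_w, p0=0.5):
    """z-score of observing n_green green cells among n_w key cells under H0: proportion = p0."""
    return (n_green - p0 * n_w) / math.sqrt(n_w * p0 * (1 - p0))

def is_watermarked(n_green, n_w, alpha):
    """Reject H0 (declare the watermark present) when the z-score exceeds the threshold alpha."""
    return one_proportion_z(n_green, n_w) > alpha

# Hypothetical counts: 75 of 100 key cells fall in green domains.
z = one_proportion_z(75, 100)
decision = is_watermarked(75, 100, alpha=3.0)
```

Raising $\alpha$ lowers the false-positive rate at the cost of requiring more key cells to remain green after an attack.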

Paper Structure (22 sections, 3 theorems, 10 equations, 10 figures, 20 tables, 3 algorithms)

Introduction
Related Work
Algorithms
Threat Model
Watermark Embedding
Watermark Detection
One Proportion Z-test
Detection
Matching
Analysis on Watermark Removal
Experiments and Analysis
Detectability
Non-Intrusiveness
Robustness
Comparison
...and 7 more sections

Key Result

Proposition 1

Denote by $n_h$ the number of cells an attacker tampers with. When i.i.d. noise drawn from the uniform distribution $\epsilon \sim U[-2\sigma,2\sigma]$ is added to each cell, the expectation of $n_h$ for removing the watermark is […], where $n_\alpha = \alpha \sqrt{\frac{n_w}{4}}+\frac{n_w}{2}$ is the minimum number of green cells needed for the z-score to reach $\alpha$ and $p_\sigma=0.5-\frac{p}{4k\sigma}$. For examp
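The two quantities defined in Proposition 1 can be evaluated directly. The sketch below only computes $n_\alpha$ and $p_\sigma$ from the formulas as stated and checks that $n_\alpha$ green cells yield a z-score of exactly $\alpha$. All parameter values are hypothetical, and the expression for the expectation of $n_h$ is not reproduced because it is not available in the text above.

```python
import math

def min_green_cells(alpha, n_w):
    """n_alpha = alpha * sqrt(n_w / 4) + n_w / 2, as stated in Proposition 1."""
    return alpha * math.sqrt(n_w / 4) + n_w / 2

def p_sigma(p, k, sigma):
    """p_sigma = 0.5 - p / (4 * k * sigma), as stated in Proposition 1."""
    return 0.5 - p / (4 * k * sigma)

# Hypothetical parameters chosen only to make the arithmetic easy to follow.
n_w, alpha = 400, 3.0
n_alpha = min_green_cells(alpha, n_w)

# Sanity check: plugging n_alpha back into the z-score with null proportion 0.5 recovers alpha.
z_at_threshold = (n_alpha - n_w / 2) / math.sqrt(n_w / 4)

example_p_sigma = p_sigma(p=1.0, k=2, sigma=1.0)
```

With these values, an attacker would need to push the green-cell count below 230 of the 400 key cells for detection to fail.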

Figures (10)

Figure 1: Flowchart of TabularMark, where $D_o$ is the original dataset, $D_w$ is the watermarked dataset, and $D_a$ is the watermarked dataset after it has been attacked.
Figure 2: An example of domain partition.
Figure 3: An example of watermark embedding.
Figure 4: An example for watermark detection.
Figure 5: An example of matching tuples.
...and 5 more figures

Theorems & Definitions (3)

Proposition 1
Proposition 2
Proposition 3
