Watermarking Generative Categorical Data

Bochao Gu, Hengzhi He, Guang Cheng

TL;DR

The method embeds pre-agreed secret signals by splitting the data distribution into two components and modifying one of them through a deterministic relationship with the other, so that the watermark is embedded at the distribution level.

Abstract

In this paper, we propose a novel statistical framework for watermarking generative categorical data. Our method systematically embeds pre-agreed secret signals by splitting the data distribution into two components and modifying one component based on a deterministic relationship with the other, ensuring the watermark is embedded at the distribution level. To verify the watermark, we introduce an insertion inverse algorithm and detect the watermark's presence by measuring the total variation distance between the inverse-decoded data and the original distribution. Unlike previous categorical watermarking methods, which primarily focus on embedding watermarks into a given dataset, our approach operates at the distribution level, allowing for verification from a statistical, distributional perspective. This makes it particularly well suited to the modern paradigm of synthetic data generation, where the underlying data distribution, rather than specific data points, is of primary importance. The effectiveness of our method is demonstrated through both theoretical analysis and empirical validation.
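
The insert / inverse / total-variation pipeline described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes, purely for illustration, that the two components are two categorical columns `x` and `y`, that the "deterministic relationship" is a key-dependent cyclic shift of `y` computed from `x`, and that the secret is a single integer. The names `shift`, `insert`, `invert`, `empirical` and `tv_distance` are hypothetical.

```python
from collections import Counter


def shift(secret: int, x: int, k: int) -> int:
    """Hypothetical deterministic map from (secret, x) to a shift in [0, k)."""
    return (secret * (x + 1)) % k


def insert(rows, secret, k):
    """Watermark: shift the y component by an amount determined by x and the secret."""
    return [(x, (y + shift(secret, x, k)) % k) for x, y in rows]


def invert(rows, secret, k):
    """Inverse of insert: undo the shift using the same secret."""
    return [(x, (y - shift(secret, x, k)) % k) for x, y in rows]


def empirical(rows):
    """Empirical joint distribution of the (x, y) pairs."""
    n = len(rows)
    return {key: count / n for key, count in Counter(rows).items()}


def tv_distance(p, q):
    """Total variation distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)


# Toy table (an assumption): 4 categories for x, K = 5 categories for y,
# with y deliberately skewed toward small values.
K = 5
rows = [(x, y) for x in range(4) for y in range(K) for _ in range(K - y)]

SECRET = 3
watermarked = insert(rows, SECRET, K)
original = empirical(rows)

# Decode with the right and with a wrong secret, then compare to the original.
tv_right = tv_distance(original, empirical(invert(watermarked, SECRET, K)))
tv_wrong = tv_distance(original, empirical(invert(watermarked, 1, K)))
print(tv_right, tv_wrong)
```

In this toy setting, decoding with the correct secret reproduces the original table, so the distance is zero, while a wrong secret leaves a residual shift in every `x` group and the distance stays positive. A real detector would compare against samples from the distribution rather than the exact table, so the decision would rest on a statistical threshold that this sketch does not model.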

Paper Structure

This paper contains 15 sections, 1 theorem, 19 equations, 7 figures, 5 tables, 5 algorithms.

Key Result

Theorem 3.1

Let $T$ be the unwatermarked table and $\hat{T}_{\mathit{secret}}$ be the watermarked table produced by our watermark insertion algorithm. Let $D$ and $\hat{D}_{\mathit{secret}}$ be the underlying distributions of $T$ and ...

Figures (7)

  • Figure 1: $\#(Y) = 3$, samples 12000, $\mathit{secret}$, $x = [2,2,2,23]$
  • Figure 2: $\#(Y) = 5$, samples 12000, $\mathit{secret}$, $x = [2,24,2,23]$
  • Figure 3: $\#(Y) = 10$, samples 12000, $\mathit{secret}$, $x = [233,2,2,23]$
  • Figure 4: p-values under the replacement attack, $p_w = 0.05$
  • Figure 5: $d_{TV}(D, D_{attack,m}^{-1})$ under the replacement attack, $p_w = 0.05$
  • ...and 2 more figures

Theorems & Definitions (3)

  • Definition 3.1
  • Theorem 3.1
  • proof