DREW: Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking

Mehrdad Saberi, Vinu Sankar Sadasivan, Arman Zarei, Hessam Mahdavifar, Soheil Feizi

TL;DR

DREW targets robust data provenance by combining error-controlled watermarking with embedding-based retrieval. It randomly partitions the reference dataset into $2^k$ clusters, uses an error control code (ECC) to encode each $k$-bit cluster code into an $n$-bit watermark key, and injects that key into every sample of the cluster. At query time, the watermark is extracted and ECC-decoded; if the decoded cluster code $c$ passes the reliability check, the search is restricted to the cluster $X_c$, and otherwise the full dataset is searched. Within the chosen search space, similarity between embeddings from the model $\phi$ identifies the closest match. This improves accuracy under edits while preserving baseline performance when the reliability check fails. Experiments on multiple image datasets with DinoV2 and CLIP embeddings show retrieval-accuracy gains of up to 40% for some datasets and modification types, with small false-positive rates for the reliability check, and point to more robust watermarking techniques as the main route to further improvement.
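
The encode, decode, and reliability-check steps can be illustrated with a toy example. The sketch below is not the paper's ECC: it swaps in a simple repetition code with majority-vote decoding as a stand-in, and the function names (`assign_clusters`, `encode`, `decode`), the repetition factor `r`, and the `min_margin` threshold are all hypothetical choices made for illustration.

```python
import random


def assign_clusters(num_samples: int, k: int, seed: int = 0) -> list[int]:
    """Randomly assign each sample to one of 2**k clusters."""
    rng = random.Random(seed)
    return [rng.randrange(2 ** k) for _ in range(num_samples)]


def encode(cluster_id: int, k: int, r: int) -> list[int]:
    """Stand-in ECC encoder: repeat each of the k code bits r times (n = k * r)."""
    bits = [(cluster_id >> i) & 1 for i in range(k)]  # least significant bit first
    return [b for b in bits for _ in range(r)]


def decode(key_bits: list[int], k: int, r: int, min_margin: int) -> tuple[int, bool]:
    """Majority-vote decoder with a simple reliability flag.

    The result is marked unreliable when any bit's vote margin falls below
    min_margin (an assumed criterion, not the paper's reliability module).
    """
    cluster_id = 0
    reliable = True
    for i in range(k):
        ones = sum(key_bits[i * r:(i + 1) * r])
        zeros = r - ones
        if ones > zeros:
            cluster_id |= 1 << i
        if abs(ones - zeros) < min_margin:
            reliable = False
    return cluster_id, reliable
```

With `k = 3` and `r = 5`, a key with one flipped bit still decodes reliably, while a second flip in the same group of five drops the vote margin below `min_margin = 3`, which in DREW's terms would trigger the fallback to a full-dataset search.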

Abstract

Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detecting AI-generated content. A standard approach involves embedding-based retrieval techniques that match query data with entries in a reference dataset. However, this method is not robust against benign and malicious edits. To address this, we propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches. The integration of error control codes (ECC) ensures reliable cluster assignments, enabling the method to perform retrieval on the entire dataset in case the ECC algorithm cannot detect the correct cluster with high confidence. This makes DREW maintain baseline performance, while also providing opportunities for performance improvements due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. Depending on the watermark technique used, DREW can provide substantial improvements in retrieval accuracy (up to 40% for some datasets and modification types) across multiple datasets and state-of-the-art embedding models (e.g., DinoV2, CLIP), making our method a promising solution for secure and reliable source identification. The code is available at https://github.com/mehrdadsaberi/DREW

Paper Structure

This paper contains 17 sections, 1 theorem, 4 equations, 9 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

Assume that the top-1 accuracy of embedding-based retrieval on a dataset $X$ of size $N$ is $\alpha$ (i.e., $\mathbb{P}(\bar{x}^* = x_i)=\alpha$), and that its top-$p$ accuracy is $\alpha_p$. Then,

Figures (9)

  • Figure 1: Overview of DREW. (Top) During the pre-processing phase, $2^k$ clusters are created, each associated with a unique watermark key produced by the ECC encoder module. Instances from the dataset are randomly allocated to these clusters, and the corresponding watermark keys are injected into them. (Bottom) Upon receiving a query sample, the watermark decoder extracts the injected key, and the embedding model computes an embedding vector for the query. The ECC decoder processes this extracted key to identify the cluster code associated with the query. If the ECC reliability module confirms the reliability of the decoded cluster code, an embedding-based retrieval is conducted within the corresponding cluster. If the reliability is not ensured, the retrieval is performed across the entire dataset.
  • Figure 2: Source identification accuracy of DREW (with Trustmark used as the watermarking technique) vs naive embedding-based retrieval using DinoV2 embeddings, against different types of augmentations.
  • Figure 3: Accuracy of image retrieval using DinoV2 embeddings on a subset of $\frac{N}{2^k}$ images from each dataset. The plot illustrates the mean (lines) and std (shades around the lines) of accuracy across a set of augmentations.
  • Figure 4: False positive rate of the ECC's reliability check module ($\epsilon_r$), for various datasets and augmentations. This value is used in the paper's performance analysis section.
  • Figure 5: Accuracy of our method vs naive retrieval using DinoV2 embeddings, against different types and severity levels of augmentations. The utilized watermarking method is Trustmark.
  • ...and 4 more figures
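
The query-time flow in the Figure 1 caption (search within the decoded cluster when it is reliable, otherwise search the whole dataset) can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity is assumed as the embedding similarity, embeddings are assumed to be nonzero, the function names `cosine` and `retrieve` are hypothetical, and the fallback for an empty cluster is an added safeguard that the source does not describe.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_emb, ref_embs, cluster_ids, decoded_cluster, reliable):
    """Return the index of the best-matching reference sample.

    If the decoded cluster code is reliable, only that cluster is searched;
    otherwise the full reference set is searched (baseline behaviour).
    """
    everything = list(range(len(ref_embs)))
    if reliable:
        candidates = [i for i, c in enumerate(cluster_ids) if c == decoded_cluster]
        if not candidates:  # assumed safeguard: empty cluster, search everything
            candidates = everything
    else:
        candidates = everything
    return max(candidates, key=lambda i: cosine(query_emb, ref_embs[i]))
```

For example, with reference embeddings `[[1, 0], [0, 1], [0.9, 0.1]]` in clusters `[0, 1, 1]` and the query `[1, 0.05]`, an unreliable decode searches all three entries and returns index 0, while a reliable decode of cluster 1 restricts the search to indices 1 and 2 and returns index 2.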

Theorems & Definitions (2)

  • Definition 1
  • Lemma 1