EnTruth: Enhancing the Traceability of Unauthorized Dataset Usage in Text-to-image Diffusion Models with Minimal and Robust Alterations

Jie Ren; Yingqian Cui; Chen Chen; Yue Xing; Hui Liu; Lingjuan Lyu

TL;DR

EnTruth tackles copyright protection for the training datasets of text-to-image diffusion models by converting memorization into verifiable evidence of unauthorized usage. It introduces template memorization (TM): a templated set of images that share one template but have diverse foregrounds, paired with dataset-specific trigger tokens, induces a memorization signal in models fine-tuned on the protected data. The paper formalizes TM, details how the template and foregrounds are generated, and uses two levels of verification (one-query and multiple-query) to detect infringement even under attacks such as de-duplication and re-captioning, while preserving image-generation quality. Empirical results across multiple datasets and diffusion-model variants show high detection accuracy, robustness to corruptions, and minimal impact on generation quality, with alteration rates as low as $0.2\%$ and protection up to $0.5\%$.
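
The two levels of verification mentioned above can be pictured with a minimal sketch. It is not the paper's implementation: it assumes that images are already mapped to embedding vectors by some copy-detection encoder (the figure captions mention SSCD), and the function names, the similarity threshold, and the hit-rate rule for the multiple-query case are all hypothetical choices made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def one_query_verify(template_emb: Vector, generated_emb: Vector,
                     threshold: float = 0.5) -> bool:
    """One-query check: does a single generated image resemble the template?

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return cosine_similarity(template_emb, generated_emb) >= threshold


def multi_query_verify(template_emb: Vector, generated_embs: Sequence[Vector],
                       threshold: float = 0.5, min_hit_rate: float = 0.5) -> bool:
    """Multiple-query check: flag the model if enough queries hit the template.

    Aggregating by hit rate is an assumption made for this sketch.
    """
    if not generated_embs:
        raise ValueError("at least one generated embedding is required")
    hits = sum(one_query_verify(template_emb, g, threshold) for g in generated_embs)
    return hits / len(generated_embs) >= min_hit_rate


# Toy embeddings standing in for encoder outputs.
template = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]      # resembles the template
unrelated = [0.0, 1.0, 0.0]  # does not resemble the template

single_hit = one_query_verify(template, close)
single_miss = one_query_verify(template, unrelated)
flagged = multi_query_verify(template, [close, unrelated, close])
not_flagged = multi_query_verify(template, [unrelated, unrelated, close])
```

In this toy setup a suspect model would be queried with the trigger tokens, and each generated image would be embedded before being compared with the template; querying several times reduces the chance that a single unlucky sample decides the outcome.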

Abstract

Generative models, especially text-to-image diffusion models, have significantly advanced in their ability to generate images, benefiting from enhanced architectures, increased computational power, and large-scale datasets. While datasets play an important role, their protection remains an unsolved issue. Current protection strategies, such as watermarks and membership inference, either require a high poison rate, which is detrimental to image quality, or suffer from low accuracy and robustness. In this work, we introduce a novel approach, EnTruth, which Enhances Traceability of unauthorized dataset usage by utilizing template memorization. By strategically incorporating template memorization, EnTruth can trigger specific behavior in unauthorized models that serves as evidence of infringement. Our method is the first to investigate a positive application of memorization and use it for copyright protection, which turns a curse into a blessing and offers a pioneering perspective on unauthorized usage detection in generative models. Comprehensive experiments demonstrate its effectiveness in terms of data-alteration rate, accuracy, robustness, and generation quality.

Paper Structure (23 sections, 2 equations, 11 figures, 5 tables)

Introduction
Preliminary Study
Exact memorization by data duplication enhances the detection of unauthorized usage
Challenges of Data Duplication
Method
Template Memorization
Generation of Template
Generation of Foregrounds
Two Levels of Verification
Experiment
Main Results
Robustness Study
Ablation Study
Different Fine-tuning Scenarios
Conclusion
...and 8 more sections

Figures (11)

Figure 1: In template memorization (TM), the T2I model learns the shared template in training images and reproduces the template in generated images.
Figure 2: (a) The similarity score between duplicate data ${x}_{dup}$ and images generated by ${t}_{dup}$. (b) The distribution of SSCD within CC-20k. (c) The distribution of SSCD between ${x}_{dup}$ and images generated by ${t}_{dup}$, with and without re-captioning as preprocessing.
Figure 3: SSCD of pairs in $T$
Figure 4: SSCD of pairs in $T \cup$ CC-20k
Figure 5: Memorization speed
...and 6 more figures
