Tackling Copyright Issues in AI Image Generation Through Originality Estimation and Genericization

Hiroaki Chiba-Okabe; Weijie J. Su

TL;DR

This work tackles copyright risks in AI image generation by introducing an originality-based framework and a practical mitigation method called PREGen. The originality metric, $\mathrm{Originality}(c|x)=\mathbb{E}_{y\sim P(\cdot|x)}[d(c,y)]$, is estimated from model samples and used to steer outputs toward generic representations. PREGen combines prompt rewriting with a model-agnostic genericization step that selects the output with the lowest estimated originality among internally generated samples. This reduces the likelihood of reproducing protected characters, often dramatically (e.g., DETECT drops to zero in some settings), while preserving alignment with user intent. The approach shows strong empirical gains on the COPYCAT benchmark and offers a scalable path to responsible generative AI, at the cost of extra computation and potential deviations from user prompts. Overall, the paper provides a concrete methodology for quantifying originality and reinforcing copyright protection in generative systems, with practical implications for policy and deployment.
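As a rough illustration of how the expectation in the originality metric might be estimated from samples, the sketch below averages a distance over embeddings of images drawn from the model. It is not the authors' implementation. It assumes that $c$ and the samples $y$ have already been mapped to embedding vectors (for example by CLIP or DINOv2, which the figure captions mention) and that $d$ is cosine distance; the function names `cosine_distance` and `estimate_originality` are hypothetical.

```python
import numpy as np


def cosine_distance(a, b):
    """Assumed choice of d: one minus the cosine similarity of two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def estimate_originality(c_emb, sample_embs, distance=cosine_distance):
    """Monte Carlo estimate of E_{y ~ P(.|x)}[d(c, y)].

    c_emb       -- embedding of the item c whose originality is assessed
    sample_embs -- embeddings of samples y drawn from the model for prompt x
    """
    return float(np.mean([distance(c_emb, y) for y in sample_embs]))


# Toy usage with 2-D vectors standing in for real image embeddings.
samples = [[1.0, 0.0], [0.0, 1.0]]
score = estimate_originality([1.0, 0.0], samples)
```

Under this reading, an item that lies far from typical model outputs for a prompt receives a high score, while an item resembling those outputs receives a low one.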

Abstract

The rapid progress of generative AI technology has sparked significant copyright concerns, leading to numerous lawsuits filed against AI developers. Notably, generative AI's capacity for generating images of copyrighted characters has been well documented in the literature, and while various techniques for mitigating copyright issues have been studied, significant risks remain. Here, we propose a genericization method that modifies the outputs of a generative model to make them more generic and less likely to imitate distinctive features of copyrighted materials. To achieve this, we introduce a metric for quantifying the level of originality of data, estimated by drawing samples from a generative model, and applied in the genericization process. As a practical implementation, we introduce PREGen (Prompt Rewriting-Enhanced Genericization), which combines our genericization method with an existing mitigation technique. Compared to the existing method, PREGen reduces the likelihood of generating copyrighted characters by more than half when the names of copyrighted characters are used as the prompt. Additionally, while generative models can produce copyrighted characters even when their names are not directly mentioned in the prompt, PREGen almost entirely prevents the generation of such characters in these cases. Ultimately, this study advances computational approaches for quantifying and strengthening copyright protection, thereby providing practical methodologies to promote responsible generative AI development.
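The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: each candidate's originality is estimated as its mean cosine distance to the other candidates (a leave-one-out choice made here for simplicity), and `rewrite_prompt`, `generate`, and `embed` are hypothetical callables standing in for a prompt rewriter, an image generator, and an embedding model.

```python
import numpy as np


def genericize(embs):
    """Pick the candidate with the lowest estimated originality.

    Originality of each candidate is approximated by its mean cosine
    distance to the other candidates (assumption; requires >= 2 candidates).
    Returns (index of the selected candidate, per-candidate scores).
    """
    E = np.asarray(embs, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    dist = 1.0 - E @ E.T                              # pairwise cosine distances
    n = len(E)
    # The diagonal is (numerically) zero, so summing a row and dividing by
    # n - 1 gives the mean distance to the other candidates.
    originality = dist.sum(axis=1) / (n - 1)
    return int(np.argmin(originality)), originality


def pregen(prompt, rewrite_prompt, generate, embed, n_samples=8):
    """Sketch of prompt rewriting followed by genericization.

    rewrite_prompt, generate, and embed are hypothetical callables.
    """
    rewritten = rewrite_prompt(prompt)
    images = [generate(rewritten) for _ in range(n_samples)]
    idx, _ = genericize([embed(img) for img in images])
    return images[idx]


# Toy usage: three similar candidates and one outlier.
toy = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
chosen, scores = genericize(toy)
```

Generating several candidates per request is what drives the extra computation noted in the TL;DR; `n_samples` is an illustrative parameter rather than a value taken from the paper.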

Paper Structure (37 sections, 4 equations, 10 figures, 8 tables, 1 algorithm)

Introduction
Originality estimation
Originality in the context of copyright law
Originality metric
Leveraging generative models to estimate originality
Choices of the distance metric
Genericization
PREGen
Experiments
Results for originality estimation
Results for genericization
Benchmark and evaluation metrics
Models and algorithms
Results
Related works
...and 22 more sections

Figures (10)

Figure 1: Images produced by the generative model. Sample images are ordered from those generated with abstract prompts on the left to those generated with increasingly specific prompts on the right. More specific prompts tend to produce images that share more visual elements with the copyrighted characters. Images were generated using SDXL.
Figure 2: Originality estimates of copyrighted images and generated images (CLIP). The panels on the left and right show the originality estimates of the images of Mario and Pooh, respectively. Copyrighted images exhibit higher originality values, as shown by green bars with black whiskers representing standard deviation, particularly with abstract prompts. Comparison with the mean and standard deviation of originality estimates of images produced by the generative model using each prompt (red dots and dotted lines) indicates that the originality of the copyrighted images tends to be substantially higher than typical outputs.
Figure 3: Distribution of similarity before and after genericization (CLIP). The panels on the left and right show the distributions of cosine similarity computed from CLIP embeddings for images generated by prompts associated with Mario and Pooh, respectively. Outputs concentrate more around images highly similar to generic images after genericization, whereas outputs highly similar to the images of copyrighted characters are suppressed.
Figure 4: Images generated in the direct anchoring scenario. When the user's prompt is the name of the copyrighted character itself, generative models can produce images that closely resemble the copyrighted character. PREGen tends to exclude elements that are highly unique to copyrighted characters even when the standard method fails to do so. Images were generated using Playground v2.5, PixArt-$\alpha$, and SDXL.
Figure 5: Originality estimates of copyrighted images and generated images (DINOv2). The panels on the left and right show the originality estimates of the images of Mario and Pooh, respectively.
...and 5 more figures
