Not Every Image is Worth a Thousand Words: Quantifying Originality in Stable Diffusion

Adi Haviv; Shahar Sarfaty; Uri Hacohen; Niva Elkin-Koren; Roi Livni; Amit H Bermano

TL;DR

This paper tackles the challenge of quantifying originality in text-to-image diffusion models by proposing a token-length-based measure derived from multi-token textual inversion. Through controlled synthetic generalization experiments and real-world domain tests, it shows that models reconstruct familiar concepts with shorter token sequences, while original or unseen content requires more tokens for accurate reconstruction, and that this token count correlates with perceived originality. The approach combines Stable Diffusion, multi-token textual inversion, and DreamSim-based reconstruction evaluation to assess originality without relying on training prompts or disclosure of the training data. The findings suggest that model familiarity underpins originality signals, with implications for copyright analysis, model auditing, and the responsible deployment of generative content. Overall, the work provides a practical, distribution-aware framework for assessing originality in generative models and highlights the value of dataset diversity for fostering creative output within legal and ethical boundaries.

Abstract

This work addresses the challenge of quantifying originality in text-to-image (T2I) generative diffusion models, with a focus on copyright originality. We begin by evaluating T2I models' ability to innovate and generalize through controlled experiments, revealing that Stable Diffusion models can effectively recreate unseen elements when the training data is sufficiently diverse. Our key insight is that concepts and combinations of image elements that the model is familiar with, and saw more often during training, are represented more concisely in the model's latent space. We hence propose a method that leverages textual inversion to measure the originality of an image based on the number of tokens required for its reconstruction by the model. Our approach is inspired by legal definitions of originality and aims to assess whether a model can produce original content without relying on specific prompts or access to the model's training data. We demonstrate our method using both a pre-trained Stable Diffusion model and a synthetic dataset, showing a correlation between the number of tokens and image originality. This work contributes to the understanding of originality in generative models and has implications for copyright infringement cases.
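
As an illustration of the token-count idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that textual inversion has already been run at several token lengths and that each reconstruction has been scored with a perceptual distance (lower means closer to the query image). The function name `min_tokens_for_reconstruction`, the distance values, and the threshold are all hypothetical.

```python
from typing import Mapping, Optional


def min_tokens_for_reconstruction(
    distance_by_tokens: Mapping[int, float],
    threshold: float,
) -> Optional[int]:
    """Return the smallest token count whose reconstruction is close enough.

    distance_by_tokens maps a token length to the perceptual distance between
    the query image and its reconstruction at that length (hypothetical data).
    Returns None if no tested token length reaches the threshold.
    """
    for n_tokens in sorted(distance_by_tokens):
        if distance_by_tokens[n_tokens] <= threshold:
            return n_tokens
    return None


# Hypothetical distances, chosen only to illustrate the idea:
# familiar images reconstruct well with few tokens, unseen ones need more.
THRESHOLD = 0.30  # assumed cut-off, not taken from the paper
common = {1: 0.20, 3: 0.18, 5: 0.17}
rare = {1: 0.55, 3: 0.27, 5: 0.22}
unseen = {1: 0.70, 3: 0.52, 5: 0.28}

for name, scores in [("common", common), ("rare", rare), ("unseen", unseen)]:
    print(name, min_tokens_for_reconstruction(scores, THRESHOLD))
```

Under these made-up numbers, a larger returned token count would be read as a sign of greater originality with respect to the model; the paper's actual scoring procedure may differ in how reconstructions are evaluated and aggregated.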

Paper Structure (50 sections, 1 equation, 13 figures, 1 table)

This paper contains 50 sections, 1 equation, 13 figures, 1 table.

Introduction
T2I Models Produce Original Content
Setup and Methodology
Data Diversity Promotes Generalizability
Measuring Originality Using Conditioned Text
Stable Diffusion
Textual Inversion
Method
Single Token vs. Multi-Token
Reconstruction
In-Distribution Assessment
Experimental setup
Synthetic Framework
Quantifying Originality in Synthetic Framework
Real-World Setting
...and 35 more sections

Figures (13)

Figure 1: Illustration of our approach for measuring image originality using multi-token textual inversion. Original images require more tokens for accurate reconstruction, while common images like Van Gogh's "Starry Night" need only one token.
Figure 2: Generalization experiments diagram on synthetic data. (i) We evaluate the relationship between data diversity and originality by running experiments over sets of distinct elements in increasing sizes. (ii) Examples of datasets synthesized from the respective element sets illustrate the variety within the data. (iii) T2I models trained from scratch using the corresponding datasets, with images generated by prompting the models with either an empty prompt or specific element descriptions.
Figure 3: Synthetic generalization experiments results. Center: Generalization capability of the trained models vs. training data diversity (x-axis) and conditioning types (blue line vs. orange line). Sides: Detailed distributions for a specific set and missing elements. These results support the notion that models generate both original and reproduced content, highly depending on the training data.
Figure 4: Method overview. We begin with a query image and a domain-relevant prompt (left). The query is processed through textual inversion (Gal et al., 2022) with different token lengths. With each inversion, images are reconstructed and edited (generation with variations). After ensuring each reconstruction is in-distribution, we estimate the generative quality of the concept with DreamSim (Fu et al., 2023) (right).
Figure 5: Qualitative results for reconstructing common, rare, and unseen images. Unseen concepts require five tokens for correct reconstruction, rare images require three, and common ones only one.
...and 8 more figures
