Measuring Style Similarity in Diffusion Models

Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, Tom Goldstein

TL;DR

This work tackles the challenge of attributing and measuring artistic style in diffusion-based image generation. It introduces LAION-Styles, a large multi-label dataset, and Contrastive Style Descriptors (CSD), a two-part learning objective that jointly captures style while remaining invariant to content. The authors demonstrate state-of-the-art zero-shot style retrieval on DomainNet and WikiArt, analyze Stable Diffusion 2.1 to reveal varying degrees of style replication across artists, and provide practical insights into how prompts influence style copying. By offering a public dataset and code, the paper provides a valuable toolkit for artists and practitioners to assess and understand style attribution and copying in generative models.

Abstract

Generative models are now widely used by graphic designers and artists. Prior works have shown that these models remember and often replicate content from their training data during generation. Hence, as their use proliferates, it has become important to perform a database search, every time before a generated image is used for professional purposes, to determine whether the properties of the image are attributable to specific training data. Existing tools for this purpose focus on retrieving images of similar semantic content. Meanwhile, many artists are concerned with style replication in text-to-image models. We present a framework for understanding and extracting style descriptors from images. Our framework comprises a new dataset curated using the insight that style is a subjective property of an image that captures complex yet meaningful interactions of factors including but not limited to colors, textures, shapes, etc. We also propose a method to extract style descriptors that can be used to attribute the style of a generated image to the images used in the training dataset of a text-to-image model. We showcase promising results in various style retrieval tasks. We also quantitatively and qualitatively analyze style attribution and matching in the Stable Diffusion model. Code and artifacts are available at https://github.com/learn2phoenix/CSD.

Paper Structure (22 sections, 1 equation, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Original artwork of 6 popular artists and the images generated in the style of these artists by three popular text-to-image generative models. The numbers displayed below each image indicate the similarity of the generated image to the artist's style, as measured by the proposed method. A high similarity score suggests a strong presence of the artist's style elements in the generated image. Based on our analyses, we postulate that the three artists on the right were removed (or unlearned) from SD 2.1, while they were present in MidJourney and SD 1.4. Please refer to \ref{sec:sd_96artist_analysis} for more details.
  • Figure 2: Style similarity of Stable Diffusion 2.1 generated images against the artist's prototypical representation. The X-axis shows the similarity when the prompt is not constrained, while the Y-axis shows the similarity when the prompt is constrained to generate an image of a "woman" in the artist's style.
  • Figure 3: Confusion Matrix of errors in WikiArt: Art movements are predicted correctly. Errors occur in cases where movements share the same historical timeline and/or are derived from the same earlier movement.
  • Figure 4: Human study on Style Retrieval: untrained humans turn out to be worse than many feature extractors at matching images from the same artist.
  • Figure 5: Nearest "style" neighbors. For each generated image (referred to as SD Gen), we show the top 5 style neighbors in CSD using our feature extractor. A green or red box around an image indicates whether or not the artist's name used to generate the SD image was present in the caption of that nearest neighbor.
  • ...and 6 more figures
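
The captions for Figures 2 and 5 describe two operations on style descriptors: scoring a generated image against an artist's prototypical representation, and retrieving its top-k nearest style neighbors. The sketch below illustrates both under stated assumptions. It is not the authors' CSD extractor: descriptors are assumed to be precomputed vectors, similarity is assumed to be cosine similarity, and the prototype is assumed to be the normalized mean of an artist's descriptors. All function names and the toy data are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two descriptor vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def top_k_style_neighbors(query, database, k=5):
    """Return the k (score, key) pairs most similar to the query descriptor."""
    scored = [(cosine_similarity(query, desc), key) for key, desc in database.items()]
    return sorted(scored, reverse=True)[:k]


def artist_prototype(descriptors):
    """Assumption: the prototype is the normalized mean of the artist's descriptors."""
    dim = len(descriptors[0])
    mean = [sum(d[i] for d in descriptors) / len(descriptors) for i in range(dim)]
    return l2_normalize(mean)


# Toy 3-dimensional descriptors; real descriptors would come from a style
# feature extractor and be much higher-dimensional.
database = {
    "artist_a/img1": [0.9, 0.1, 0.0],
    "artist_a/img2": [0.8, 0.2, 0.1],
    "artist_b/img1": [0.0, 0.1, 0.9],
}
generated = [0.85, 0.15, 0.05]

# Top-2 style neighbors of the generated image (Figure 5 uses the top 5).
neighbors = top_k_style_neighbors(generated, database, k=2)

# Similarity of the generated image to artist A's prototype (as in Figure 2).
proto_a = artist_prototype([database["artist_a/img1"], database["artist_a/img2"]])
score_a = cosine_similarity(generated, proto_a)
```

In this toy setup, both retrieved neighbors belong to the hypothetical `artist_a`, and the generated descriptor scores highly against that artist's prototype. With real data, the retrieved captions could then be checked for the artist's name, as the green and red boxes in Figure 5 indicate.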