ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

Shaozhe Hao, Kai Han, Zhengyao Lv, Shihao Zhao, Kwan-Yee K. Wong

TL;DR

This work defines Unsupervised Concept Extraction (UCE), a task to automatically extract and reconstruct multiple concepts from a single image without any prior concept knowledge. It introduces ConceptExpress, which leverages pretrained diffusion models in two ways: (1) automatic concept localization via aggregation of diffusion self-attention and FINCH-based clustering to produce latent masks, and (2) concept-wise token learning through a token lookup and masked denoising with a split-and-merge initialization to learn discriminative tokens, plus an Earth Mover’s Distance-based attention alignment to ensure correct cross-attention correspondences. The authors establish a dedicated evaluation protocol with concept similarity and classification accuracy metrics and demonstrate that ConceptExpress outperforms a closely related supervised baseline on both quantitative and qualitative grounds, while enabling text-prompted generation of individual and compositional concepts. The work advances scalable, annotation-free discovery of concept tokens and their use in controllable image generation, with implications for building large concept libraries from unlabeled imagery. Limitations include difficulties with closely related instances and low-frequency concepts, as well as requirements on input image quality and latent resolution.
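The masked denoising step mentioned above can be illustrated with a small sketch. It assumes the loss is a mean squared error between predicted and true noise, counted only at latent positions where the concept mask is 1; the exact weighting and normalization used by the authors are not stated here. The function name `masked_denoising_loss` and the flattened-list inputs are hypothetical stand-ins for the real tensors.

```python
def masked_denoising_loss(pred_noise, true_noise, mask):
    """Mean squared noise-prediction error over positions where mask is 1.

    All three inputs are flattened latents of equal length (an assumption
    made for illustration; a real implementation would use tensors).
    """
    if not (len(pred_noise) == len(true_noise) == len(mask)):
        raise ValueError("pred_noise, true_noise and mask must have equal length")
    total = 0.0
    count = 0
    for p, t, m in zip(pred_noise, true_noise, mask):
        if m:  # only positions inside the concept mask contribute
            total += (p - t) ** 2
            count += 1
    # An empty mask contributes no loss rather than dividing by zero.
    return total / count if count else 0.0


loss = masked_denoising_loss([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [1, 1, 0])
```

Restricting the error to the mask means the token for one concept is not penalized for regions that belong to other concepts or to the background.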

Abstract

While personalized text-to-image generation has enabled the learning of a single concept from multiple images, a more practical yet challenging scenario involves learning multiple concepts within a single image. However, existing works tackling this scenario heavily rely on extensive human annotations. In this paper, we introduce a novel task named Unsupervised Concept Extraction (UCE) that considers an unsupervised setting without any human knowledge of the concepts. Given an image that contains multiple concepts, the task aims to extract and recreate individual concepts solely relying on the existing knowledge from pretrained diffusion models. To achieve this, we present ConceptExpress that tackles UCE by unleashing the inherent capabilities of pretrained diffusion models in two aspects. Specifically, a concept localization approach automatically locates and disentangles salient concepts by leveraging spatial correspondence from diffusion self-attention; and based on the lookup association between a concept and a conceptual token, a concept-wise optimization process learns discriminative tokens that represent each individual concept. Finally, we establish an evaluation protocol tailored for the UCE task. Extensive experiments demonstrate that ConceptExpress is a promising solution to the UCE task. Our code and data are available at: https://github.com/haoosz/ConceptExpress

Paper Structure

This paper contains 62 sections, 16 equations, 21 figures, and 10 tables.

Figures (21)

  • Figure 1: Unsupervised concept extraction. We focus on the unsupervised problem of extracting multiple concepts from a single image. Given an image that contains multiple concepts (e.g., the Star Wars characters C-3PO and R2-D2, and the desert), we aim to harness a frozen pretrained diffusion model to automatically learn the conceptual tokens. Using the learned conceptual tokens, we can regenerate the extracted concepts with high quality, as shown in the rightmost column. In this process, no human knowledge or aids are available, and we rely only on the inherent capabilities of the pretrained Stable Diffusion (Rombach et al., 2022).
  • Figure 2: Overview of ConceptExpress. ConceptExpress takes a multi-concept image $\mathcal{I}$ as input and learns a set of conceptual tokens. ConceptExpress consists of three key components. First, it leverages self-attention maps from the unconditional token $\varnothing$ to locate the latent concepts. Second, it constructs a token lookup table that associates each concept mask with its corresponding conceptual token $\mathtt{[V_i]}$. Finally, it optimizes each conceptual token using a masked denoising loss. The learned conceptual tokens can then be used to generate images that represent each individual concept. See the method section for more details.
  • Figure 3: Visualization. Left: we visualize the concept localization process, which involves: (1) pre-clustering that groups together semantically related regions; (2) filtering that removes non-salient regions that are not visually significant; and (3) post-clustering that integrates salient regions into instance-level concepts. Right: we visualize the token lookup table, which establishes a one-to-one correspondence between the conceptual token $\mathtt{[V_i]}$ and the learnable embedding vector $v_i$, the latent mask $\mathbf{m}_i$, and the attention map $\mathbf{f}_i$.
  • Figure 4: Split-and-merge. During the training process, we sequentially initialize conceptual tokens, train the split tokens, merge the tokens by averaging, and further fine-tune the merged tokens. Finally, the merged tokens are well-learned and effectively represent individual concepts.
  • Figure 5: Comparison with BaS$^\dag$ (Avrahami et al., 2023). We compare the concept extraction results of BaS$^\dag$ and ConceptExpress on 6 examples. For each example, we show the source image and the generated concept images. We label concepts with serial numbers for legibility.
  • ...and 16 more figures
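
The token lookup table (Figure 3) and the merge-by-averaging step of split-and-merge (Figure 4) can be sketched together as a small data structure. This is an illustration under stated assumptions, not the authors' implementation: `ConceptEntry`, `merge_split_tokens`, `build_lookup`, and the `"[V1]"`-style keys are hypothetical names, plain Python lists stand in for tensors, the number of split tokens per concept is left to the caller, and the attention map $\mathbf{f}_i$ as well as the fine-tuning that follows the merge are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ConceptEntry:
    """One row of a hypothetical token lookup table."""
    embedding: List[float]  # learnable embedding vector v_i
    mask: List[int]         # flattened binary latent mask m_i


def merge_split_tokens(split_embeddings: List[List[float]]) -> List[float]:
    """Merge several split-token embeddings by element-wise averaging."""
    if not split_embeddings:
        raise ValueError("at least one split embedding is required")
    dim = len(split_embeddings[0])
    if any(len(e) != dim for e in split_embeddings):
        raise ValueError("all split embeddings must share one dimension")
    n = len(split_embeddings)
    return [sum(e[d] for e in split_embeddings) / n for d in range(dim)]


def build_lookup(
    masks: List[List[int]],
    splits_per_concept: List[List[List[float]]],
) -> Dict[str, ConceptEntry]:
    """Associate each concept mask with one merged conceptual token."""
    if len(masks) != len(splits_per_concept):
        raise ValueError("need one group of split embeddings per mask")
    table: Dict[str, ConceptEntry] = {}
    for i, (mask, splits) in enumerate(zip(masks, splits_per_concept), start=1):
        table[f"[V{i}]"] = ConceptEntry(merge_split_tokens(splits), mask)
    return table


table = build_lookup(
    masks=[[1, 0], [0, 1]],
    splits_per_concept=[
        [[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [2.0, 2.0]],
    ],
)
```

Keying the table by token string keeps the one-to-one correspondence between token, embedding, and mask explicit, which is the property the Figure 3 caption describes.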