ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction
Shaozhe Hao, Kai Han, Zhengyao Lv, Shihao Zhao, Kwan-Yee K. Wong
TL;DR
This work defines Unsupervised Concept Extraction (UCE), the task of automatically extracting and recreating multiple concepts from a single image without any prior knowledge of those concepts. It introduces ConceptExpress, which leverages pretrained diffusion models in two ways: (1) automatic concept localization, which aggregates diffusion self-attention and applies FINCH-based clustering to produce latent masks, and (2) concept-wise token learning, which uses a token lookup and masked denoising with a split-and-merge initialization to learn discriminative tokens. An Earth Mover’s Distance-based attention alignment is added to ensure correct cross-attention correspondences. The authors establish a dedicated evaluation protocol with concept similarity and classification accuracy metrics, and report that ConceptExpress outperforms a closely related supervised baseline both quantitatively and qualitatively while enabling text-prompted generation of individual and compositional concepts. The work advances scalable, annotation-free discovery of concept tokens and their use in controllable image generation, with implications for building large concept libraries from unlabeled imagery. Limitations include difficulty with closely related instances and low-frequency concepts, as well as requirements on input image quality and latent resolution.
Abstract
While personalized text-to-image generation has enabled the learning of a single concept from multiple images, a more practical yet challenging scenario involves learning multiple concepts within a single image. However, existing works tackling this scenario heavily rely on extensive human annotations. In this paper, we introduce a novel task named Unsupervised Concept Extraction (UCE) that considers an unsupervised setting without any human knowledge of the concepts. Given an image that contains multiple concepts, the task aims to extract and recreate individual concepts relying solely on the existing knowledge from pretrained diffusion models. To achieve this, we present ConceptExpress, which tackles UCE by unleashing the inherent capabilities of pretrained diffusion models in two aspects. Specifically, a concept localization approach automatically locates and disentangles salient concepts by leveraging spatial correspondence from diffusion self-attention; and based on the lookup association between a concept and a conceptual token, a concept-wise optimization process learns discriminative tokens that represent each individual concept. Finally, we establish an evaluation protocol tailored for the UCE task. Extensive experiments demonstrate that ConceptExpress is a promising solution to the UCE task. Our code and data are available at: https://github.com/haoosz/ConceptExpress
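To make the idea of concept-wise optimization with masked denoising (mentioned in the TL;DR) more concrete, the following is a minimal sketch of a masked denoising loss. It is not the authors' implementation: the function name `masked_denoising_loss`, the plain nested-list latent grids, and the use of a mean squared error averaged over masked cells are all assumptions made for illustration. The only point it shows is that each concept's loss is computed over that concept's latent mask alone, so the token for one concept is not penalized for regions belonging to another.

```python
from typing import List

Grid = List[List[float]]


def masked_denoising_loss(pred_noise: Grid, true_noise: Grid, mask: Grid) -> float:
    """Mean squared error between predicted and true noise over masked cells only.

    Hypothetical helper for illustration; the exact loss used by the paper
    is not specified in this section.
    """
    total = 0.0
    count = 0.0
    for p_row, t_row, m_row in zip(pred_noise, true_noise, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += m * (p - t) ** 2  # cells outside the mask contribute zero
            count += m
    if count == 0:
        return 0.0  # empty mask: no region to supervise
    return total / count


# Toy 2x2 "latent" with two concepts occupying the top and bottom rows.
pred = [[0.0, 1.0], [2.0, 3.0]]
true = [[0.0, 0.0], [0.0, 0.0]]
mask_a = [[1, 1], [0, 0]]  # assumed mask for concept A
mask_b = [[0, 0], [1, 1]]  # assumed mask for concept B

loss_a = masked_denoising_loss(pred, true, mask_a)  # (0 + 1) / 2 = 0.5
loss_b = masked_denoising_loss(pred, true, mask_b)  # (4 + 9) / 2 = 6.5
```

In an actual diffusion training loop the grids would be tensors of predicted and sampled noise, and the masks would come from the concept localization step; the toy lists here only stand in for those.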
