ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models
Fernando Julio Cendra, Kai Han
TL;DR
ICE tackles the ambiguity of visual concepts in diffusion-based T2I models by automatically extracting intrinsic concepts from a single image. It introduces a two-stage pipeline: Stage Onelocalizes object-level concepts and masks using a CLIP-based retriever and a zero-shot segmentor within a pretrained diffusion model, and Stage Two decomposes these concepts into intrinsic attributes through object-level and intrinsic triplet losses, followed by limited refinement of the U-Net and text encoder. The approach demonstrates superior unsupervised concept extraction on UCE benchmarks, outperforming prior methods in both identity and compositional similarity, and enables precise compositional concept generation. By leveraging a single T2I model for both localization and structured learning, ICE offers a scalable, interpretable framework for disentangling object-level concepts from intrinsic attributes with practical implications for controllable image synthesis and zero-shot segmentation.
Abstract
The inherent ambiguity in defining visual concepts poses significant challenges for modern generative models, such as the diffusion-based Text-to-Image (T2I) models, in accurately learning concepts from a single image. Existing methods lack a systematic way to reliably extract the interpretable underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework that exclusively utilises a T2I model to automatically and systematically extract intrinsic concepts from a single image. ICE consists of two pivotal stages. In the first stage, ICE devises an automatic concept localization module to pinpoint relevant text-based concepts and their corresponding masks within the image. This critical stage streamlines concept initialization and provides precise guidance for subsequent analysis. The second stage delves deeper into each identified mask, decomposing the object-level concepts into intrinsic concepts and general concepts. This decomposition allows for a more granular and interpretable breakdown of visual elements. Our framework demonstrates superior performance on intrinsic concept extraction from a single image in an unsupervised manner. Project page: https://visual-ai.github.io/ice
