ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models

Fernando Julio Cendra, Kai Han

TL;DR

ICE tackles the ambiguity of visual concepts in diffusion-based T2I models by automatically extracting intrinsic concepts from a single image. It introduces a two-stage pipeline: Stage One localizes object-level concepts and masks using a CLIP-based retriever and a zero-shot segmentor within a pretrained diffusion model, and Stage Two decomposes these concepts into intrinsic attributes through object-level and intrinsic triplet losses, followed by limited refinement of the U-Net and text encoder. The approach demonstrates superior unsupervised concept extraction on UCE benchmarks, outperforming prior methods in both identity and compositional similarity, and enables precise compositional concept generation. By leveraging a single T2I model for both localization and structured learning, ICE offers a scalable, interpretable framework for disentangling object-level concepts from intrinsic attributes with practical implications for controllable image synthesis and zero-shot segmentation.

Abstract

The inherent ambiguity in defining visual concepts poses significant challenges for modern generative models, such as the diffusion-based Text-to-Image (T2I) models, in accurately learning concepts from a single image. Existing methods lack a systematic way to reliably extract the interpretable underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework that exclusively utilises a T2I model to automatically and systematically extract intrinsic concepts from a single image. ICE consists of two pivotal stages. In the first stage, ICE devises an automatic concept localization module to pinpoint relevant text-based concepts and their corresponding masks within the image. This critical stage streamlines concept initialization and provides precise guidance for subsequent analysis. The second stage delves deeper into each identified mask, decomposing the object-level concepts into intrinsic concepts and general concepts. This decomposition allows for a more granular and interpretable breakdown of visual elements. Our framework demonstrates superior performance on intrinsic concept extraction from a single image in an unsupervised manner. Project page: https://visual-ai.github.io/ice

Paper Structure

This paper contains 25 sections, 10 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: We showcase a structured approach for defining visual concepts within an image, where object-level concepts are identified and analyzed to reveal their underlying intrinsic attributes, such as object category, colour, and material. We present the ICE (Intrinsic Concept Extraction) framework, which leverages Text-to-Image (T2I) models to systematically discover these concepts, providing a more effective method for learning visual concepts.
  • Figure 2: Concept definition hierarchy illustrating how object-level concepts are decomposed into intrinsic attributes, including object category type, colour, material and other intrinsics.
  • Figure 3: Illustration of the proposed ICE (Intrinsic Concept Extraction) framework, which consists of two stages: (1) Automatic Concept Localization, where a diffusion model is employed to extract object-level concepts and their corresponding masks from an image without prior training, and (2) Structured Concept Learning, where the extracted information is further leveraged to uncover essential concepts.
  • Figure 4: Stage One: Automatic Concept Localization. Starting with an unlabelled image $\mathbf{x}$, the Image-to-Text concept extractor retrieves the top-1 text-based concept $c_i$ using CLIP encoders. A zero-shot segmentor via T2I model generates the corresponding mask $\mathbf{m}_i$, and the image is updated by removing the masked region. This process iterates until no objects remain in the image.
  • Figure 5: Stage Two: Structured Concept Learning. This stage is divided into two phases: (1) learning object-level concepts, where concept-specific ($c_i^{\text{conspec}}$) and instance-specific ($c_i^{\text{inspec}}$) tokens are learned using an object-level triplet loss $\mathcal{L}_{\text{triplet}}^{\text{obj}}$, and (2) learning intrinsic concepts, which decomposes object-level concepts into intrinsic attributes ($c_j^{\text{intrinsic}}$) using an intrinsic triplet loss $\mathcal{L}_{\text{triplet}}^{\text{intrinsic}}$. This hierarchical approach ensures accurate separation of general semantic categories from specific and intrinsic attributes.
  • ...and 6 more figures
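
The iterative loop described in the Figure 4 caption (retrieve the top-1 concept, segment its mask, remove the masked region, repeat) can be sketched in a few lines. This is only an illustrative sketch of the control flow, not the authors' implementation: the function name `localize_concepts`, the pixel-set image representation, the `max_iters` cap, and the toy retriever and segmentor are all assumptions introduced here. In ICE the retriever is CLIP-based and the segmentor is built on the T2I diffusion model; the toy stand-ins below merely look up a hand-written label map.

```python
from collections import Counter
from typing import Callable, List, Optional, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def localize_concepts(
    pixels: Set[Pixel],
    retrieve_top1: Callable[[Set[Pixel]], Optional[str]],
    segment: Callable[[Set[Pixel], str], Mask],
    max_iters: int = 10,  # hypothetical safety cap, not taken from the paper
) -> List[Tuple[str, Mask]]:
    """Pair each retrieved text concept with a mask, removing masked regions as we go."""
    remaining = set(pixels)
    found: List[Tuple[str, Mask]] = []
    for _ in range(max_iters):
        concept = retrieve_top1(remaining)
        if concept is None:  # retriever reports that no objects remain
            break
        mask = segment(remaining, concept) & remaining
        if not mask:  # an empty mask would never shrink the image, so stop
            break
        found.append((concept, mask))
        remaining -= mask  # "update the image by removing the masked region"
    return found


# Toy stand-ins (assumptions): a 2x3 "image" whose pixels carry ground-truth labels.
TOY_LABELS = {
    (0, 0): "cat", (0, 1): "cat", (0, 2): "cat",
    (1, 0): "ball", (1, 1): "ball", (1, 2): None,  # None marks background
}


def toy_retrieve(remaining: Set[Pixel]) -> Optional[str]:
    """Stand-in for the CLIP-based retriever: return the most frequent remaining label."""
    counts = Counter(TOY_LABELS[p] for p in remaining if TOY_LABELS[p] is not None)
    return counts.most_common(1)[0][0] if counts else None


def toy_segment(remaining: Set[Pixel], concept: str) -> Mask:
    """Stand-in for the zero-shot segmentor: return the pixels labelled with the concept."""
    return {p for p in remaining if TOY_LABELS[p] == concept}


if __name__ == "__main__":
    for concept, mask in localize_concepts(set(TOY_LABELS), toy_retrieve, toy_segment):
        print(concept, sorted(mask))
```

On the toy input the loop finds "cat" first and "ball" second, then stops once only background pixels remain. Passing the retriever and segmentor in as callables keeps the loop independent of any particular model, so real components could be substituted without changing the control flow.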