CusConcept: Customized Visual Concept Decomposition with Diffusion Models

Zhi Xu, Shaozhe Hao, Kai Han

TL;DR

This work introduces CusConcept, a two-stage framework for customized visual concept decomposition that separates object and axis-aligned attribute concepts from a single image using diffusion models. It combines vocabulary-guided concept decomposition, leveraging LLM-derived axis vocabularies and a learnable weighting scheme to form concept centroids, with a joint refinement stage that fine-tunes token embeddings via multi-token Textual Inversion. An evaluation benchmark based on VAW-CZSL and three metrics (CLIP-I, SIMemb, ACC) demonstrates state-of-the-art performance in generation fidelity, embedding alignment, and retrieval accuracy, validating open-world concept decomposition. The method enables vocabulary prediction, concept removal, and recomposition, offering practical control over generated images and lexical outputs. Overall, CusConcept advances open-world concept manipulation in diffusion-based generation, enabling flexible, axis-driven decomposition from limited input data.

Abstract

Enabling generative models to decompose visual concepts from a single image is a complex and challenging problem. In this paper, we study a new and challenging task, customized concept decomposition, wherein the objective is to leverage diffusion models to decompose a single image and generate visual concepts from various perspectives. To address this challenge, we propose a two-stage framework, CusConcept (short for Customized Visual Concept Decomposition), to extract customized visual concept embedding vectors that can be embedded into prompts for text-to-image generation. In the first stage, CusConcept employs a vocabulary-guided concept decomposition mechanism to build vocabularies along human-specified conceptual axes. The decomposed concepts are obtained by retrieving corresponding vocabularies and learning anchor weights. In the second stage, joint concept refinement is performed to enhance the fidelity and quality of generated images. We further curate an evaluation benchmark for assessing the performance of the open-world concept decomposition task. Our approach can effectively generate high-quality images of the decomposed concepts and produce related lexical predictions as secondary outcomes. Extensive qualitative and quantitative experiments demonstrate the effectiveness of CusConcept.
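
The first stage described above, learning anchor weights over a vocabulary and combining the word embeddings into one concept embedding, can be illustrated with a minimal sketch. The softmax normalization, the toy 4-dimensional embeddings, the fixed weight values, and the function name `aggregate_concept` are all assumptions made for illustration. In the paper the weights are learned, and the exact normalization is not stated in this summary. Reading the highest-weighted word as the lexical prediction is likewise an assumption.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax (assumed normalization, not confirmed by the source).
    z = np.exp(x - np.max(x))
    return z / z.sum()

def aggregate_concept(vocab_embeddings, anchor_logits):
    """Weighted sum of vocabulary embeddings along one axis.

    vocab_embeddings: (num_words, dim) array of word embeddings.
    anchor_logits:    (num_words,) array of unnormalized anchor weights.
    Returns the aggregated embedding (the "centroid") and the normalized weights.
    """
    weights = softmax(anchor_logits)
    centroid = weights @ vocab_embeddings
    return centroid, weights

# Toy "age" vocabulary with made-up embeddings (hypothetical values).
vocab = ["young", "old", "newborn"]
emb = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
# In training these would be updated by gradient descent; here they are fixed.
logits = np.array([2.0, 0.0, -1.0])

centroid, weights = aggregate_concept(emb, logits)
# Assumed reading of the weights as a lexical prediction.
predicted_word = vocab[int(np.argmax(weights))]
```

In the second stage, an embedding like `centroid` would serve as the starting point that is then fine-tuned jointly with the embeddings of the other axes.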

Paper Structure

This paper contains 31 sections, 4 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Customized concept decomposition. Our aim is to decompose the input image into the object concept and the attribute concepts along user-specified axes. Left: We consider each visual entity to be the composition of the "object" concept and multiple "attributes" defined along different attribute axes. Each disentangled concept, including the object and its attributes, has a domain, here simplified as one-dimensional probability distributions. Right: We illustrate the learning of concept embeddings in the 2D space. (a) The words are distributed in the embedding space (gray dots), with words along the same attribute axis marked with the same color, e.g., pink for age. (b) The word embeddings are combined using a weighted sum, similar to finding the centroids (triangles) of the same color dots in the space. (c) The weighted sum embeddings are further fine-tuned into final concept embeddings, like moving from triangles to stars in the space.
  • Figure 2: Pipeline. Given an input image and user-specified attribute axes, we aim to decompose the visual concepts, including the object concept and the attribute concepts along the specified axes. Our method encompasses two stages. (1) To obtain concept vocabularies, we query an LLM (like ChatGPT [achiam2023gpt]) to derive axis-wise attribute vocabularies, and examine CLIP similarities between object nouns and the input image to derive object vocabularies. On the derived vocabularies, we train learnable anchor weights, such as $w_{d_k}$ on the $d_k$ attribute axis and $w_o$ for the object concept, to select and aggregate the corresponding token embeddings. (2) Each aggregated token embedding $u_\star$ represents the concept centroid of the object or of its attributes along a specific axis; we further fine-tune these embeddings jointly to enhance the fidelity and quality of generation. The fine-tuned token embeddings, such as $S_\star^{d_k}$ and $S_\star^o$, can be inserted into text prompts for concept generation.
  • Figure 3: Comparison between LLMs. Taking the age attribute axis as an example, we compare the words generated by GPT-4 and Claude 3.5 Sonnet.
  • Figure 4: Comparison across different orders of attributes. We present images generated with different orders of attribute axes.
  • Figure 5: Qualitative comparison. Given one input image (top row), we compare TI$\alpha$ (1st row), TI$\beta$ (2nd row), and our method (bottom row). We provide the ground-truth labels for the object and attribute concepts for reference, but note that they are not available during training for TI$\beta$ and our method. Resulting images are generated with the prompt "a photo of" followed by the text below the generated image.
  • ...and 5 more figures
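
The Figure 2 caption mentions deriving the object vocabulary by examining CLIP similarities between object nouns and the input image. A minimal sketch of that kind of similarity-based selection is given below. It uses small hand-written vectors in place of real CLIP embeddings, and the top-k selection rule, the function name `top_k_by_cosine`, and the value of `k` are assumptions for illustration rather than details confirmed by the source.

```python
import numpy as np

def top_k_by_cosine(image_vec, word_vecs, words, k):
    """Return the k words whose embeddings are most cosine-similar to the image."""
    image_unit = image_vec / np.linalg.norm(image_vec)
    word_units = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    sims = word_units @ image_unit       # cosine similarity per word
    order = np.argsort(-sims)[:k]        # indices of the k highest similarities
    return [(words[i], float(sims[i])) for i in order]

# Hypothetical stand-ins for CLIP text and image embeddings.
nouns = ["dog", "cat", "car", "tree"]
noun_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 1.0, 0.0],
])
image_vec = np.array([1.0, 0.2, 0.0])

object_vocab = top_k_by_cosine(image_vec, noun_vecs, nouns, k=2)
```

With real embeddings, the selected nouns would form the object vocabulary over which the anchor weights $w_o$ are trained.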