OmniPrism: Learning Disentangled Visual Concept for Image Generation

Yangyang Li, Daqing Liu, Wu Liu, Allen He, Xinchen Liu, Yongdong Zhang, Guoqing Jin

TL;DR

OmniPrism tackles concept disentanglement in image generation by learning language-guided representations for content, style, and composition and injecting them into diffusion models via cross-attention. It introduces a Contrastive Orthogonal Disentangled (COD) learning objective and a paired concept disentangled dataset (PCD-200K) to enforce orthogonal, non-conflicting concept representations. A learnable block embedding aligns each diffusion block with its corresponding concept domain, so multiple concepts can be combined without interfering with one another. Experiments on Stable Diffusion XL show higher fidelity to prompts and target concepts than baselines, supported by quantitative metrics and qualitative analyses. The work also contributes the PCD-200K dataset and a method for controllable multi-concept generation.

Abstract

Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes. However, existing methods are typically constrained to single-aspect concept generation or are easily disrupted by irrelevant concepts in multi-aspect concept scenarios, leading to concept confusion and hindering creative generation. To address this, we propose OmniPrism, a visual concept disentangling approach for creative image generation. Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts. We utilize the rich semantic space of a multimodal extractor to achieve concept disentanglement from given images and concept guidance. To disentangle concepts with different semantics, we construct a paired concept disentangled dataset (PCD-200K), where each pair shares the same concept, such as content, style, or composition. Our contrastive orthogonal disentangled (COD) training pipeline learns disentangled concept representations, which are then injected into additional cross-attention layers of the diffusion model for generation. A set of block embeddings is designed to adapt each block's concept domain in the diffusion model. Extensive experiments demonstrate that our method can generate high-quality, concept-disentangled results with high fidelity to text prompts and desired concepts.

Paper Structure

This paper contains 31 sections, 8 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: We propose OmniPrism, which arbitrarily disentangles and combines visual concepts. (a) Disentangled visual concept generation. Given a reference image with multiple concepts, our method can disentangle the desired concept guided by natural language, such as content names (the words shown in red in the prompts), "style", or "composition" (e.g., relation or structural features like pose), while remaining faithful to the prompts. (b) Multi-concept combination. Given two or more reference images with the corresponding concept guidance, our approach can combine all desired concepts in any combination without conflicts.
  • Figure 2: Challenges in visual concept generation. (a) Single-aspect concept generation has a limited concept space and is suitable only for a single task. (b) Previous multi-aspect concept generation methods often struggle with concept confusion. (c) We disentangle different concepts in the representation space, thereby achieving results without irrelevant concepts.
  • Figure 3: Framework of OmniPrism. (a) Given the reference image $\bm{I}_{ref}$, target prompt $\bm{T}_{tar}$, and concept guidance $\bm{T}_{cg}$, the concept extractor disentangles concept representations $\bm{f}_{cpt}$ by concatenating CLIP features $\bm{f}_{cg}$ of $\bm{T}_{cg}$ with a learnable query $\bm{q}$, and feeds $\bm{f}_{cpt}$ into additional cross-attention layers in the U-Net to generate the target image $\bm{I}_{tar}$. A learnable block embedding $\bm{e}_i$ is added to $\bm{q}$ to align the concept domain of the $i$-th diffusion block. (b) We employ an anti-query $\bm{q}_a$ to capture irrelevant concepts $\bm{f}^{a}_{cpt}$ in $\bm{I}_{ref}$, and constrain the desired concept $\bm{f}^{tar}_{cpt}$ in $\bm{I}_{tar}$ to be similar to $\bm{f}_{cpt}$ and orthogonal to $\bm{f}^{a}_{cpt}$ through Contrastive Orthogonal Disentangled (COD) Learning.
  • Figure 4: Diverse capabilities of our method. Our method supports single-concept disentangled generation from the same reference image, including different content, style, and composition. In addition, we can combine these disentangled concepts to generate results that incorporate multiple desired concepts.
  • Figure 5: Comparison with state-of-the-art methods. Our method achieves superior disentangled generation performance. It avoids introducing irrelevant concepts while achieving the highest concept fidelity, prompt fidelity, and image quality.
  • ...and 14 more figures
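
The Figure 3 caption describes two mechanisms that a small example may make more concrete: a block embedding $\bm{e}_i$ added to the learnable query $\bm{q}$, and a constraint that pulls $\bm{f}^{tar}_{cpt}$ toward $\bm{f}_{cpt}$ while keeping it orthogonal to $\bm{f}^{a}_{cpt}$. The sketch below is only an illustration under stated assumptions: the caption does not give the loss formula, so the cosine-based form, the function names (`block_query`, `cod_loss_sketch`), and the `ortho_weight` parameter are all hypothetical and are not taken from the paper. Plain Python lists stand in for feature tensors.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def block_query(q, e_i):
    """Add the i-th block embedding e_i to the learnable query q (element-wise)."""
    return [a + b for a, b in zip(q, e_i)]


def cod_loss_sketch(f_tar, f_cpt, f_anti, ortho_weight=1.0):
    """ASSUMED loss form, not the paper's equation.

    similarity term:    1 - cos(f_tar, f_cpt)   (pull toward the desired concept)
    orthogonality term: |cos(f_tar, f_anti)|    (push away from irrelevant concepts)
    """
    similarity_term = 1.0 - cosine(f_tar, f_cpt)
    orthogonality_term = abs(cosine(f_tar, f_anti))
    return similarity_term + ortho_weight * orthogonality_term


# Toy features: the desired concept and the irrelevant concept are orthogonal.
f_cpt = [1.0, 0.0, 0.0]
f_anti = [0.0, 1.0, 0.0]

# A target feature aligned with the desired concept incurs no penalty,
# while one that mixes in the irrelevant concept is penalized by both terms.
aligned = cod_loss_sketch([1.0, 0.0, 0.0], f_cpt, f_anti)
confused = cod_loss_sketch([1.0, 1.0, 0.0], f_cpt, f_anti)

# Block-conditioned query for one hypothetical diffusion block.
q_block = block_query([1.0, 2.0], [0.5, -1.0])
```

In this toy setting the "confused" feature scores worse than the "aligned" one, which is the behavior the caption describes. A real implementation would operate on batched token features from the concept extractor and would likely combine such a term with the diffusion training loss.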