Identity-Preserving Text-to-Image Generation via Dual-Level Feature Decoupling and Expert-Guided Fusion

Kewen Chen, Xiaobin Hu, Wenqi Ren

TL;DR

This work tackles identity preservation in subject-driven text-to-image generation by addressing the incomplete disentanglement of identity from context. It introduces a dual-level foreground-background decoupling module (IEDM) that combines implicit feature-level disentanglement with explicit foreground-background separation via inpainting, guided by complementary losses $L_2$, $L_3$, and $L_4$. A Mixture-of-Experts (MoE) based Feature Fusion Module (FFM) dynamically fuses identity-irrelevant background features with identity-related foreground features, producing a refined conditioning signal $f_r$ for diffusion-based generation. The fusion is formalized as $f_r = \sum_{i=1}^{k} R(f_{com})_i \cdot \mathrm{Expert}_i(f_{com})$ with $f_{com} = f_s + f_i'$, under the constraint $\sum_{i=1}^{k} R(f_{com})_i = 1$. The training objective combines the diffusion loss $L_1$ with the decoupling losses through $L = \lambda_1 L_1 + \lambda_2 L_2 + \lambda_3 L_3 + \lambda_4 L_4$. The approach demonstrates superior image quality, text alignment, and identity fidelity on a DreamBooth-based benchmark, while remaining storage-efficient and adaptable to scene changes. Overall, the dual-level decoupling and MoE-based fusion offer a practical, scalable pathway for high-quality, identity-preserving personalized image synthesis with limited data.
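
The fusion rule and the weighted objective stated in this summary can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The softmax router, the linear router weights, the toy experts, and the names `moe_fuse` and `total_loss` are all assumptions made for illustration; the summary only specifies that the routing weights $R(f_{com})_i$ sum to 1 and that the losses are combined linearly.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax; its outputs are non-negative and sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_fuse(
    f_s: Vector,
    f_i_prime: Vector,
    router_weights: Sequence[Sequence[float]],
    experts: Sequence[Callable[[Vector], Vector]],
) -> Tuple[Vector, Vector]:
    """Compute f_r = sum_i R(f_com)_i * Expert_i(f_com) with f_com = f_s + f_i'.

    Assumption: the router R is a linear map followed by softmax, which is one
    common way to satisfy the constraint sum_i R(f_com)_i = 1.
    """
    # Element-wise sum of the two feature vectors: f_com = f_s + f_i'
    f_com = [a + b for a, b in zip(f_s, f_i_prime)]

    # One router logit per expert (hypothetical linear router).
    logits = [sum(w * x for w, x in zip(row, f_com)) for row in router_weights]
    gates = softmax(logits)

    # Each expert transforms the combined feature independently.
    outputs = [expert(f_com) for expert in experts]

    # Gate-weighted sum of the expert outputs, one dimension at a time.
    f_r = [
        sum(g * out[d] for g, out in zip(gates, outputs))
        for d in range(len(f_com))
    ]
    return f_r, gates


def total_loss(losses: Sequence[float], lambdas: Sequence[float]) -> float:
    """L = lambda_1*L_1 + lambda_2*L_2 + lambda_3*L_3 + lambda_4*L_4."""
    return sum(lam * loss for lam, loss in zip(lambdas, losses))


if __name__ == "__main__":
    # Two toy experts (hypothetical): identity and doubling.
    toy_experts = [lambda v: list(v), lambda v: [2.0 * x for x in v]]
    # Zero router weights give equal logits, hence uniform gates.
    fused, gates = moe_fuse([1.0, 2.0], [3.0, 4.0], [[0.0, 0.0], [0.0, 0.0]], toy_experts)
    print(fused, gates)
```

In a real model the experts and the router would be learned networks operating on batched tensors. The sketch only shows how the gate constraint and the weighted sum fit together.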

Abstract

Recent advances in large-scale text-to-image generation models have led to a surge in subject-driven text-to-image generation, which aims to produce customized images that align with textual descriptions while preserving the identity of specific subjects. Despite significant progress, current methods struggle to disentangle identity-relevant information from identity-irrelevant details in the input images, resulting in overfitting or failure to maintain subject identity. In this work, we propose a novel framework that improves the separation of identity-related and identity-unrelated features and introduces an innovative feature fusion mechanism to improve the quality and text alignment of generated images. Our framework consists of two key components: an Implicit-Explicit foreground-background Decoupling Module (IEDM) and a Feature Fusion Module (FFM) based on a Mixture of Experts (MoE). IEDM combines learnable adapters for implicit decoupling at the feature level with inpainting techniques for explicit foreground-background separation at the image level. FFM dynamically integrates identity-irrelevant features with identity-related features, enabling refined feature representations even in cases of incomplete decoupling. In addition, we introduce three complementary loss functions to guide the decoupling process. Extensive experiments demonstrate the effectiveness of our proposed method in enhancing image generation quality, improving flexibility in scene adaptation, and increasing the diversity of generated outputs across various textual descriptions.

Paper Structure

This paper contains 15 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Example images generated by our proposed method. Our approach produces high-quality images that maintain identity consistency while aligning with the input text prompts.
  • Figure 2: Overview of our proposed method. The framework consists of the Implicit-Explicit foreground-background Decoupling Module (IEDM) for separating identity-related and identity-irrelevant features, and the Mixture of Experts (MoE)-based Feature Fusion Module (FFM) for refining the combined feature representations. The process begins with a text prompt that generates identity-related features, followed by dual-level decoupling of the input image to extract identity-irrelevant background features. These features are then integrated through the MoE-based FFM, and the refined feature representations are used as conditioning input for the U-Net denoising process to produce high-quality images.
  • Figure 3: Qualitative results. We compare our approach with current state-of-the-art methods, including Textual Inversion, DreamBooth, AttnDreamBooth, DisenBooth, and TextBoost, on the DreamBooth dataset. Our method performs well across multiple objects and animals, generating high-quality images with strong identity preservation and text alignment.
  • Figure 4: Visualization of ablation results. We applied the prompt "a photo of a V* stuffed animal in the snow" to the subject "bear plushie" to illustrate the impact of the different components of our proposed method.