Identity-Preserving Text-to-Image Generation via Dual-Level Feature Decoupling and Expert-Guided Fusion
Kewen Chen, Xiaobin Hu, Wenqi Ren
TL;DR
This work tackles identity preservation in subject-driven text-to-image generation by addressing the incomplete disentanglement of identity from context. It introduces an Implicit-Explicit foreground-background Decoupling Module (IEDM) that combines implicit feature-level disentanglement with explicit foreground-background separation via inpainting, guided by three complementary losses $L_2$, $L_3$, and $L_4$. A Mixture-of-Experts (MoE) based Feature Fusion Module (FFM) dynamically fuses identity-irrelevant background features with identity-related foreground features, producing a refined conditioning signal $f_r$ for diffusion-based generation. The fusion is formalized as $f_r = \sum_{i=1}^{k} R(f_{com})_i \cdot \mathrm{Expert}_i(f_{com})$ with $f_{com} = f_s + f_i'$, under the constraint $\sum_{i=1}^{k} R(f_{com})_i = 1$. The training objective combines the diffusion loss $L_1$ with the decoupling losses as $L = \lambda_1 L_1 + \lambda_2 L_2 + \lambda_3 L_3 + \lambda_4 L_4$. On a DreamBooth-based benchmark, the approach demonstrates superior image quality, text alignment, and identity fidelity, while remaining storage-efficient and adaptable to scene changes. Overall, the dual-level decoupling and MoE-based fusion offer a practical, scalable pathway for high-quality, identity-preserving personalized image synthesis with limited data.
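The fusion rule $f_r = \sum_i R(f_{com})_i \cdot \mathrm{Expert}_i(f_{com})$ can be illustrated with a minimal NumPy sketch. The summary above does not specify the router or the experts, so this sketch makes assumptions: the router $R$ is a linear map followed by a softmax (which satisfies $\sum_i R(f_{com})_i = 1$), each expert is a single linear layer, and all shapes and names (`moe_fuse`, `router_w`, `expert_ws`) are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_fuse(f_s, f_i, router_w, expert_ws):
    """Fuse two feature streams with a soft mixture of experts.

    f_s, f_i  : arrays of shape (n, d), the two streams summed into f_com
    router_w  : array of shape (d, k), assumed linear router
    expert_ws : list of k arrays of shape (d, d), assumed linear experts
    """
    f_com = f_s + f_i                                  # f_com = f_s + f_i'
    gates = softmax(f_com @ router_w, axis=-1)         # R(f_com), rows sum to 1
    expert_out = np.stack([f_com @ w for w in expert_ws], axis=1)  # (n, k, d)
    f_r = (gates[..., None] * expert_out).sum(axis=1)  # gate-weighted sum over experts
    return f_r, gates

# Toy usage with random features; the sizes are arbitrary.
rng = np.random.default_rng(0)
n, d, k = 4, 8, 3
f_s = rng.normal(size=(n, d))
f_i = rng.normal(size=(n, d))
router_w = rng.normal(size=(d, k))
expert_ws = [rng.normal(size=(d, d)) for _ in range(k)]

f_r, gates = moe_fuse(f_s, f_i, router_w, expert_ws)
print(f_r.shape, gates.sum(axis=1))
```

A softmax router is one common way to meet the sum-to-one constraint; a sparse top-k router would need renormalization over the selected experts to preserve it.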
Abstract
Recent advances in large-scale text-to-image generation models have led to a surge in subject-driven text-to-image generation, which aims to produce customized images that align with textual descriptions while preserving the identity of specific subjects. Despite significant progress, current methods struggle to disentangle identity-relevant information from identity-irrelevant details in the input images, resulting in overfitting or failure to maintain subject identity. In this work, we propose a novel framework that strengthens the separation of identity-relevant and identity-irrelevant features and introduces a feature fusion mechanism to improve the quality and text alignment of generated images. Our framework consists of two key components: an Implicit-Explicit foreground-background Decoupling Module (IEDM) and a Feature Fusion Module (FFM) based on a Mixture of Experts (MoE). IEDM combines learnable adapters for implicit decoupling at the feature level with inpainting techniques for explicit foreground-background separation at the image level. FFM dynamically integrates identity-irrelevant features with identity-relevant features, yielding refined feature representations even when decoupling is incomplete. In addition, we introduce three complementary loss functions to guide the decoupling process. Extensive experiments demonstrate the effectiveness of the proposed method in enhancing image generation quality, improving flexibility in scene adaptation, and increasing the diversity of generated outputs across various textual descriptions.
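The three decoupling losses are combined with the diffusion loss as the weighted sum $L = \lambda_1 L_1 + \lambda_2 L_2 + \lambda_3 L_3 + \lambda_4 L_4$. The sketch below shows only that combination step. The function name `total_loss` and the example weights are hypothetical placeholders, since this section does not state the values of $\lambda_1, \dots, \lambda_4$ or the definitions of the individual losses.

```python
def total_loss(l1, l2, l3, l4, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum L = lam1*L1 + lam2*L2 + lam3*L3 + lam4*L4.

    l1 is the diffusion loss; l2, l3, l4 are the three decoupling losses.
    The default weights are placeholders, not values from the paper.
    """
    losses = (l1, l2, l3, l4)
    if len(lambdas) != len(losses):
        raise ValueError("expected one weight per loss term")
    return sum(lam * loss for lam, loss in zip(lambdas, losses))

# Toy usage with made-up scalar loss values and weights.
example = total_loss(1.0, 2.0, 3.0, 4.0, lambdas=(1.0, 0.5, 0.5, 0.25))
print(example)
```

Because the function uses only multiplication and addition, it would work unchanged on framework tensors that support those operators, so the weighted total could be backpropagated in a single pass.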
