DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation

Zhizhong Wang, Tianyi Chu, Zeyi Huang, Nanyang Wang, Kehan Li

TL;DR

This work addresses three challenges in zero-shot personalized text-to-image generation (PT2I): balancing Concept Preservation (CP) against Prompt Following (PF), retaining fine-grained details of the reference concept, and scaling to multi-subject compositions. It introduces DynaIP, a plug-in adapter for MM-DiT that combines two components. A Dynamic Decoupling Strategy (DDS) separates concept-specific from concept-agnostic information, which improves the CP-PF balance and the scalability to multi-subject cases. A Hierarchical Mixture-of-Experts Feature Fusion Module (HMoE-FFM) fuses CLIP's multi-layer features through dynamic expert routing, which improves detail fidelity and makes the visual granularity tunable. Although trained only on single-subject data, the approach performs strongly on both single- and multi-subject PT2I benchmarks; ablations and user studies support the contribution of both components and their compatibility with base model extensions.

Abstract

Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent line of work integrates an image prompt adapter to enable zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: (1) the elusive balance between Concept Preservation (CP) and Prompt Following (PF), (2) the difficulty of retaining fine-grained concept details from reference images, and (3) limited scalability to multi-subject personalization. To tackle these challenges, we present Dynamic Image Prompt Adapter (DynaIP), a plugin that enhances the fine-grained concept fidelity, CP-PF balance, and subject scalability of SOTA T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibits decoupling learning behavior when reference image features are injected into its dual branches via cross-attention. Based on this, we design a Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly enhancing the CP-PF balance and further improving the scalability to multi-subject compositions. Moreover, we identify the visual encoder as a key factor affecting fine-grained CP and reveal that the hierarchical features of the commonly used CLIP can capture visual information at diverse granularity levels. Therefore, we introduce a Hierarchical Mixture-of-Experts Feature Fusion Module to fully leverage the hierarchical features of CLIP, markedly improving fine-grained concept fidelity while also providing flexible control of visual granularity. Extensive experiments across single- and multi-subject PT2I tasks verify that DynaIP outperforms existing approaches, marking a notable advancement in the field of PT2I generation.
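
The idea of fusing hierarchical encoder features with routed weights can be illustrated with a minimal sketch. This is not the paper's HMoE-FFM: the abstract does not specify its architecture, so the softmax gate, the function names (`softmax`, `fuse_hierarchical_features`), and the `manual_coeffs` override are all assumptions made for illustration. Each "level" stands in for a feature vector taken from one CLIP layer; the gate turns per-level scores into weights, and a manual override mimics user-side control of visual granularity.

```python
import math
from typing import List, Optional, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_hierarchical_features(
    level_features: Sequence[Vector],
    router_logits: Sequence[float],
    manual_coeffs: Optional[Sequence[float]] = None,
) -> Vector:
    """Return a weighted sum of per-level feature vectors.

    Hypothetical helper: weights come from a softmax over router_logits,
    unless manual_coeffs is given, in which case those are used as-is.
    """
    if len(level_features) != len(router_logits):
        raise ValueError("need one router logit per feature level")
    if manual_coeffs is not None:
        weights = list(manual_coeffs)
    else:
        weights = softmax(router_logits)
    if len(weights) != len(level_features):
        raise ValueError("need one coefficient per feature level")

    dim = len(level_features[0])
    fused = [0.0] * dim
    for weight, feature in zip(weights, level_features):
        if len(feature) != dim:
            raise ValueError("all feature levels must share one dimension")
        for i, value in enumerate(feature):
            fused[i] += weight * value
    return fused


# Toy stand-ins for shallow, middle, and deep layer features.
levels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal logits give equal weights, so the result is the mean of the levels.
routed = fuse_hierarchical_features(levels, [0.0, 0.0, 0.0])

# Manual coefficients select only the first (shallowest) level.
shallow_only = fuse_hierarchical_features(
    levels, [0.0, 0.0, 0.0], manual_coeffs=[1.0, 0.0, 0.0]
)
```

In a real adapter the logits would come from a learned router and the features would be token sequences rather than single vectors; the sketch only shows how routed or hand-set coefficients change which levels dominate the fused representation.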

Paper Structure

This paper contains 31 sections, 9 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Representative results showcase the capabilities of DynaIP in: (a) Scalable zero-shot personalized text-to-image generation, spanning single-subject to multi-subject, trained solely on single-subject datasets. (b) Flexible control of the visual granularity of concept preservation, enabled by modulating fusion coefficients for image features across hierarchical levels. (c) Native compatibility with base model extensions, unlocking diverse application scenarios.
  • Figure 2: Limitations of existing adapter-based PT2I methods (e.g., IP-Adapter, FLUX IP-Adapter, and DisEnvisioner), including (a) an irreconcilable trade-off between CP and PF, (b) loss of fine-grained concept details, and (c) restricted scalability when directly extending single-subject PT2I (SS-PT2I) to multi-subject PT2I (MS-PT2I) via mask-guided feature injection. Our proposed DynaIP addresses all these challenges.
  • Figure 3: Training and inference pipeline of (a-b) vanilla IP-Adapter and (c-d) our DynaIP.
  • Figure 4: Left: Architecture of our proposed HMoE-FFM. Right: Personalization results generated by injecting features from different layers of CLIP via cross-attentions, demonstrating that CLIP's hierarchical features can capture visual information at diverse granularity levels.
  • Figure 5: Qualitative comparisons on single-subject (top) and multi-subject (bottom) PT2I generation.
  • ...and 13 more figures