DP-Adapter: Dual-Pathway Adapter for Boosting Fidelity and Text Consistency in Customizable Human Image Generation
Ye Wang, Xuping Xie, Lanjun Wang, Zili Yi, Rui Ma
TL;DR
DP-Adapter tackles the challenge of preserving identity fidelity while maintaining textual consistency in personalized human image generation. It introduces two region-specific adapters—Identity-Enhancing Adapter for visually sensitive regions and Textual-Consistency Adapter for text-sensitive regions—and a Fine-Grained Feature-Level Blending module to fuse hierarchical features, enabling high-fidelity portraits that stay faithful to textual prompts. Quantitative results show state-of-the-art Face Score and competitive CLIP-IT, with qualitative evidence of natural, text-aligned images across complex backgrounds. The approach supports applications like headshot-to-full-body generation, age editing, old-photo rejuvenation, and expression control, offering a practical, safety-aware tool for personalized content creation.
Abstract
With the growing popularity of personalized human content creation and sharing, there is a rising demand for advanced techniques in customized human image generation. However, current methods struggle to simultaneously preserve the fidelity of human identity and remain consistent with textual prompts, often resulting in suboptimal outcomes. This shortcoming is primarily due to the lack of effective constraints when visual and textual prompts are integrated at the same time, which leads to harmful mutual interference that prevents either type of input from being fully expressed. Building on prior research suggesting that visual and textual conditions influence different regions of an image in distinct ways, we introduce a novel Dual-Pathway Adapter (DP-Adapter) that improves both identity fidelity and textual consistency in personalized human image generation. Our approach begins by decoupling the target human image into visually sensitive and text-sensitive regions. For visually sensitive regions, DP-Adapter employs an Identity-Enhancing Adapter (IEA) to preserve detailed identity features. For text-sensitive regions, we introduce a Textual-Consistency Adapter (TCA) to minimize visual interference and ensure the consistency of textual semantics. To seamlessly integrate these pathways, we develop a Fine-Grained Feature-Level Blending (FFB) module that efficiently combines hierarchical semantic features from both pathways, resulting in more natural and coherent synthesis outcomes. Additionally, DP-Adapter supports various innovative applications, including controllable headshot-to-full-body portrait generation, age editing, old-photo-to-reality conversion, and expression editing.
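The abstract does not specify how FFB combines the two pathways, so the following is only a minimal sketch of one plausible reading: a per-location convex combination of the two pathways' feature maps, weighted by a soft region mask and applied at each level of a feature hierarchy. The function names `blend_features` and `blend_hierarchy`, the `face_mask` input, and the blending rule itself are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List

Grid = List[List[float]]  # one feature channel on an H x W grid


def blend_features(identity_feat: Grid, text_feat: Grid, face_mask: Grid) -> Grid:
    """Mask-weighted blend of two same-sized feature maps (assumed rule).

    face_mask holds values in [0, 1]: 1 marks a visually sensitive location
    (e.g. the face), 0 marks a text-sensitive location (e.g. background).
    """
    if not (len(identity_feat) == len(text_feat) == len(face_mask)):
        raise ValueError("all inputs must have the same height")
    blended: Grid = []
    for id_row, txt_row, mask_row in zip(identity_feat, text_feat, face_mask):
        if not (len(id_row) == len(txt_row) == len(mask_row)):
            raise ValueError("all inputs must have the same width")
        # Convex combination: the identity pathway dominates where the mask
        # is high, the text pathway dominates where it is low.
        blended.append(
            [m * a + (1.0 - m) * b for a, b, m in zip(id_row, txt_row, mask_row)]
        )
    return blended


def blend_hierarchy(
    identity_feats: List[Grid], text_feats: List[Grid], masks: List[Grid]
) -> List[Grid]:
    """Apply the same blend independently at every level of a hierarchy."""
    return [
        blend_features(a, b, m)
        for a, b, m in zip(identity_feats, text_feats, masks)
    ]


if __name__ == "__main__":
    identity = [[1.0, 1.0], [1.0, 1.0]]  # hypothetical identity-pathway features
    text = [[0.0, 0.0], [0.0, 0.0]]      # hypothetical text-pathway features
    mask = [[1.0, 0.0], [0.5, 0.0]]      # hypothetical soft face mask
    print(blend_features(identity, text, mask))
```

In this sketch a soft mask (values between 0 and 1) gives a gradual transition at region boundaries, which is one simple way to avoid visible seams. Whether DP-Adapter uses a hard mask, a soft mask, or a learned blending weight is not stated in the abstract.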
