AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment
Yiheng Lin, Shifang Zhao, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei
TL;DR
AlignGen tackles misalignment between textual and visual priors in zero-shot personalized image generation. It introduces Cross-Modality Prior Alignment, which combines a learnable token $S_*$, a Deviation Extraction Module, and a selective cross-modal attention mask to align the two priors and preserve reference content without test-time fine-tuning. The method builds on a diffusion-transformer backbone (FLUX/DiT) and integrates reference imagery through redux tokens, updating only the first concept token to avoid drift. Training uses reference dropout and random concept-name substitutions to make $S_*'$ robust to misalignment, and a targeted attention mask reinforces associations between concept words and reference tokens. On DreamBench++, AlignGen achieves a better balance between concept preservation and prompt following than zero-shot baselines and rivals some test-time optimization approaches, with strong qualitative results and evidence of partial generalization to multi-reference scenarios.
Abstract
Personalized image generation aims to integrate user-provided concepts into text-to-image models, enabling the generation of customized content based on a given prompt. Recent zero-shot approaches, particularly those leveraging diffusion transformers, incorporate reference image information through a multi-modal attention mechanism. This integration allows the generated output to be influenced by both the textual prior from the prompt and the visual prior from the reference image. However, we observe that when the prompt and reference image are misaligned, the generated results exhibit a stronger bias toward the textual prior, leading to a significant loss of reference content. To address this issue, we propose AlignGen, a Cross-Modality Prior Alignment mechanism that enhances personalized image generation by: 1) introducing a learnable token to bridge the gap between the textual and visual priors, 2) incorporating a robust training strategy to ensure proper prior alignment, and 3) employing a selective cross-modal attention mask within the multi-modal attention mechanism to further align the priors. Experimental results demonstrate that AlignGen outperforms existing zero-shot methods and even surpasses popular test-time optimization approaches.
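
To make the idea of a selective cross-modal attention mask more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a joint token sequence laid out as [text | reference | image] and assumes one possible masking rule: reference tokens and text tokens may attend to each other only at the concept-word positions, while every other pair stays unmasked. The function name `build_selective_mask`, its parameters, the sequence layout, and the rule itself are all hypothetical; the mask actually used by AlignGen may differ.

```python
def build_selective_mask(n_text, n_ref, n_img, concept_positions):
    """Build an N x N boolean attention mask over [text | reference | image] tokens.

    mask[q][k] is True when query token q may attend to key token k.
    Assumed rule (illustrative only): text tokens that are NOT concept words
    are blocked from reference tokens in both directions, so the reference
    interacts with the prompt only through the concept-word positions.
    """
    concept = set(concept_positions)
    if any(p < 0 or p >= n_text for p in concept):
        raise ValueError("concept positions must index text tokens")

    n = n_text + n_ref + n_img
    ref_range = range(n_text, n_text + n_ref)

    # Start from full attention, then remove the disallowed pairs.
    mask = [[True] * n for _ in range(n)]
    for t in range(n_text):
        if t in concept:
            continue  # concept words keep their link to the reference
        for r in ref_range:
            mask[t][r] = False  # non-concept text -> reference
            mask[r][t] = False  # reference -> non-concept text
    return mask


# Example: 4 prompt tokens (concept word at index 1), 2 reference tokens, 3 image tokens.
mask = build_selective_mask(n_text=4, n_ref=2, n_img=3, concept_positions=[1])
```

In an attention layer, a mask like this would typically be applied by setting the logits of disallowed pairs to a large negative value before the softmax. Restricting only the text and reference interaction leaves the image tokens free to read from both priors.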
