AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment

Yiheng Lin, Shifang Zhao, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei

TL;DR

AlignGen tackles misalignment between textual and visual priors in zero-shot personalized image generation. It introduces Cross-Modality Prior Alignment, which combines a learnable token $S_*$, a Deviation Extraction Module, and a selective cross-modal attention mask to align the two priors and preserve reference content without test-time fine-tuning. The method builds on a diffusion-transformer backbone (FLUX/DiT) and integrates reference imagery through redux tokens, updating only the first concept token to avoid drift. Training uses reference dropout and random concept-name substitutions to make the updated token $S_*'$ robust to misalignment, and a targeted attention mask reinforces associations between concept words and reference tokens. On the DreamBench++ benchmark, AlignGen achieves a better balance between concept preservation and prompt following than zero-shot baselines and rivals some test-time optimization approaches, with strong qualitative results and evidence of partial generalization to multi-reference scenarios.
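To make the idea of a selective cross-modal attention mask more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the token sequence is laid out as [text | reference | image] and that the mask simply blocks attention between reference tokens and every text token except the concept word. The function name `build_selective_mask`, the token layout, and the blocking rule are all hypothetical choices made for illustration.

```python
def build_selective_mask(n_text, n_ref, n_img, concept_idx):
    """Build a boolean allow-mask (True = attention permitted).

    Assumed layout of the concatenated sequence (hypothetical):
        [0, n_text)                  text tokens
        [n_text, n_text + n_ref)     reference tokens
        [n_text + n_ref, total)      noisy image tokens
    concept_idx: set of text positions holding the concept word.
    """
    total = n_text + n_ref + n_img
    mask = [[True] * total for _ in range(total)]
    ref_start, ref_end = n_text, n_text + n_ref

    for t in range(n_text):
        if t in concept_idx:
            continue  # concept word keeps its link to the reference tokens
        for r in range(ref_start, ref_end):
            # Assumption: non-concept text tokens and reference tokens
            # are blocked from attending to each other in both directions.
            mask[t][r] = False
            mask[r][t] = False
    return mask


# Example: 4 text tokens (concept word at position 1), 2 reference tokens,
# 3 image tokens.
mask = build_selective_mask(4, 2, 3, {1})
```

In a real model, a mask like this would typically be converted to additive form (0 for allowed pairs, a large negative value for blocked pairs) before the softmax. The paper defines the actual masking rule.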

Abstract

Personalized image generation aims to integrate user-provided concepts into text-to-image models, enabling the generation of customized content based on a given prompt. Recent zero-shot approaches, particularly those leveraging diffusion transformers, incorporate reference image information through a multi-modal attention mechanism. This integration allows the generated output to be influenced by both the textual prior from the prompt and the visual prior from the reference image. However, we observe that when the prompt and reference image are misaligned, the generated results exhibit a stronger bias toward the textual prior, leading to a significant loss of reference content. To address this issue, we propose AlignGen, a Cross-Modality Prior Alignment mechanism that enhances personalized image generation by: 1) introducing a learnable token to bridge the gap between the textual and visual priors, 2) incorporating a robust training strategy to ensure proper prior alignment, and 3) employing a selective cross-modal attention mask within the multi-modal attention mechanism to further align the priors. Experimental results demonstrate that AlignGen outperforms existing zero-shot methods and even surpasses popular test-time optimization approaches.
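The robust training strategy is summarized in the TL;DR as reference dropout plus random concept-name substitution. The sketch below shows one way such an augmentation step could look; it is an illustration under stated assumptions, not the paper's procedure. The function name `augment_sample`, the probabilities `p_drop` and `p_swap`, and the choice to swap the concept name inside the prompt string are all hypothetical.

```python
import random


def augment_sample(prompt, concept_name, reference, substitutes,
                   p_drop=0.1, p_swap=0.5, rng=random):
    """Apply two hypothetical training-time augmentations to one sample.

    reference dropout: with probability p_drop the reference is replaced
        by None, so the model must rely on the text alone.
    concept-name substitution: with probability p_swap the concept name in
        the prompt is replaced by a random name from `substitutes`, which
        deliberately misaligns the textual and visual priors.
    The default probabilities are placeholders, not values from the paper.
    """
    if rng.random() < p_drop:
        reference = None
    if rng.random() < p_swap:
        new_name = rng.choice(substitutes)
        # Replace only the first occurrence of the concept name.
        prompt = prompt.replace(concept_name, new_name, 1)
    return prompt, reference


# Example call with a fixed-seed generator for reproducibility.
sample = augment_sample(
    "a photo of a dog on the beach", "dog", "ref.png",
    ["cat", "teapot", "backpack"], rng=random.Random(0),
)
```

Training on deliberately mismatched name and image pairs is one plausible way to force a learnable token to carry the correction toward the visual prior, which is the role the TL;DR attributes to $S_*'$.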

Paper Structure

This paper contains 13 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Explanation of Cross-Modality Prior Misalignment. The upper branch incorporates both the visual and textual priors in the multi-modal attention mechanism, while the lower branch integrates only the textual prior.
  • Figure 2: Visualization of reconstruction of input images by the Redux model. The prior encoded in the redux token ensures that the generated images preserve the color, shape, and style of the input, but it does not retain the fine details of the subjects.
  • Figure 3: Overview of our pipeline. The prompt and reference image are first encoded into text tokens $c_{text}$ and redux tokens $c_{redux}$. The Deviation Extraction Module (DEM) then updates $c_{text}$ to $c_{text}'$, which becomes more aligned with the visual prior from the reference token $c_{ref}$. Finally, all tokens are concatenated and processed using multi-modal attention. Note that both the reference token and noisy image token share the same modules, with LoRA applied only to the reference token. Modules marked with a flame symbol are trainable, while the others remain frozen.
  • Figure 4: Qualitative comparison of the results on the DreamBench++ benchmark.
  • Figure 5: Qualitative results of two-subject generation without additional training. Note that the model is trained solely on single-subject datasets. The outputs correspond to different random seeds.
  • ...and 2 more figures