Aligning Text to Image in Diffusion Models is Easier Than You Think

Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, Jong Chul Ye

TL;DR

SoftREPA tackles residual text–image misalignment in diffusion-based T2I models by introducing learnable soft tokens and a contrastive alignment loss, all while freezing the backbone to keep training lightweight. The method links contrastive representation learning with mutual information to explicitly boost semantic coherence between modalities. Empirically, SoftREPA improves text alignment and editing performance across multiple Stable Diffusion backbones with under 1M additional parameters and negligible speed impact. It also proves complementary to diffusion RL approaches and provides a flexible, model-agnostic augmentation to enhance multimodal generation and editing tasks.

Abstract

While recent advancements in generative modeling have significantly improved text-image alignment, some residual misalignment between text and image representations remains. Some approaches address this issue by fine-tuning models through preference optimization and similar techniques, which require tailored datasets. Orthogonal to these methods, we revisit the challenge from the perspective of representation alignment, an approach that has gained popularity with the success of REPresentation Alignment (REPA). We first argue that conventional text-to-image (T2I) diffusion models, typically trained on paired image and text data (i.e., positive pairs) by minimizing score matching or flow matching losses, are suboptimal from the standpoint of representation alignment. Instead, better alignment can be achieved through contrastive learning that uses the existing dataset to form both positive and negative pairs. To enable efficient alignment with pretrained models, we propose SoftREPA, a lightweight contrastive fine-tuning strategy that leverages soft text tokens for representation alignment. This approach improves alignment with minimal computational overhead by adding fewer than 1M trainable parameters to the pretrained model. Our theoretical analysis demonstrates that our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency. Experimental results across text-to-image generation and text-guided image editing tasks validate the effectiveness of our approach in improving the semantic consistency of T2I generative models.
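The contrastive idea in the abstract (matched image-text pairs as positives, mismatched pairs from the same batch as negatives) can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes an InfoNCE-style loss in which the logit for image i and text j is the negative denoising error of image i under text j, and the names `contrastive_alignment_loss`, `errors`, and `tau` are hypothetical.

```python
import numpy as np

def contrastive_alignment_loss(errors: np.ndarray, tau: float = 1.0) -> float:
    """InfoNCE-style loss over a batch (illustrative assumption, not the paper's code).

    errors[i, j] is assumed to hold the denoising (score or flow matching) error
    for image i when conditioned on text j. Diagonal entries are positive pairs;
    off-diagonal entries are negatives taken from the same batch.
    """
    # Lower error means a better match, so negate to obtain logits.
    logits = -errors / tau
    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    # Row-wise log-softmax over the candidate texts.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Maximize the log-probability of the matched text for every image.
    return float(-np.mean(np.diag(log_probs)))
```

Under this assumption, the loss falls when each image's error is lowest for its own caption and rises when mismatched captions explain the image equally well, which is the behavior a positive-pair-only objective does not enforce.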

Paper Structure

This paper contains 31 sections, 22 equations, 14 figures, 11 tables, 1 algorithm.

Figures (14)

  • Figure 1: Representative results for image generation and image editing. SoftREPA provides much improved text-to-image alignment by introducing a negligible number of learnable soft tokens.
  • Figure 2: Network architecture and algorithmic concept of SoftREPA. (a) Learnable soft tokens of each layer are prepended to the text features across the upper layers. (b) The soft tokens are optimized to contrastively match the score with positively conditioned predicted noise while repelling the score from negatively conditioned predicted noise. This process implicitly sharpens the joint probability distribution of images and text by reducing the log probability of negatively paired conditions.
  • Figure 3: Qualitative text-to-image generation results comparing SD3 with and without the proposed method. The prompts are taken from the COCO and Pixart datasets.
  • Figure 4: Qualitative text-guided image editing results comparing SD3 with and without the proposed method. FlowEdit (Kulikov et al., 2024) is used as the editing method for both the baseline and SoftREPA.
  • Figure 5: Quantitative comparison of baseline and SoftREPA on DIV2K and Cat2Dog editing using CLIP Score and LPIPS with various CFG scales.
  • ...and 9 more figures
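
The soft-token mechanism described in the Figure 2 caption (per-layer learnable tokens prepended to the text features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names, the initialization scale, and all shapes are hypothetical, and the backbone itself is omitted because it stays frozen.

```python
import numpy as np

def init_soft_tokens(num_layers: int, num_tokens: int, dim: int, seed: int = 0) -> np.ndarray:
    """Create one set of learnable soft tokens per layer (hypothetical initialization)."""
    rng = np.random.default_rng(seed)
    return 0.02 * rng.standard_normal((num_layers, num_tokens, dim))

def prepend_soft_tokens(text_feats: np.ndarray, layer_tokens: np.ndarray) -> np.ndarray:
    """Prepend a layer's soft tokens to a batch of text features.

    text_feats:   (batch, seq_len, dim) text features entering the layer.
    layer_tokens: (num_tokens, dim) soft tokens belonging to that layer.
    Returns an array of shape (batch, num_tokens + seq_len, dim).
    """
    batch = text_feats.shape[0]
    # Share the same soft tokens across every sample in the batch.
    tiled = np.broadcast_to(layer_tokens, (batch, *layer_tokens.shape))
    return np.concatenate([tiled, text_feats], axis=1)
```

In this sketch the only trainable quantity is the token array, whose size is `num_layers * num_tokens * dim`; keeping that product small is what would keep the added parameter count low relative to the frozen backbone.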