RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment

Zutao Jiang, Guian Fang, Jianhua Han, Guansong Lu, Hang Xu, Shengcai Liao, Xiaojun Chang, Xiaodan Liang

TL;DR

The paper tackles misalignment between text prompts and generated images in text-to-image diffusion models. It introduces RealignDiff, a two-stage coarse-to-fine semantic re-alignment framework that combines a caption-reward-guided global alignment with local dense-captioning and re-weighted attention to refine details. The method leverages BLIP-2 for caption generation, RAM and GPT-4 for local descriptions, Grounded-SAM for masks, and a ReFL-based training regime, building on Stable Diffusion 1.5. Empirical results on MS-COCO and ViLG-300 show substantial gains in image fidelity and semantic alignment over eight baselines, supported by quantitative metrics and human evaluations.

Abstract

Recent advances in text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from textual descriptions. However, these approaches have faced challenges in precisely aligning the generated visual content with the textual concepts described in the prompts. In this paper, we propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff, aimed at improving the alignment between text and images in text-to-image diffusion models. In the coarse semantic re-alignment phase, a novel caption reward, leveraging the BLIP-2 model, is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt. Subsequently, the fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view. Experimental results on the MS-COCO and ViLG-300 datasets demonstrate that the proposed two-stage coarse-to-fine semantic re-alignment method outperforms other baseline re-alignment techniques by a substantial margin in both visual quality and semantic similarity with the input prompt.
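To make the caption-reward idea concrete, the following is a minimal sketch of scoring how well a generated caption matches the prompt. It rests on loud assumptions: the captions are hard-coded strings standing in for the output of a captioner such as BLIP-2, and the similarity is a plain bag-of-words cosine chosen only so the example runs without model weights. The function names `bag_of_words` and `caption_reward` are hypothetical, and the paper's actual reward formulation is not given in this section.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def caption_reward(prompt: str, caption: str) -> float:
    """Cosine similarity between bag-of-words vectors, in [0, 1].

    Illustrative stand-in for a learned text-similarity score;
    this is NOT the paper's reward formula.
    """
    p, c = bag_of_words(prompt), bag_of_words(caption)
    dot = sum(p[w] * c[w] for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_c = math.sqrt(sum(v * v for v in c.values()))
    norm = norm_p * norm_c
    return dot / norm if norm else 0.0


prompt = "a green traffic light next to a yellow vase"
# Hypothetical captioner outputs for two generated images.
good_caption = "a green traffic light beside a yellow vase"
bad_caption = "a yellow vase on a table"  # main object is missing

reward_good = caption_reward(prompt, good_caption)
reward_bad = caption_reward(prompt, bad_caption)
print(reward_good > reward_bad)
```

In a reward-feedback setup, a score of this kind would be computed on the caption of each generated image, so that images whose captions drop objects named in the prompt receive a lower reward.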

Paper Structure

This paper contains 15 sections, 10 equations, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Visual comparison of generated images from various text-to-image diffusion models (ImageReward [xu2023imagereward], Structure Diffusion [feng2023trainingfree], Stable Diffusion XL [Rombach_2022_CVPR], PixArt-$\alpha$ [chen2023pixartalpha], and MidJourney). The motivation behind our proposed RealignDiff is to address the misalignments and semantic discrepancies observed in prior methods. From top to bottom: Missing Main Objects (e.g., the green traffic light in the first row is absent); Attribute Misalignment (e.g., the second row fails to paint red on the top of the yellow vase); Attribute Interchange (e.g., the third row, intended to be black and white, is not in monochrome, with the notable absence of the black toilet seat as evidence of the mix-up). RealignDiff endeavors to fix these inconsistencies, producing images that are better aligned with the provided textual prompts.
  • Figure 2: The framework of our RealignDiff approach. (a) Coarse Semantic Re-alignment enables the objects described in the given text to appear in the generated images. (b) Fine Semantic Re-alignment accurately captures the attributes and relationships of the objects. (c) Caption Reward measures the similarity between the generated caption and the given prompt. (d) The local dense caption generation module provides guidance regarding the attributes and spatial arrangements of objects within the fine semantic re-alignment stage.
  • Figure 3: The input image is aligned at a coarse level, focusing on objects. Output Image 1 illustrates the RAM process, which also generates phrases for the text input. Text Outputs 1 and 2 provide the essential parameters (aligned attributes, weighted granularity) for the final generation, which produces Output Image 2 through fine-grained alignment.
  • Figure 4: Qualitative comparison of different methods. Our method achieves the best performance regarding the quantity of objects, leakage of attributes, and the binding of attributes. More cases are provided in the Appendix.
  • Figure 5: From left to right: RealignDiff (ours), SD-v1.5 [Rombach_2022_CVPR], DenseDiffusion [kim2023dense], ImageReward [xu2023imagereward], Promptist [hao2022optimizing], StructureDiffusion [feng2023trainingfree], SD-XL [Rombach_2022_CVPR], and PixArt-$\alpha$ [chen2023pixartalpha].
  • ...and 5 more figures
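
The re-weighting attention modulation named in the abstract and in Figure 2(b) can be illustrated with a small sketch. This section does not give the paper's exact formulation, so the version below is an assumption: it adds a fixed bias to the pre-softmax cross-attention logits of prompt tokens whose object mask covers the current image location, which is one common way to strengthen token-region binding. The names `reweighted_attention`, `in_region`, and `boost` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_attention(logits, in_region, boost=1.0):
    """Add `boost` to the logits of flagged tokens, then normalize.

    Assumed modulation rule for illustration only; not the paper's equation.
    """
    adjusted = [x + boost if flag else x for x, flag in zip(logits, in_region)]
    return softmax(adjusted)


# One image location attending over four prompt tokens.
logits = [0.2, 1.0, 0.1, 0.4]
# Hypothetical mask lookup: token 2 describes the object covering this location.
in_region = [False, False, True, False]

baseline = softmax(logits)
modulated = reweighted_attention(logits, in_region, boost=2.0)
print(modulated[2] > baseline[2])
```

Because the softmax renormalizes, raising one token's logit increases its attention weight at that location and proportionally lowers the weights of the remaining tokens.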