Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis

Wan-Cyuan Fan, Yen-Chun Chen, Dongdong Chen, Yu Cheng, Lu Yuan, Yu-Chiang Frank Wang

TL;DR

Frido is a multi-scale, coarse-to-fine diffusion framework. It encodes an image into a feature pyramid of latent maps with MS-VQGAN and denoises them with a single shared PyU-Net. Generation proceeds progressively from high-level structure to low-level details. Frido reports state-of-the-art FID scores on five benchmarks spanning layout-to-image, scene-graph-to-image, and label-to-image, and it is also applied to text-to-image and unconditional generation, with improved inference efficiency over traditional diffusion models. The key components are the MS-VQGAN-based multi-scale latent encoding and the coarse-to-fine modulation inside one parameter-efficient denoiser.

Abstract

Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, properly describing both global image structure and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model performing a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector quantized features, followed by a coarse-to-fine gating for producing the image output. During the above multi-scale representation learning stage, additional input conditions like text, scene graph, or image layout can be further exploited. Thus, Frido can also be applied to conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditional and conditional image generation tasks, ranging from text-to-image synthesis, layout-to-image, and scene-graph-to-image to label-to-image. More specifically, we achieve state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at https://github.com/davidhalladay/Frido.
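
The coarse-to-fine denoising order described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `denoise_step` signature, and the convention that index 0 is the coarsest level are all assumptions made for illustration, and scalars stand in for the latent feature maps.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: (all latents, scale index, timestep) -> updated latent.
DenoiseStep = Callable[[List[float], int, int], float]


def coarse_to_fine_sample(
    latents: List[float],
    denoise_step: DenoiseStep,
    steps_per_scale: int,
) -> List[Tuple[int, int]]:
    """Denoise one scale at a time; latents[0] is the coarsest level (assumed)."""
    visited = []
    for scale in range(len(latents)):            # coarse -> fine
        for t in range(steps_per_scale, 0, -1):  # T, T-1, ..., 1
            # One shared denoiser handles every (scale, t) pair. It receives
            # the full list, so finer levels can see finished coarser ones.
            latents[scale] = denoise_step(latents, scale, t)
            visited.append((scale, t))
    return visited


def toy_denoise_step(latents: List[float], scale: int, t: int) -> float:
    """Stand-in for a shared network: simply halves the current latent."""
    return latents[scale] * 0.5


latents = [8.0, 8.0]  # two scales, both starting from "noise"
order = coarse_to_fine_sample(latents, toy_denoise_step, steps_per_scale=3)
print(order)
print(latents)
```

The point of the sketch is only the loop order: every timestep of the coarser level finishes before the finer level starts, and the same callable is reused for all levels.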

Paper Structure

This paper contains 57 sections, 9 equations, 17 figures, 14 tables.

Figures (17)

  • Figure 1: Illustration of Frido. Given a cross-modal condition, Frido generates images in ① a coarse-to-fine manner from structure to object details, producing outputs with ② high semantic correctness and quality. Note that existing models such as the LDMs are not designed to distinguish between high/low-level visual information.
  • Figure 2: Generated examples of Frido on various tasks. From left to right: text-to-image (T2I) on COCO 2014, scene-graph-to-image (SG2I) on Visual Genome, layout-to-image (Layout2I) on COCO-stuff, and unconditional image generation on Landscape, CelebA, and LSUN-bed. All images are 256×256 resolution. Note that, for conditional generation, we adopt classifier-free guidance with s=1.5 at test time. Please refer to the supplementary material for the full-scale version.
  • Figure 3: Overview of Frido (best viewed in color). Part (a) illustrates how MS-VQGAN encodes an image into multi-scale feature maps $\mathbf{z}^1_0, \mathbf{z}^2_0$. Quantization enables VQ-VAE learning, and fusion merges the representations from high to low level so that the decoder can reconstruct an image. The upper half of (b) demonstrates the coarse-to-fine process, where denoising is completed for the high-level feature map first and then for the lower-level one. The lower half of (b) details each denoising step. A U-Net is shared not only across timesteps $t$ but also across scale levels $s$. Coarse-to-fine gating is explained in Figure 4.
  • Figure 4: Framework of coarse-to-fine modulation in PyU-Net. Note that we ignore some intermediate convolution layers and SiLU layers for simplification.
  • Figure 5: Model ablation on COCO T2I and VG SG2I. CFM denotes our coarse-to-fine modulation.
  • ...and 12 more figures
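
The Figure 2 caption mentions classifier-free guidance with s=1.5. A minimal sketch of how such a guidance scale is commonly applied follows. It assumes the convention in which s=1 recovers the purely conditional prediction; the source does not state which convention Frido uses, and the function name and list-based inputs are illustrative only.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    s: float = 1.5,
) -> List[float]:
    """Combine unconditional and conditional noise predictions.

    Assumed convention: eps = eps_uncond + s * (eps_cond - eps_uncond),
    so s = 0 is unconditional, s = 1 is conditional, s > 1 extrapolates.
    """
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]


guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], s=1.5)
print(guided)
```

With s above 1, the guided prediction moves past the conditional one in the direction away from the unconditional one, which is why larger scales tend to follow the condition more strongly.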