EFDiT: Efficient Fine-grained Image Generation Using Diffusion Transformer Models

Kun Wang, Donglin Di, Tonghua Su, Lei Fan

TL;DR

EFDiT tackles semantic entanglement and insufficient detail in large-scale fine-grained diffusion-based generation by introducing a tiered embedder that fuses superclass and subclass labels, an FFT-based super-resolution approach during perceptual denoising, and a ProAttention mechanism for efficient self-attention. The method comprises three modules (HRIG, TE, and ProAttention) and is trained on iNaturalist 2021 and VegFru, achieving state-of-the-art or competitive FID/IS while updating only a small fraction of parameters. Experimental results show improved fine-grained fidelity and diversity, with ablations validating each component’s contribution. By producing detailed fine-grained images at reduced computational cost, the work makes diffusion-based fine-grained synthesis more scalable.

Abstract

Diffusion models are highly regarded for their controllability and the diversity of images they generate. However, class-conditional generation methods based on diffusion models often focus on more common categories. In large-scale fine-grained image generation, semantic information entanglement and insufficient detail in the generated images still persist. This paper introduces a tiered embedder for fine-grained image generation, which integrates semantic information from both superclasses and child classes, allowing the diffusion model to better incorporate semantic information and reduce semantic entanglement. To address insufficient detail in fine-grained images, we introduce super-resolution during the perceptual information generation stage, enhancing the detailed features of fine-grained images through enhancement and degradation models. Furthermore, we propose an efficient ProAttention mechanism that can be effectively implemented in the diffusion model. We evaluate our method through extensive experiments on public benchmarks, demonstrating that our approach outperforms other state-of-the-art fine-tuning methods.
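
To make the tiered-embedder idea concrete, here is a minimal sketch of one way superclass and subclass label information could be fused into a single conditioning vector. It is not the authors' implementation: the class name `TieredEmbedder`, the `sub_to_super` mapping, the random initialization, and the element-wise sum used as the fusion rule are all assumptions made for illustration, since the abstract does not specify how the two levels are combined.

```python
import random


class TieredEmbedder:
    """Hypothetical tiered embedder: one table per label level, fused by summation."""

    def __init__(self, num_super, num_sub, dim, sub_to_super, seed=0):
        rng = random.Random(seed)
        # sub_to_super[i] is the superclass index of subclass i (assumed given by the dataset taxonomy).
        self.sub_to_super = sub_to_super
        # Small random vectors stand in for learned embedding tables.
        self.super_table = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_super)]
        self.sub_table = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_sub)]

    def embed(self, sub_label):
        """Return the conditioning vector for a subclass label."""
        super_vec = self.super_table[self.sub_to_super[sub_label]]
        sub_vec = self.sub_table[sub_label]
        # Assumed fusion rule: element-wise sum of the two levels.
        return [a + b for a, b in zip(super_vec, sub_vec)]


# Four subclasses grouped under two superclasses.
embedder = TieredEmbedder(num_super=2, num_sub=4, dim=8, sub_to_super=[0, 0, 1, 1])
cond_a = embedder.embed(0)
cond_b = embedder.embed(1)
```

In this sketch, subclasses 0 and 1 share a superclass, so their conditioning vectors share a common component and differ only by their subclass vectors. That shared component is the intuition behind passing superclass information to the model alongside the fine-grained label.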

Paper Structure

This paper contains 13 sections, 12 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: High-resolution fine-grained image generation architecture diagram. The model incorporates the “Tiered Embedder” shown in the bottom right to introduce superclass information into the model, introduces the ProAttention mechanism to enhance training efficiency, and integrates the concept of super-resolution during the denoising process to generate high-quality fine-grained image pixels.
  • Figure 2: Comparison with fine-grained images generated by other algorithms.
  • Figure 3: Comparison between super-resolution and sampling in image generation.
  • Figure 4: Comparison between the ProAttention mechanism and Attention mechanism in image generation.