
FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes

Ziying Pan, Kun Wang, Gang Li, Feihong He, Yongxuan Lai

TL;DR

FineDiffusion tackles large-scale fine-grained image generation with diffusion models by fine-tuning only a TieredEmbedder, bias terms, and normalization parameters while freezing the rest of a DiT backbone. It introduces hierarchical superclass–subclass embeddings and a fine-grained classifier-free guidance sampling method that uses superclass information to improve fidelity and diversity. Experiments on iNaturalist 2021 mini and VegFru show state-of-the-art FID/LPIPS with only 1.77% of parameters updated and a 1.56x training speed-up, reaching an FID of 9.776 on 10,000 classes. The results suggest that parameter-efficient diffusion fine-tuning is practical for fine-grained generation and offers a path toward efficient deployment in large-taxonomy settings.
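
The selective fine-tuning described above can be illustrated with a minimal sketch. It is not the authors' code: the parameter names (`y_embedder`, `norm`, `.bias`) and the toy parameter sizes are assumptions made for illustration, and the real DiT layout and the reported 1.77% figure are not reproduced here. The sketch only shows the idea of selecting a small trainable subset by name while everything else stays frozen.

```python
def is_trainable(name: str) -> bool:
    """Return True for parameters that stay trainable in this sketch."""
    # Assumed naming scheme (hypothetical): the tiered class embedder lives
    # under "y_embedder", normalization layers contain "norm", and bias
    # tensors end in ".bias". Everything else is treated as frozen.
    return (
        name.startswith("y_embedder.")
        or name.endswith(".bias")
        or "norm" in name
    )


def trainable_fraction(param_counts: dict) -> float:
    """Fraction of all parameters selected for fine-tuning."""
    total = sum(param_counts.values())
    selected = sum(n for name, n in param_counts.items() if is_trainable(name))
    return selected / total


# Toy parameter table with made-up sizes, not the real DiT layout.
toy_params = {
    "y_embedder.table.weight": 40,
    "blocks.0.attn.qkv.weight": 600,
    "blocks.0.attn.qkv.bias": 10,
    "blocks.0.norm1.weight": 5,
    "blocks.0.mlp.fc1.weight": 340,
    "blocks.0.mlp.fc1.bias": 5,
}

trainable = sorted(name for name in toy_params if is_trainable(name))
fraction = trainable_fraction(toy_params)
```

In this toy table, 60 of 1,000 parameters are selected, so `fraction` is 0.06. In a real training loop the same predicate would decide which tensors receive gradients and which are stored in the fine-tuned checkpoint.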

Abstract

Class-conditional image generation based on diffusion models is renowned for producing high-quality and diverse images. However, most prior efforts focus on generating images for general categories, e.g., the 1,000 classes in ImageNet-1k. A more challenging task, large-scale fine-grained image generation, remains a frontier to explore. In this work, we present a parameter-efficient strategy, called FineDiffusion, for fine-tuning large pre-trained diffusion models to scale to fine-grained image generation with 10,000 categories. FineDiffusion significantly accelerates training and reduces storage overhead by fine-tuning only the tiered class embedder, the bias terms, and the normalization layers' parameters. To further improve image generation quality for fine-grained categories, we propose a novel sampling method that uses superclass-conditioned guidance, specifically tailored to fine-grained categories, in place of conventional classifier-free guidance sampling. Compared to full fine-tuning, FineDiffusion achieves a 1.56x training speed-up and requires storing merely 1.77% of the total model parameters, while achieving a state-of-the-art FID of 9.776 on image generation with 10,000 classes. Extensive qualitative and quantitative experiments demonstrate the superiority of our method over other parameter-efficient fine-tuning methods. The code and more generated results are available at our project website: https://finediffusion.github.io/.
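
To make the sampling idea concrete, here is a minimal sketch of the guidance combination step. Conventional classifier-free guidance extrapolates from an unconditional noise prediction toward the class-conditional one. The sketch assumes that superclass-conditioned guidance simply substitutes the superclass-conditioned prediction for the unconditional one; the abstract does not give the exact formula, so this substitution, the function name `combine_guidance`, and the toy vectors are all assumptions.

```python
from typing import List, Sequence


def combine_guidance(
    eps_target: Sequence[float],
    eps_reference: Sequence[float],
    scale: float,
) -> List[float]:
    """Guidance-style combination: reference + scale * (target - reference).

    Conventional classifier-free guidance uses the unconditional prediction
    as the reference. In this sketch (an assumption, not the paper's exact
    formula) the reference is the superclass-conditioned prediction.
    """
    return [r + scale * (t - r) for t, r in zip(eps_target, eps_reference)]


# Toy noise predictions standing in for model outputs (hypothetical values).
eps_fine = [1.0, 2.0]    # conditioned on the fine-grained class
eps_super = [0.5, 1.0]   # conditioned on the superclass only

guided = combine_guidance(eps_fine, eps_super, scale=2.0)
```

With `scale=1.0` the function returns the fine-grained prediction unchanged; larger scales push the sample further along the direction that separates the fine-grained class from its superclass.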

Paper Structure (20 sections, 1 equation, 7 figures, 4 tables)

This paper contains 20 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Examples of images generated by the FineDiffusion model. The top row shows Bird images at a resolution of 512$\times$512 pixels, generated using the iNaturalist 2021 mini dataset. The bottom two rows are at a resolution of 256$\times$256 pixels, showing images of Plant, Mammal, Arthropoda, and Ray-finned fish generated from the same dataset. Each image is from a distinct fine-grained class, and every three images in the same row belong to one superclass. The generated results of fine-grained class images under the same superclass showcase the powerful fine-grained image generation capability of our method.
  • Figure 2: Overall FID scores of fine-tuned DiT models on the iNaturalist 2021 mini dataset. The size of each data point corresponds to training time, with smaller points indicating faster training. FineDiffusion attains the best FID while requiring less computation and fewer fine-tuned parameters.
  • Figure 3: FineDiffusion keeps most parameters of the pre-trained DiT model frozen. We introduce a TieredEmbedder tailored to fine-grained categories and fine-tune only the tiered label embedding component, the bias terms, and the normalization terms, which together make up merely 1.77% of the pre-trained model's parameters. (Zoom in for the best view.)
  • Figure 4: Images generated by the FineDiffusion model after fine-tuning on the VegFru dataset. In each row, every pair of images belongs to the same superclass, namely: Melon, Eggplant, Citrus fruit, Litchies, Berry fruit, Drupe, Green-leafy vegetable, Wild vegetable and Collective fruit. Our approach effectively generates subclasses with visually similar features within the same superclass.
  • Figure 5: Comparison of generated results for several mammalian categories across different methods. Each method receives the same class label and the same random seed for sampling. Real images of each species are shown for reference. The comparison shows that FineDiffusion generates images that match the actual categories and are more photorealistic than those of the other methods. (Zoom in for the best view.)
  • ...and 2 more figures
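
The TieredEmbedder mentioned in the Figure 3 caption can be pictured with a small sketch. The captions do not say how superclass and subclass information are combined, so the version below assumes the simplest option, summing a superclass embedding and a subclass embedding. The class name `TieredEmbedderSketch`, the summation, and the random initialization are hypothetical and may differ from the authors' design.

```python
import random
from typing import List, Sequence


class TieredEmbedderSketch:
    """Toy two-level label embedder (hypothetical design, not the paper's)."""

    def __init__(self, subclass_to_superclass: Sequence[int],
                 num_superclasses: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.subclass_to_superclass = list(subclass_to_superclass)
        # One embedding row per superclass and one per subclass.
        self.super_table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                            for _ in range(num_superclasses)]
        self.sub_table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                          for _ in self.subclass_to_superclass]

    def embed_superclass(self, super_id: int) -> List[float]:
        """Embedding for a superclass-only condition."""
        return list(self.super_table[super_id])

    def embed(self, subclass_id: int) -> List[float]:
        """Embedding for a fine-grained class: superclass row + subclass row."""
        super_id = self.subclass_to_superclass[subclass_id]
        return [s + f for s, f in zip(self.super_table[super_id],
                                      self.sub_table[subclass_id])]


# Three fine-grained classes; classes 0 and 1 share superclass 0.
embedder = TieredEmbedderSketch([0, 0, 1], num_superclasses=2, dim=4)
vec = embedder.embed(1)
```

Under this assumption, subclasses of the same superclass share one component of their conditioning vector, and a superclass-only embedding is available for superclass-conditioned guidance during sampling.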