FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes
Ziying Pan, Kun Wang, Gang Li, Feihong He, Yongxuan Lai
TL;DR
FineDiffusion tackles large-scale fine-grained image generation with diffusion models by fine-tuning only a TieredEmbedder, bias terms, and normalization parameters while freezing the rest of a DiT backbone. It introduces hierarchical superclass–subclass embeddings and a fine-grained classifier-free guidance sampling method that leverages superclass information to improve fidelity and diversity. Experiments on iNaturalist 2021 mini and VegFru show state-of-the-art FID/LPIPS with only 1.77% of parameters updated and a 1.56x training speed-up, achieving an FID of 9.776 on 10,000 classes. The approach demonstrates that scalable, parameter-efficient diffusion is practical for fine-grained generation and offers a path toward efficient deployment in large-taxonomy settings.
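The parameter selection described above can be illustrated with a minimal sketch. It assumes PyTorch-style dotted parameter names, and the name fragments `class_embedder` and `norm` are hypothetical placeholders rather than the authors' actual module names. The toy parameter counts are invented for illustration and are unrelated to the reported 1.77%.

```python
from typing import Mapping

# Hypothetical name fragments; a real DiT checkpoint may name these modules differently.
TRAINABLE_FRAGMENTS = ("class_embedder", "norm")


def is_trainable(param_name: str) -> bool:
    """Return True for parameters the summary lists as fine-tuned:
    the class embedder, bias terms, and normalization parameters."""
    if param_name == "bias" or param_name.endswith(".bias"):
        return True
    return any(fragment in param_name for fragment in TRAINABLE_FRAGMENTS)


def trainable_fraction(param_sizes: Mapping[str, int]) -> float:
    """Fraction of all parameters that would be updated (and stored)."""
    total = sum(param_sizes.values())
    trainable = sum(n for name, n in param_sizes.items() if is_trainable(name))
    return trainable / total


# Toy model with made-up sizes, only to show the selection rule at work.
toy_sizes = {
    "class_embedder.subclass.weight": 40,  # selected: embedder
    "blocks.0.attn.qkv.weight": 900,       # frozen: backbone weight
    "blocks.0.attn.qkv.bias": 30,          # selected: bias term
    "blocks.0.norm1.weight": 30,           # selected: normalization
}
fraction = trainable_fraction(toy_sizes)
```

In a PyTorch training script, the same predicate could be applied while iterating over `model.named_parameters()` to set `requires_grad` on each parameter, so that only the selected subset is optimized and saved.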
Abstract
Class-conditional image generation based on diffusion models is renowned for producing high-quality and diverse images. However, most prior efforts focus on generating images for general categories, e.g., the 1,000 classes in ImageNet-1k. A more challenging task, large-scale fine-grained image generation, remains largely unexplored. In this work, we present a parameter-efficient strategy, called FineDiffusion, for fine-tuning large pre-trained diffusion models to scale to large-scale fine-grained image generation with 10,000 categories. FineDiffusion significantly accelerates training and reduces storage overhead by fine-tuning only the tiered class embedder, the bias terms, and the normalization layers' parameters. To further improve the image generation quality for fine-grained categories, we propose a novel sampling method that replaces conventional classifier-free guidance with superclass-conditioned guidance specifically tailored to fine-grained categories. Compared to full fine-tuning, FineDiffusion achieves a 1.56x training speed-up and requires storing merely 1.77% of the total model parameters, while achieving a state-of-the-art FID of 9.776 on image generation with 10,000 classes. Extensive qualitative and quantitative experiments demonstrate the superiority of our method over other parameter-efficient fine-tuning methods. The code and more generated results are available at our project website: https://finediffusion.github.io/.
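The abstract does not give the sampling formula, so the following is only one plausible reading of superclass-conditioned guidance. It assumes the superclass-conditioned noise prediction takes the place of the unconditional prediction in the standard classifier-free guidance combination. The function name and the list-based inputs are hypothetical stand-ins for the model's noise predictions.

```python
from typing import List, Sequence


def superclass_guided_noise(
    eps_subclass: Sequence[float],
    eps_superclass: Sequence[float],
    scale: float,
) -> List[float]:
    """Combine two noise predictions element-wise.

    Standard classifier-free guidance extrapolates from an unconditional
    prediction toward the class-conditional one. Here (an assumption, not a
    formula quoted from the paper) the superclass-conditioned prediction is
    used as the baseline instead of the unconditional one.
    """
    if len(eps_subclass) != len(eps_superclass):
        raise ValueError("predictions must have the same length")
    return [
        sup + scale * (sub - sup)
        for sub, sup in zip(eps_subclass, eps_superclass)
    ]


# Toy values standing in for flattened noise predictions.
eps_sub = [1.0, 2.0]
eps_sup = [0.0, 1.0]
guided = superclass_guided_noise(eps_sub, eps_sup, scale=2.0)
```

Under this reading, a scale of 1 recovers the plain subclass-conditioned prediction, and larger scales push the sample away from what is generic to the superclass and toward what is specific to the subclass.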
