
Reviving ConvNeXt for Efficient Convolutional Diffusion Models

Taesung Kwon, Lorenzo Bianchi, Lennart Wittke, Felix Watine, Fabio Carrara, Jong Chul Ye, Romann Weber, Vinicius Azevedo

TL;DR

This paper introduces the fully convolutional diffusion model (FCDM), whose backbone is similar to ConvNeXt but designed for conditional diffusion modeling, and shows that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models.

Abstract

Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness--the attributes that established ConvNets as the efficient vision backbone--have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT-XL/2, FCDM-XL achieves competitive performance with 7$\times$ and 7.5$\times$ fewer training steps at 256$\times$256 and 512$\times$512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.

Paper Structure

This paper contains 44 sections, 5 equations, 20 figures, and 17 tables.

Figures (20)

  • Figure 1: Is scalability exclusive to transformers? Our Fully Convolutional Diffusion Model (FCDM) exhibits clear scalability: it is more efficient and achieves better convergence than Diffusion Transformers (DiTs). Bubble size indicates the FLOPs of each diffusion model. Models are shown across all scales, ordered by parameter count.
  • Figure 2: The Fully Convolutional Diffusion Model (FCDM) architecture. (a) Details of the ConvNeXt block. (b) Our FCDM block, which incorporates conditioning via adaptive layer normalization. (c) We train conditional latent FCDMs. The input latent is processed by multiple FCDM blocks arranged in an easily scalable U-shaped architecture.
  • Figure 3: Simple illustration of the DiCo and FCDM blocks. Both architectures share a similar high-level structure, but FCDM adopts an inverted bottleneck that expands channels for richer representations while keeping the computational cost of depthwise convolution unchanged. DiCo employs CCA with an additional 1$\times$1 convolution, whereas FCDM uses GRN, requiring no extra pointwise convolutions. FCDM also does not include DiCo’s feed-forward module, resulting in a simpler and more efficient block.
  • Figure 4: FCDM improves FID across all model scales. FID-50K over training iterations for both DiT and FCDM. Across all model scales, FCDM converges much faster.
  • Figure 5: Benchmarking class-conditional image generation performance and efficiency on ImageNet 256$\times$256. Left: FID versus total training cost. Right: FID versus throughput. One zettaFLOP corresponds to $10^{21}$ FLOPs ($10^{12}$ GFLOPs). A training iteration is assumed to cost about $3\times$ one evaluation (forward + backward to inputs + backward to weights). Red denotes fully convolutional, green denotes hybrid, and blue denotes fully transformer-based models.
  • ...and 15 more figures
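
The Figure 3 caption notes that FCDM uses GRN (Global Response Normalization) and needs no extra pointwise convolutions. As a rough illustration of why GRN is cheap, the sketch below follows the GRN formulation from ConvNeXt V2: a per-channel L2 norm over spatial positions, divisive normalization across channels, then a learned scale and shift with a residual. This is an assumption about the operator, not the authors' code; where GRN sits inside the FCDM block is not specified here, and the names `grn`, `gamma`, and `beta` are illustrative.

```python
import math


def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization on a channels-last feature map.

    x: nested list of shape [H][W][C]; gamma, beta: length-C lists.
    Follows the ConvNeXt V2 formulation (assumed, not taken from FCDM code).
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])

    # Global aggregation: per-channel L2 norm over all spatial positions.
    g = [
        math.sqrt(sum(x[i][j][c] ** 2 for i in range(height) for j in range(width)))
        for c in range(channels)
    ]

    # Divisive normalization: each channel's norm relative to the channel mean.
    mean_g = sum(g) / channels
    n = [g_c / (mean_g + eps) for g_c in g]

    # Calibration with learned scale/shift, plus the residual connection.
    return [
        [
            [gamma[c] * x[i][j][c] * n[c] + beta[c] + x[i][j][c] for c in range(channels)]
            for j in range(width)
        ]
        for i in range(height)
    ]


# Toy 1x1 feature map with two channels (values are arbitrary).
features = [[[3.0, 1.0]]]
out = grn(features, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

The only learned parameters are the per-channel `gamma` and `beta`, and the operator involves no convolution or matrix multiply, which is consistent with the caption's point that GRN adds no pointwise convolutions.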
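
The cost accounting in the Figure 5 caption can be sketched as a small calculation. The conversion ($10^{21}$ FLOPs per zettaFLOP, $10^{12}$ GFLOPs per zettaFLOP) and the $3\times$ training multiplier come from the caption. The function name, the assumption that cost scales linearly with batch size and step count, and the numbers in the usage example are hypothetical placeholders, not values from the paper.

```python
GFLOP = 1e9        # FLOPs in one GFLOP
ZETTAFLOP = 1e21   # FLOPs in one zettaFLOP (= 1e12 GFLOPs)


def training_cost_zettaflops(forward_gflops, batch_size, steps, train_multiplier=3.0):
    """Estimate total training cost in zettaFLOPs.

    forward_gflops: GFLOPs for one forward evaluation of one sample.
    train_multiplier: training-step cost relative to one evaluation
        (about 3x: forward + backward to inputs + backward to weights).
    Assumes cost is linear in batch size and number of steps.
    """
    flops_per_sample = forward_gflops * GFLOP * train_multiplier
    return flops_per_sample * batch_size * steps / ZETTAFLOP


# Hypothetical placeholder values, chosen only to exercise the formula.
cost = training_cost_zettaflops(forward_gflops=100.0, batch_size=256, steps=1_000_000)
print(f"{cost:.4f} zettaFLOPs")
```

Under this accounting, halving the per-evaluation FLOPs and cutting the number of training steps both reduce the total proportionally, so the two savings multiply.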