Bigger is not Always Better: Scaling Properties of Latent Diffusion Models

Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar

TL;DR

This study investigates how latent diffusion models scale, with a focus on sampling efficiency under limited inference budgets. By training a family of LDMs from 39M to 5B parameters and evaluating across sampling steps, samplers, and downstream tasks, the authors reveal that smaller models can outperform larger ones at equivalent inference costs, and that this behavior is robust to diffusion samplers and distillation. They also show that pretraining compute and downstream finetuning determine downstream performance, and that the observed efficiency trends persist across real-world super-resolution and DreamBooth tasks. The findings offer practical guidance for scaling strategies that optimize inference efficiency, enabling more accessible deployment of LDM-based systems in resource-constrained settings.

Abstract

We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architectures and inference algorithms have been shown to effectively boost the sampling efficiency of diffusion models, the role of model size -- a critical determinant of sampling efficiency -- has not been thoroughly examined. Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results. Moreover, we extend our study to demonstrate the generalizability of these findings by applying various diffusion samplers, exploring diverse downstream tasks, evaluating post-distilled models, and comparing performance relative to training compute. These findings open up new pathways for the development of LDM scaling strategies that can be employed to enhance generative capabilities within limited inference budgets.
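To make "a given inference budget" concrete, the sketch below assumes a deliberately simple cost model that is not taken from the paper: total sampling cost is the number of denoising steps multiplied by the cost of one denoiser forward pass. The model names and per-step GFLOP figures are hypothetical placeholders, chosen only to show how a fixed budget translates into more sampling steps for a smaller model.

```python
# Illustrative sketch only. The cost model (cost = steps * cost per step) is an
# assumption, and the per-step GFLOP numbers below are hypothetical, not values
# reported by the authors.

def affordable_steps(budget_gflops: float, gflops_per_step: float) -> int:
    """Return how many whole sampling steps fit into a fixed compute budget."""
    if gflops_per_step <= 0:
        raise ValueError("gflops_per_step must be positive")
    return int(budget_gflops // gflops_per_step)


# Hypothetical per-step costs for three model sizes.
HYPOTHETICAL_GFLOPS_PER_STEP = {"small": 25.0, "medium": 100.0, "large": 400.0}


def steps_by_model(budget_gflops: float, costs: dict[str, float]) -> dict[str, int]:
    """Map each model name to the step count it can afford under the budget."""
    return {name: affordable_steps(budget_gflops, cost) for name, cost in costs.items()}


if __name__ == "__main__":
    budget = 2000.0  # hypothetical total budget in GFLOPs
    for name, steps in steps_by_model(budget, HYPOTHETICAL_GFLOPS_PER_STEP).items():
        print(f"{name}: {steps} steps")
```

Under these made-up numbers the small model can run 80 steps where the large one can run only 5. This is the kind of budget-matched comparison the abstract describes. Whether the extra steps actually yield better images is an empirical question that the paper answers with FID and CLIP measurements, not with this arithmetic.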

Paper Structure (23 sections, 24 figures, 1 table)

Figures (24)

  • Figure 1: Text-to-image results from our scaled LDMs (39M - 2B), highlighting the improvement in visual quality with increased model size (note: the 39M model is the exception). All images are generated using 50-step DDIM sampling and a CFG rate of 7.5. We use representative prompts from PartiPrompts (Yu et al., 2022), including "a professional photo of a sunset behind the grand canyon.", "Dogs sitting around a poker table with beer bottles and chips. Their hands are holding cards.", "Portrait of anime girl in mechanic armor in night Tokyo.", "a teddy bear on a skateboard.", "a pixel art corgi pizza.", "Snow mountain and tree reflection in the lake.", "a propaganda poster depicting a cat dressed as french emperor napoleon holding a piece of cheese.", "a store front that has the word ‘LDMs’ written on it.", and "ten red apples.". Check our supplement for additional visual comparisons.
  • Figure 2: Our scaled latent diffusion models vary in the number of filters within the denoising U-Net. Other modules remain consistent. Smooth channel scaling (64 to 768) within residual blocks yields models ranging from 39M to 5B parameters. For downstream tasks requiring image input, we use an encoder to generate a latent code; this code is then concatenated with the noise vector in the denoising U-Net.
  • Figure 3: In text-to-image generation using 50-step DDIM sampling and a CFG rate of 7.5, we observe consistent trends across various model sizes in how quality metrics (FID and CLIP scores) relate to training compute (i.e., the total GFLOPs spent on training). Under moderate training resources, training compute is the most relevant factor dominating quality.
  • Figure 4: In $4\times$ real image super-resolution using 50-step DDIM sampling, FID and LPIPS scores reveal an interesting divergence. Model size drives FID score improvement, while training compute most impacts LPIPS score. Despite this, visual assessment (Fig. 5) confirms the importance of model size for superior detail recovery (as similarly observed in the text-to-image pretraining).
  • Figure 5: In $4\times$ super-resolution using 50-step DDIM sampling, visual quality directly improves with increased model size. As these scaled models vary in pretraining performance, the results clearly demonstrate that pretraining boosts super-resolution capabilities both quantitatively (Fig. 4) and qualitatively. Additional results are given in the supplementary material.
  • ...and 19 more figures