Scaling Properties of Diffusion Models for Perceptual Tasks

Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik

TL;DR

This paper unifies tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and shows how diffusion models benefit from scaling training and test-time compute for these perceptual tasks.

Abstract

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve competitive performance to state-of-the-art methods using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io .

Paper Structure

This paper contains 21 sections, 2 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: A Unified Framework: We fine-tune a pre-trained Diffusion Model (DM) for visual perception tasks. We take an RGB image and a conditional image (e.g., next video frame, occlusion mask), along with the noised image of the ground truth prediction. Our model generates predictions for visual tasks such as depth estimation, optical flow prediction, and amodal segmentation, based on the conditional task embedding. We train a generalist model that can perform all three tasks with exceptional performance.
  • Figure 2: Scaling at Model Size: For generative pre-training of DiT models, we observe clear power law scaling behavior as we increase the model size.
  • Figure 3: Effect of Model Size: We fine-tune a1-a6 models on the Hypersim dataset for 30K iterations with an exponential decay learning rate schedule from $3 \times 10^{-5}$ to $3 \times 10^{-7}$. We observe a strong correlation between the fine-tuning loss scaling law and validation metric scaling laws.
  • Figure 4: Effect of Scaling Model Pre-training Compute on Depth Estimation: (a) Depth Absolute Relative Error vs. MACs. (b) Depth Delta1 Error vs. MACs. We pre-train four a4 models with 60K, 80K, 100K, and 120K steps. These models are then fine-tuned for 30K steps on the Hypersim depth estimation dataset. We observe a clear power law as we increase the DiT pre-training compute across depth estimation validation metrics.
  • Figure 5: Effect of Image Resolution. We fine-tune DiT-XL and DiT-MoE L/2 models with resolutions of $256 \times 256$ and $512 \times 512$. We observe a power law when increasing image resolution during training. By scaling the number of tokens per image by 4$\times$, we achieve strong performance on Depth Absolute Error, displaying the effect of increasing total dataset tokens for dense visual perception tasks such as depth estimation.
  • ...and 6 more figures
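
The Figure 3 caption mentions an exponential decay learning rate schedule running from 3e-5 to 3e-7 over 30K fine-tuning iterations. The caption does not give the exact form of the schedule, so the following is only a minimal sketch that assumes a smooth per-step geometric interpolation between the two endpoints. The function name `exponential_decay_lr` and its parameters are hypothetical and are not taken from the authors' code.

```python
import math


def exponential_decay_lr(step, total_steps=30_000, lr_start=3e-5, lr_end=3e-7):
    """Return the learning rate at `step` under a geometric decay.

    Assumption: the rate is multiplied by a constant factor each step so that
    it equals lr_start at step 0 and lr_end at step total_steps.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    progress = step / total_steps
    return lr_start * (lr_end / lr_start) ** progress


# Endpoints and midpoint of the assumed schedule.
lr_first = exponential_decay_lr(0)
lr_mid = exponential_decay_lr(15_000)
lr_last = exponential_decay_lr(30_000)
```

Under this assumption the midpoint of training sits at the geometric mean of the two endpoints (about 3e-6), not at their arithmetic mean.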
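
Figures 2 and 4 report power-law behavior as model size or pre-training compute grows. One common way to check for such a trend is to fit a straight line in log-log space, since $y = a x^{b}$ becomes $\log y = \log a + b \log x$. The sketch below illustrates that idea only. It is not the authors' fitting procedure, the helper `fit_power_law` is hypothetical, and the data points are synthetic values invented for the example, not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a line fit in log-log space.

    Returns (a, b). All inputs must be strictly positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                      # slope = power-law exponent
    a = math.exp(mean_y - b * mean_x)  # intercept = log of the prefactor
    return a, b


# Synthetic data (NOT from the paper): error = 2.0 * compute ** -0.1
synthetic_compute = [1e9, 1e10, 1e11, 1e12]
synthetic_error = [2.0 * c ** -0.1 for c in synthetic_compute]
prefactor, exponent = fit_power_law(synthetic_compute, synthetic_error)
```

A negative fitted exponent corresponds to a metric that falls as compute rises, which is the shape one would expect for an error metric such as absolute relative error.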
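
The Figure 5 caption states that moving from $256 \times 256$ to $512 \times 512$ scales the number of tokens per image by 4$\times$. The arithmetic can be sketched as follows, assuming a latent-space transformer whose autoencoder downsamples each side by 8 and whose patch size is 2 (as the "/2" suffix suggests). Both values are assumptions for illustration, and `tokens_per_image` is a hypothetical helper. The 4$\times$ ratio itself does not depend on them.

```python
def tokens_per_image(image_size, vae_downsample=8, patch_size=2):
    """Number of transformer tokens for a square image.

    Assumptions: the image is encoded to a latent grid of side
    image_size / vae_downsample, which is then split into non-overlapping
    patch_size x patch_size patches, one token per patch.
    """
    latent_side, remainder = divmod(image_size, vae_downsample)
    if remainder:
        raise ValueError("image_size must be divisible by vae_downsample")
    patches_per_side, remainder = divmod(latent_side, patch_size)
    if remainder:
        raise ValueError("latent side must be divisible by patch_size")
    return patches_per_side * patches_per_side


tokens_low = tokens_per_image(256)   # 16 x 16 patches
tokens_high = tokens_per_image(512)  # 32 x 32 patches
token_ratio = tokens_high / tokens_low
```

Because the token count grows with the square of the side length, doubling the resolution quadruples the tokens seen per image for a fixed number of training images.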