Exploiting Diffusion Prior for Generalizable Dense Prediction

Hsin-Ying Lee, Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang

TL;DR

This work addresses the domain gap that arises when off-the-shelf dense predictors are applied to images generated by text-to-image (T2I) diffusion models, and it does so by reusing a pre-trained T2I model as a prior. It introduces DMP, a deterministic diffusion framework that interpolates between the input image and the desired output, enabling reliable predictions for depth, normals, segmentation, and intrinsic decomposition while preserving generalization through LoRA-based fine-tuning. Trained on only ~10K labeled bedroom images, DMP produces faithful in-domain and out-of-domain predictions, often surpassing state-of-the-art baselines. The results suggest that diffusion priors can support broadly generalizable dense prediction from limited labeled data.

Abstract

Contents generated by recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf dense predictors to estimate due to the immitigable domain gap. We introduce DMP, a pipeline utilizing pre-trained T2I models as a prior for dense prediction tasks. To address the misalignment between deterministic prediction tasks and stochastic T2I models, we reformulate the diffusion process through a sequence of interpolations, establishing a deterministic mapping between input RGB images and output prediction distributions. To preserve generalizability, we use low-rank adaptation to fine-tune pre-trained models. Extensive experiments across five tasks, including 3D property estimation, semantic segmentation, and intrinsic image decomposition, showcase the efficacy of the proposed method. Despite limited-domain training data, the approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.
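
The low-rank adaptation mentioned in the abstract can be illustrated with a generic sketch. This is not the authors' implementation: the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the initialization are assumptions chosen for illustration. The pre-trained weight stays frozen and only two small matrices `A` and `B` would be trained.

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen weight plus a trainable low-rank update.

    Hypothetical illustration: rank, alpha, and init values are assumptions,
    not settings reported in the paper.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                     # trainable up-projection, zero at start
        self.scale = alpha / rank

    def __call__(self, x):
        # Frozen path plus scaled low-rank path: x W^T + scale * (x A^T) B^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        # Only A and B would receive gradient updates.
        return self.A.size + self.B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the pre-trained layer exactly, and the number of trainable parameters grows with the rank rather than with the full weight size. This is one common way to limit how far fine-tuning can drift from the prior.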

Paper Structure

This paper contains 36 sections, 8 equations, 23 figures, and 16 tables.

Figures (23)

  • Figure 1: Generalized dense prediction. (left) We leverage the pre-trained text-to-image diffusion model [Rombach_2022_CVPR] as a prior for various dense prediction tasks. (right) With only a small amount of labeled training data in a limited domain (i.e., 10K bedroom images with labels) for each task, our method performs favorably against SOTA predictors [Kar_2022_CVPR, zoedepth, EVA02] on arbitrary images.
  • Figure 2: Deterministic diffusion process. We formulate the diffusion process as a chain of interpolations between an input image $x$ and output $y$. The U-Net model is fine-tuned to gradually transform the input $x$ to the desired dense prediction $y$.
  • Figure 3: 3D property estimation of arbitrary input images. The first row shows the input images, while the remaining rows present the normals and depth estimated by different approaches. The proposed DMP method gives faithful estimates, even on images that the off-the-shelf schemes [Kar_2022_CVPR, zoedepth] fail to handle.
  • Figure 4: Qualitative results. The first row shows the input images. Each subsequent pair of rows shows the results of the off-the-shelf predictors (which we treat as pseudo ground truth) followed by those of the proposed method.
  • Figure 5: Qualitative results on semantic segmentation. The first, second, and third rows respectively show the input images, pseudo ground truth predicted by an off-the-shelf model, and our results. The out-of-domain samples in (b) are bedroom images in diverse styles.
  • ...and 18 more figures
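
The chain of interpolations described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the linear blend in `blend_weights`, the function names, and the `predict_y` callback (standing in for the fine-tuned U-Net) are all hypothetical. The point it shows is that no noise is sampled, so the same input always yields the same output.

```python
import numpy as np

def blend_weights(t, num_steps):
    """Weights of (y, x) at step t. Assumed linear; the paper's schedule may differ."""
    a = 1.0 - t / num_steps   # t = 0 -> pure output y, t = num_steps -> pure input x
    return a, 1.0 - a

def interpolate(x, y, t, num_steps):
    """Intermediate state between input image x and dense prediction y."""
    a, b = blend_weights(t, num_steps)
    return a * y + b * x

def deterministic_reverse(x, predict_y, num_steps):
    """Walk from the input image toward the prediction without sampling noise.

    predict_y(z, t) is a hypothetical stand-in for the fine-tuned network:
    it returns an estimate of y from the current state z at step t.
    """
    z = x.copy()                       # the chain starts at the input image
    for t in range(num_steps, 0, -1):
        y_hat = predict_y(z, t)        # current estimate of the target
        z = interpolate(x, y_hat, t - 1, num_steps)  # move one step toward y
    return z
```

With a perfect predictor the chain ends exactly at `y`; in practice the quality of the result depends on how well the network estimates `y` at each step.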