The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek Kar, Mohammad Norouzi, Deqing Sun, David J. Fleet

TL;DR

This paper demonstrates that denoising diffusion models can perform optical flow and monocular depth estimation without task-specific architectures or losses, and can provide Monte Carlo uncertainty estimates via sampling. It introduces DDVM, a simple image-to-image diffusion framework trained with multi-task self-supervised pretraining followed by supervised training on combined synthetic and real data, augmented by infilling and step-unrolled denoising to handle noisy, incomplete ground truth, plus coarse-to-fine refinement. The approach achieves state-of-the-art zero-shot and competitive finetuned results on benchmarks such as NYU depth v2 ($\text{REL}=0.074$) and KITTI flow ($\text{Fl-all}=3.26\%$), while also capturing multi-modality and enabling missing-value imputation. These findings suggest diffusion models can serve as a generic, effective framework for dense vision tasks, with practical benefits in uncertainty quantification and potential for 3D scene generation conditioned on text or images.

Abstract

Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel in estimating optical flow and monocular depth, surprisingly, without task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy, incomplete training data, and a simple form of coarse-to-fine refinement, one can train state-of-the-art diffusion models for depth and optical flow estimation. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality, and impute missing values. Our model, DDVM (Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26% on the KITTI optical flow benchmark, about 25% better than the best published method. For an overview see https://diffusion-vision.github.io.
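
To make the two headline numbers concrete, the following is a minimal sketch of how a relative depth error and an Fl-all outlier rate are commonly computed. It assumes the usual definitions (mean absolute relative error for depth; the KITTI 2015 convention that a flow vector is an outlier when its endpoint error exceeds both 3 px and 5% of the ground-truth magnitude). The function names `abs_rel_error` and `fl_all`, and the default validity masks, are illustrative and not taken from the paper.

```python
import numpy as np


def abs_rel_error(pred_depth, gt_depth, valid=None):
    """Mean of |pred - gt| / gt over valid pixels."""
    pred = np.asarray(pred_depth, dtype=np.float64)
    gt = np.asarray(gt_depth, dtype=np.float64)
    if valid is None:
        # Assumption: non-positive depth marks missing ground truth.
        valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def fl_all(pred_flow, gt_flow, valid=None):
    """Percentage of valid pixels whose flow counts as an outlier.

    Flow arrays have shape (H, W, 2). A pixel is an outlier when its
    endpoint error is above 3 px AND above 5% of the ground-truth
    flow magnitude (assumed KITTI 2015 convention).
    """
    pred = np.asarray(pred_flow, dtype=np.float64)
    gt = np.asarray(gt_flow, dtype=np.float64)
    epe = np.linalg.norm(pred - gt, axis=-1)  # endpoint error per pixel
    mag = np.linalg.norm(gt, axis=-1)         # ground-truth magnitude
    if valid is None:
        valid = np.ones(epe.shape, dtype=bool)
    outlier = (epe > 3.0) & (epe > 0.05 * mag)
    return float(100.0 * np.mean(outlier[valid]))
```

Both functions take an optional boolean `valid` mask because depth and flow ground truth are often sparse; pixels outside the mask do not contribute to either metric.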

Paper Structure (27 sections, 1 equation, 17 figures, 18 tables, 1 algorithm)

Figures (17)

  • Figure 1: Examples of multi-modal prediction on depth (NYU) and optical flow (Sintel and KITTI). Each row shows an input image (or overlaid image pair for optical flow), a variance heat map from 8 samples, and 3 individual samples. Our model captures multi-modality in uncertain/ambiguous cases, such as reflective (e.g. mirror on NYU), transparent (e.g. vehicle window on KITTI), and translucent (e.g. fog on Sintel) regions. High variance also occurs at object boundaries, which are often challenging cases for optical flow; for depth, it also partially originates from noisy ground truth measurements. See the supplementary figures of multi-modal samples (depth on NYU and KITTI, flow on KITTI and Sintel) for more examples.
  • Figure 2: Training architecture. Given ground truth flow/depth, we first infill missing values using interpolation. Then, we add noise to the label map and train a neural network to model the conditional distribution of the noise given the RGB image(s), noisy label, and time step. One can optionally unroll the denoising step(s) during training (with stop gradient) to bridge the distribution gap between training and inference for $y_t$.
  • Figure 3: Effects of adding synthetic datasets in pretraining. Diffusion models trained only with AutoFlow (AF) tend to provide very coarse flow estimates and can hallucinate shapes. The addition of FlyingThings (FT), Kubric (KU), and TartanAir (TA) removes the AF-induced bias toward polygonal regions and significantly improves flow quality on fine detail, e.g. trees, thin structures, and motion boundaries.
  • Figure 4: Visual results comparing RAFT with our method after pretraining. Note that our method does much better on fine details and ambiguous regions.
  • Figure 5: Visual results comparing RAFT with our method after finetuning. Ours does much better on fine details and ambiguous regions.
  • ...and 12 more figures
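
The training pipeline in the Figure 2 caption and the variance heat maps in the Figure 1 caption can be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the authors' implementation: row-wise linear interpolation stands in for the paper's infilling, the forward process is the standard $y_t = \sqrt{\gamma}\,y_0 + \sqrt{1-\gamma}\,\epsilon$ form, the unrolling step follows one plausible reading of the caption, and `predict_eps` is a hypothetical callable standing in for the denoising network (which in practice would also receive the RGB image(s) and the time step).

```python
import numpy as np


def infill_rows(label, valid):
    """Fill invalid pixels by linear interpolation along each row.

    Assumption: a simple stand-in for the interpolation-based infilling
    mentioned in the caption. Rows with no valid pixel are left unchanged.
    """
    filled = np.asarray(label, dtype=np.float64).copy()
    cols = np.arange(filled.shape[1])
    for r in range(filled.shape[0]):
        known = valid[r]
        if known.any():
            filled[r] = np.interp(cols, cols[known], filled[r, known])
    return filled


def noisy_label(y0, gamma, rng):
    """Forward diffusion: return (y_t, eps) for noise level gamma in (0, 1)."""
    eps = rng.standard_normal(y0.shape)
    y_t = np.sqrt(gamma) * y0 + np.sqrt(1.0 - gamma) * eps
    return y_t, eps


def unrolled_inputs(y0, y_t, eps, gamma, predict_eps):
    """One step of unrolling: rebuild y_t from the model's own estimate.

    In an autodiff framework the prediction would be wrapped in a
    stop-gradient; NumPy has no gradients, so it is implicit here.
    """
    eps_hat = predict_eps(y_t)
    y0_hat = (y_t - np.sqrt(1.0 - gamma) * eps_hat) / np.sqrt(gamma)
    y_t_new = np.sqrt(gamma) * y0_hat + np.sqrt(1.0 - gamma) * eps
    # Noise target that maps the new y_t back to the ground-truth label.
    eps_target = (y_t_new - np.sqrt(gamma) * y0) / np.sqrt(1.0 - gamma)
    return y_t_new, eps_target


def sample_variance(samples):
    """Per-pixel variance across a list of sampled predictions."""
    return np.var(np.stack(samples), axis=0)
```

With a perfect noise predictor, `unrolled_inputs` returns the original `y_t` and `eps` unchanged; with an imperfect one, the network is trained on inputs closer to what it sees during sampling, which is the distribution gap the caption refers to. Calling `sample_variance` on 8 sampled maps gives a heat map of the kind shown in Figure 1.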