Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models

Giannis Daras, Weili Nie, Karsten Kreis, Alex Dimakis, Morteza Mardani, Nikola Borislavov Kovachki, Arash Vahdat

TL;DR

This paper views frames as continuous functions in 2D space and videos as sequences of continuous warping transformations between frames, which allows state-of-the-art latent diffusion models such as Stable Diffusion XL to solve video inverse problems.

Abstract

Naively applying image models to video inverse problems often produces flickering, texture-sticking, and temporal inconsistency in the generated videos. To tackle these problems, we view frames as continuous functions in 2D space and videos as sequences of continuous warping transformations between frames. This perspective allows us to train function space diffusion models only on images and use them to solve temporally correlated inverse problems. The function space diffusion models need to be equivariant with respect to the underlying spatial transformations. To ensure temporal consistency, we introduce a simple post-hoc test-time guidance towards (self)-equivariant solutions. Our method allows us to deploy state-of-the-art latent diffusion models such as Stable Diffusion XL to solve video inverse problems. We demonstrate the effectiveness of our method for video inpainting and $8\times$ video super-resolution, outperforming existing techniques based on noise transformations. We provide generated video results: https://giannisdaras.github.io/warped_diffusion.github.io/.
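
The equivariance requirement in the abstract can be illustrated with a toy check. This is a minimal sketch and not the paper's implementation: the circular shift stands in for a real warping transformation, and `shift_warp`, `equivariance_error`, `pointwise`, and `position_dependent` are hypothetical names introduced here. A map `f` is equivariant under a warp when applying `f` after warping gives the same result as warping after applying `f`.

```python
from typing import Callable, List

Image = List[List[float]]


def shift_warp(img: Image, dx: int) -> Image:
    """Circularly shift every row by dx pixels (toy stand-in for a warp)."""
    w = len(img[0])
    return [[row[(j - dx) % w] for j in range(w)] for row in img]


def equivariance_error(
    f: Callable[[Image], Image],
    warp: Callable[[Image], Image],
    x: Image,
) -> float:
    """Mean squared gap between f(warp(x)) and warp(f(x))."""
    a = f(warp(x))
    b = warp(f(x))
    n = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def pointwise(img: Image) -> Image:
    """Acts on each pixel independently, so it commutes with any shift."""
    return [[2.0 * v + 1.0 for v in row] for row in img]


def position_dependent(img: Image) -> Image:
    """Adds the column index, so the output depends on absolute position."""
    return [[v + j for j, v in enumerate(row)] for row in img]


x = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
warp = lambda im: shift_warp(im, 1)

err_pointwise = equivariance_error(pointwise, warp, x)          # 0.0
err_position = equivariance_error(position_dependent, warp, x)  # 3.0
```

A residual of this kind is zero for an equivariant map and positive otherwise. The abstract's test-time guidance steers sampling towards (self)-equivariant solutions; how that guidance term is actually computed is not specified in this summary.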

Paper Structure

This paper contains 30 sections, 1 theorem, 27 equations, 12 figures, 3 tables, 1 algorithm.

Key Result

Lemma A.1

Let $x$ be a random variable with positive density $p_x \in C^1 (\mathbb{R}^k)$. Let $\sigma > 0$ and $z \sim \mathcal{N} (0,Q)$ for some positive definite matrix $Q \in \mathbb{R}^{k \times k}$ and assume that $x \perp z$. Define the random variable and let $p_y \in C^\infty (\mathbb{R}^k)$ be the density of $y$. It holds that

Figures (12)

  • Figure 1: Inpainting results for "a robot sitting on a bench". As the input video shifts smoothly, our output frames stay consistent.
  • Figure 2: Visualization of Warped Diffusion applied to video super-resolution. (a) We develop a function space diffusion model that super-resolves images given samples from a Gaussian process (GP). To extend the image model to videos, (b) we extract warping transformations between consecutive input frames using optical flow. (c) We use the flow to warp the GP sample from the previous frame. (d) To ensure temporal consistency, we introduce equivariance self-guidance in the ODE sampler.
  • Figure 3: Self-warping error w.r.t. first frame for the inpainting task as we shift the input frame.
  • Figure 4: Warping errors w.r.t. previously generated frame in latent and pixel space for the inpainting task as we shift the input frame.
  • Figure 5: Warping errors w.r.t. first generated frame (top row) and previously generated frame (bottom row) for the $8\times$ super-resolution task for real videos.
  • ...and 7 more figures
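
Steps (b) and (c) of Figure 2, and the warping errors reported in Figures 3 to 5, can be sketched on a small grid. This is a hedged illustration under stated assumptions: the flow is taken as a per-pixel `(dx, dy)` offset telling each output pixel where to sample in the previous field, and bilinear interpolation with border clamping is used for resampling; the source summary specifies neither choice. `bilinear`, `backward_warp`, and `warping_error` are hypothetical helper names.

```python
import math
from typing import List, Tuple

Field = List[List[float]]
Flow = List[List[Tuple[float, float]]]  # per-pixel (dx, dy), an assumed convention


def bilinear(img: Field, x: float, y: float) -> float:
    """Sample img at continuous coordinates (x, y), clamping to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bot = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bot


def backward_warp(prev: Field, flow: Flow) -> Field:
    """out[y][x] = prev sampled at (x + dx, y + dy)."""
    h, w = len(prev), len(prev[0])
    return [
        [bilinear(prev, x + flow[y][x][0], y + flow[y][x][1]) for x in range(w)]
        for y in range(h)
    ]


def warping_error(cur: Field, prev: Field, flow: Flow) -> float:
    """Mean squared difference between cur and prev warped by flow."""
    warped = backward_warp(prev, flow)
    n = len(cur) * len(cur[0])
    return sum((c - p) ** 2 for rc, rw in zip(cur, warped) for c, p in zip(rc, rw)) / n


# Toy "previous frame" field and a constant half-pixel horizontal flow.
prev_field = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
flow = [[(0.5, 0.0)] * 4 for _ in range(2)]

warped = backward_warp(prev_field, flow)
consistent_error = warping_error(warped, prev_field, flow)        # 0.0
inconsistent_error = warping_error(prev_field, prev_field, flow)  # > 0
```

A frame that exactly follows the flow has zero warping error against its predecessor, while a frame that ignores the motion does not. In practice the flow would come from an optical flow estimator, as the Figure 2 caption describes.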

Theorems & Definitions (2)

  • Lemma A.1
  • proof