Semantically Consistent Video Inpainting with Conditional Diffusion Models

Dylan Green, William Harvey, Saeid Naderiparizi, Matthew Niedoba, Yunpeng Liu, Xiaoxuan Liang, Jonathan Lavington, Ke Zhang, Vasileios Lioutas, Setareh Dabiri, Adam Scibior, Berend Zwartsenberg, Frank Wood

TL;DR

This paper reframes video inpainting as a conditional generative modeling problem and presents a framework for solving such problems with conditional video diffusion models. It introduces inpainting-specific sampling schemes that capture crucial long-range dependencies in the context, and devises a novel method for conditioning on the known pixels in incomplete frames.

Abstract

Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle with tasks that require the synthesis of novel content that is not present in other frames. In this paper, we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We introduce inpainting-specific sampling schemes which capture crucial long-range dependencies in the context, and devise a novel method for conditioning on the known pixels in incomplete frames. We highlight the advantages of using a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.

Paper Structure (36 sections, 4 equations, 18 figures, 5 tables)

Figures (18)

  • Figure 1: Inpainting results on a challenging example from our BDD-Inpainting dataset. The first row shows the input to the model, with the occlusion mask shown in green. Note that the left side of the car in the center lane is never visible, and it becomes fully occluded soon after the first frame. Our method (second row) is capable of generating a plausible completion of the car and realistically propagating it through time. In contrast, in the result from the best competing method, ProPainter (third row), the car is not completed and quickly fades away.
  • Figure 2: An example inpainting task from our Traffic-Scenes dataset. The black car becomes occluded near the top of the roundabout and emerges many frames later near the bottom. Inpainting such an example requires the ability to model plausible vehicle behaviour.
  • Figure 3: Sampling schemes from FDM for generating videos of length $N=31$ while accessing only $K=8$ frames at a time. Each row of each subfigure depicts a different stage of our sampling process, starting from the top row and working down. Each column represents one video frame. Within each stage, frames shown in cyan are being sampled conditioned on the values of previously sampled frames shown in dark red. Frames shown in white are not yet generated. By the end, all frames are generated and shown in light gray.
  • Figure 4: Example model inputs during training. Left: Visualizations of a 5-frame video $\mathbf{V}$, a corresponding mask $\mathbf{M}$, and the resulting known pixel values $\mathbf{V}\odot\mathbf{M}$. Center: Collated training inputs if $\mathcal{X} = [0, 3]$ and $\mathcal{Y} = [2]$. The observations in $\mathbf{y}$ are then the whole of frame 2 and known pixel values in frames 0 and 3. The task is to predict the unknown pixel values in frames 0 and 3. Right: Inputs fed to the neural network, with noise added to pixel values in $\mathbf{x}$ but not those in $\mathbf{y}$. The task is then to predict the noise $\epsilon$. For simplicity we do not show inputs $t$, $\mathbf{M}_{\mathcal{X},\mathcal{Y}}$, or $\mathcal{X}\oplus\mathcal{Y}$.
  • Figure 5: Sampling scheme visualizations similar to Figure 3. In addition, we now also condition on observed pixel values in frames that may also contain unknown pixel values. Frames where we do so are shown in bright red; the color scheme is otherwise the same as in Figure 3.
  • ...and 13 more figures
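
The staged sampling described in the Figure 3 caption can be illustrated with a small sketch. The code below is not the paper's implementation: it assumes one simple autoregressive scheme in which the first stage samples $K$ frames and every later stage conditions on the most recent frames while sampling the next ones. The function names `autoregressive_scheme` and `render`, and the parameter `n_context`, are hypothetical and chosen only for this example. It also does not model the additional conditioning on known pixels in partially masked frames described in the Figure 5 caption.

```python
from typing import List, Tuple

# One stage = (frame indices being sampled, frame indices conditioned on).
Stage = Tuple[List[int], List[int]]


def autoregressive_scheme(n_frames: int, k: int, n_context: int) -> List[Stage]:
    """Illustrative scheme: every stage accesses at most k frames in total.

    n_context (an assumption of this sketch) is how many previously
    generated frames each later stage conditions on.
    """
    if not 0 < n_context < k:
        raise ValueError("need 0 < n_context < k")
    stages: List[Stage] = []
    done = 0  # frames 0..done-1 have been generated so far
    while done < n_frames:
        if done == 0:
            # First stage: nothing generated yet, so sample up to k frames.
            context: List[int] = []
            latent = list(range(min(k, n_frames)))
        else:
            # Later stages: condition on the last n_context generated frames
            # and sample the next k - n_context frames (fewer at the end).
            context = list(range(done - n_context, done))
            latent = list(range(done, min(done + k - n_context, n_frames)))
        stages.append((latent, context))
        done = latent[-1] + 1
    return stages


def render(stages: List[Stage], n_frames: int) -> List[str]:
    """Text analogue of the figure: one row per stage, one column per frame.

    'S' = being sampled, 'C' = conditioned on,
    'g' = generated earlier but unused in this stage, '.' = not yet generated.
    """
    rows = []
    generated = set()
    for latent, context in stages:
        row = []
        for f in range(n_frames):
            if f in latent:
                row.append("S")
            elif f in context:
                row.append("C")
            elif f in generated:
                row.append("g")
            else:
                row.append(".")
        rows.append("".join(row))
        generated.update(latent)
    return rows


if __name__ == "__main__":
    # N = 31 and K = 8 follow the Figure 3 caption; n_context = 4 is assumed.
    for row in render(autoregressive_scheme(31, 8, 4), 31):
        print(row)
```

Each printed row corresponds to one stage, read top to bottom as in the figure. Other schemes can be expressed in the same `(latent, context)` form by changing which indices each stage selects.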