Coherent Video Inpainting Using Optical Flow-Guided Efficient Diffusion

Bohai Gu, Hao Luo, Song Guo, Peiran Dong, Qihua Zhou

TL;DR

FloED tackles the challenge of text-guided video inpainting by integrating optical flow as a motion prior into diffusion models. It introduces a dual-branch architecture with a time-agnostic flow completion branch and multi-scale flow adapters, augmented by an anchor-frame strategy and training-free latency reductions (latent interpolation and flow attention caching). Empirical results on background restoration and object removal show FloED achieving state-of-the-art quality and efficiency, with strong temporal coherence and text alignment. The approach offers practical impact by enabling faster, more coherent diffusion-based video inpainting and provides a public benchmark and code base for further research.

Abstract

Text-guided video inpainting has significantly improved the performance of content generation applications. Recent approaches build on diffusion models, which have become essential for achieving high-quality video inpainting results, yet they still face bottlenecks in temporal consistency and computational efficiency. This motivates us to propose a new video inpainting framework using optical Flow-guided Efficient Diffusion (FloED) for higher video coherence. Specifically, FloED employs a dual-branch architecture, where the time-agnostic flow branch first restores the corrupted flow, and the multi-scale flow adapters provide motion guidance to the main inpainting branch. In addition, a training-free latent interpolation method is proposed to accelerate the multi-step denoising process using flow warping. With the flow attention cache mechanism, FloED efficiently reduces the computational cost of incorporating optical flow. Extensive experiments on background restoration and object removal tasks show that FloED outperforms state-of-the-art diffusion-based methods in both quality and efficiency. Our code and models will be made publicly available.
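The abstract mentions latent interpolation via flow warping without giving details. The following is a minimal NumPy sketch of the general idea only, not the authors' implementation: a latent is backward-warped along a flow field with bilinear sampling, and a skipped frame is estimated by averaging its two warped neighbours. The function names `warp_latent` and `interpolate_middle`, the `(C, H, W)` layout, the flow convention, and the equal-weight average are all assumptions made for illustration.

```python
import numpy as np


def warp_latent(latent, flow):
    """Backward-warp a (C, H, W) latent with a (2, H, W) flow field.

    Assumed convention: flow[0] is the x displacement and flow[1] the y
    displacement, in latent pixels. Output pixel (y, x) is sampled
    bilinearly from the source at (y + flow[1], x + flow[0]).
    """
    _, h, w = latent.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Source coordinates, clamped to the border of the latent.
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)

    # Integer corners and fractional weights for bilinear sampling.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = sx - x0
    wy = sy - y0

    top = latent[:, y0, x0] * (1 - wx) + latent[:, y0, x1] * wx
    bottom = latent[:, y1, x0] * (1 - wx) + latent[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def interpolate_middle(prev_latent, next_latent, flow_to_prev, flow_to_next):
    """Estimate a skipped frame's latent from its two warped neighbours.

    Hypothetical helper: equal weighting is an illustrative choice.
    """
    from_prev = warp_latent(prev_latent, flow_to_prev)
    from_next = warp_latent(next_latent, flow_to_next)
    return 0.5 * (from_prev + from_next)
```

Under these assumptions, only a subset of frames would need a full denoising pass at a given step, with the remaining latents filled in by `interpolate_middle`; how FloED actually schedules this is not specified in the text above.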

Paper Structure

This paper contains 25 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of FloED. FloED employs a dual-branch architecture trained in two stages. In the first stage, we train only the upper branch, optimizing the motion layer to adapt it to the video inpainting domain. In the second stage, we introduce a time-agnostic flow branch together with a multi-scale flow adapter, which provides flow guidance to the up-blocks of the primary UNet. During inference, we improve efficiency by integrating the flow attention cache (right part).
  • Figure 2: Illustration of flow-guided latent interpolation (left) and warping operation (right) during the denoising process.
  • Figure 3: Qualitative comparisons. We compare FloED against diffusion-based SOTAs on BR and OR tasks.
  • Figure 4: Optical-flow-related ablation studies. Ablation (F) shows that FloED performs flow warping on the noise $\epsilon$ instead of $\mathbf{z}_0$.
  • Figure 5: User study. We conduct a user study with randomized presentation order to assess the inpainting results of 4 methods.
  • ...and 1 more figure
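The Figure 1 caption refers to a flow attention cache used at inference. As a hedged illustration of why such a cache can help, the sketch below assumes that the flow features do not depend on the denoising timestep (consistent with the "time-agnostic" flow branch), so their key/value projections can be computed once per layer and reused at every step. The class `FlowAttentionCache`, the function `cross_attention`, the layer name, and all shapes are hypothetical and are not taken from the FloED code.

```python
import numpy as np


class FlowAttentionCache:
    """Caches key/value projections of flow features, keyed by layer name."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # number of times projections were actually computed

    def get_kv(self, layer, flow_feat, w_k, w_v):
        # Project only on the first request for this layer; reuse afterwards.
        if layer not in self._store:
            self.misses += 1
            self._store[layer] = (flow_feat @ w_k, flow_feat @ w_v)
        return self._store[layer]


def cross_attention(query, key, value):
    """Single-head scaled dot-product attention over 2-D arrays."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value


# Usage: the query changes at every denoising step, the flow features do not.
rng = np.random.default_rng(0)
flow_feat = rng.normal(size=(5, 8))  # 5 flow tokens, 8 channels (toy sizes)
w_k = rng.normal(size=(8, 4))
w_v = rng.normal(size=(8, 4))
cache = FlowAttentionCache()

for step in range(3):
    query = rng.normal(size=(6, 4))  # stands in for the UNet features
    key, value = cache.get_kv("up_block_0", flow_feat, w_k, w_v)
    guided = cross_attention(query, key, value)
```

In this toy loop the projections are computed once rather than three times; the saving grows with the number of denoising steps and adapter layers.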