Infusion: internal diffusion for inpainting of dynamic textures and complex motion

Nicolas Cherel, Andrés Almansa, Yann Gousseau, Alasdair Newson

TL;DR

Infusion tackles video inpainting for dynamic textures using diffusion models trained solely on the input video (internal learning). It introduces interval training to decompose the diffusion process into manageable learning stages and employs a lightweight 3D UNet (~500k parameters) to ensure temporal coherence without optical flow. The approach yields strong perceptual and spatio-temporal fidelity on dynamic textures, with feasible training/inference times on a single GPU, and demonstrates robustness to complex motions and occlusions. Limitations include static backgrounds and single-use models, motivating future work on acceleration, broader content coverage, and handling extreme camera motion.
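To give a feel for how small a ~500k-parameter 3D network is, the sketch below counts the weights of a single 3D convolution. The 32-channel width and 3x3x3 kernel are hypothetical values chosen for illustration; the layer sizes actually used in the paper are not stated in this summary.

```python
def conv3d_params(c_in, c_out, kernel=3, bias=True):
    """Number of learnable parameters in one 3D convolution layer.

    Each output channel owns a (c_in x kernel x kernel x kernel) filter,
    plus one bias term when bias is enabled.
    """
    weights = c_in * c_out * kernel ** 3
    return weights + (c_out if bias else 0)


# Hypothetical layer: 32 channels in, 32 channels out, 3x3x3 kernel.
per_layer = conv3d_params(32, 32)

# How many such layers fit in a budget of roughly half a million parameters.
layers_in_budget = 500_000 // per_layer

print(per_layer, layers_in_budget)
```

Under these assumed sizes, a budget of half a million parameters covers only a modest stack of 3D convolutions, which is consistent with the "lightweight" description above.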

Abstract

Video inpainting is the task of filling a region in a video in a visually convincing manner. It is very challenging due to the high dimensionality of the data and the temporal consistency required for obtaining convincing results. Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos. Such models nonetheless remain very expensive to train and to run inference with, which strongly reduces their applicability to videos and yields unreasonable computational loads. We show that in the case of video inpainting, thanks to the highly self-similar nature of videos, the training data of a diffusion model can be restricted to the input video and still produce very satisfying results. With this internal learning approach, where the training data is limited to a single video, our lightweight models perform very well with only half a million parameters, in contrast to the very large networks with billions of parameters typically found in the literature. We also introduce a new method for efficient training and inference of diffusion models in the context of internal learning, by splitting the diffusion process into different learning intervals corresponding to different noise levels of the diffusion process. We show qualitative and quantitative results, demonstrating that our method reaches or exceeds state-of-the-art performance in the case of dynamic textures and complex dynamic backgrounds.
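The idea of splitting the diffusion process into learning intervals can be sketched as a schedule that alternates training and denoising, one noise-level interval at a time. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the even split of timesteps, and the uniform sampling of training timesteps are all hypothetical, and the real training and denoising steps are replaced by events.

```python
import random


def split_into_intervals(num_steps, num_intervals):
    """Partition timesteps num_steps..1 into contiguous intervals, noisiest first.

    Assumption: intervals are as equal in size as possible.
    """
    if not 1 <= num_intervals <= num_steps:
        raise ValueError("need 1 <= num_intervals <= num_steps")
    base, extra = divmod(num_steps, num_intervals)
    intervals = []
    hi = num_steps
    for i in range(num_intervals):
        size = base + (1 if i < extra else 0)
        lo = hi - size + 1
        intervals.append(range(hi, lo - 1, -1))  # hi, hi-1, ..., lo
        hi = lo - 1
    return intervals


def interval_schedule(num_steps, num_intervals, train_iters, seed=0):
    """Yield ("train", t) and ("denoise", t) events in execution order.

    For each interval, the network is first trained only on timesteps drawn
    from that interval, then used immediately for the reverse steps of the
    same interval, before moving on to the next (less noisy) interval.
    """
    rng = random.Random(seed)
    for interval in split_into_intervals(num_steps, num_intervals):
        for _ in range(train_iters):
            yield ("train", rng.choice(interval))  # placeholder for one training step
        for t in interval:
            yield ("denoise", t)  # placeholder for one reverse diffusion step


if __name__ == "__main__":
    for event in interval_schedule(num_steps=10, num_intervals=3, train_iters=2):
        print(event)
```

In this sketch each interval only ever sees its own noise levels during training, so the network used for a given reverse step has been specialised to that range just before it is applied.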

Paper Structure

This paper contains 19 sections, 6 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: We train a diffusion model on the input video only, using interval training. In interval training, the network is trained on a subset of timesteps, and inference for those timesteps is performed immediately after that training phase. Starting from $x_T$, the inpainting result is progressively generated with a perfectly adapted network.
  • Figure 2: In these scenes, our method satisfactorily inpaints dynamic textures preserving sharp details (top). It also better handles complex occlusions such as people crossing paths.
  • Figure 3: Interval training clearly improves the visual results, especially the finer details. Associated videos are found in the supplementary materials.
  • Figure 4: When generating content, our method may not guarantee long-term temporal consistency, especially for textures.
  • Figure 5: On all these challenging textures, our method better inpaints the missing region. The motions are also much better inpainted (see video in the supplementary materials).
  • ...and 5 more figures