Elevating Flow-Guided Video Inpainting with Reference Generation

Suhwan Cho, Seoung Wug Oh, Sangyoun Lee, Joon-Young Lee

TL;DR

This work tackles video inpainting by decoupling content propagation from content generation, enabling high-quality, temporally consistent edits even at resolutions above 2K. The proposed RGVI framework combines a one-shot pixel pulling propagation method with Stable Diffusion-based reference generation, guided by a key-frame selection strategy and reinforced by occlusion-aware masking and propagation verification. Key contributions include a propagation mechanism that avoids re-sampling artifacts, a diffusion-model-driven reference generation scheme, and the HQVI benchmark for realistic evaluation of VI methods. The results show substantial gains in perceptual quality and scalability, indicating RGVI's practical potential for real-world video editing tasks.
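
To make the re-sampling claim concrete, the toy sketch below contrasts two ways of moving a 1D signal by a sub-pixel displacement over several steps: resampling the previous output at every step, versus accumulating the displacement first and sampling the source once. This is not the authors' pixel pulling algorithm; the function names, the constant per-step shift, and the 1D linear interpolation are all assumptions chosen for illustration.

```python
import numpy as np

def sample_linear(signal, positions):
    """Sample a 1D signal at fractional positions (linear interpolation, edge-clamped)."""
    return np.interp(positions, np.arange(len(signal)), signal)

def propagate_step_by_step(signal, shift, steps):
    """Hypothetical baseline: each step resamples the previous step's output."""
    grid = np.arange(len(signal), dtype=float)
    out = signal.astype(float)
    for _ in range(steps):
        # Every pass interpolates an already interpolated signal,
        # so the blur from each pass compounds.
        out = sample_linear(out, grid - shift)
    return out

def propagate_one_shot(signal, shift, steps):
    """Hypothetical one-shot variant: accumulate displacement, then sample the source once."""
    grid = np.arange(len(signal), dtype=float)
    total_shift = shift * steps  # sub-pixel precision is kept in the offset itself
    return sample_linear(signal, grid - total_shift)

# A single bright pixel moved by 8 half-pixel steps (4 pixels in total).
source = np.zeros(64)
source[10] = 1.0
blurred = propagate_step_by_step(source, shift=0.5, steps=8)
sharp = propagate_one_shot(source, shift=0.5, steps=8)
print("step-by-step peak:", blurred.max())
print("one-shot peak:    ", sharp.max())
```

In this toy setting the step-by-step result spreads the bright pixel over its neighbours, while the one-shot result interpolates only once. Real optical flow varies per pixel and per frame, so accumulating displacements is more involved than multiplying a constant shift.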

Abstract

Video inpainting (VI) is a challenging task that requires effective propagation of observable content across frames while simultaneously generating new content not present in the original video. In this study, we propose a robust and practical VI framework that leverages a large generative model for reference generation in combination with an advanced pixel propagation algorithm. Powered by a strong generative model, our method not only significantly enhances frame-level quality for object removal but also synthesizes new content in the missing areas based on user-provided text prompts. For pixel propagation, we introduce a one-shot pixel pulling method that effectively avoids error accumulation from repeated sampling while maintaining sub-pixel precision. To evaluate various VI methods in realistic scenarios, we also propose a high-quality VI benchmark, HQVI, comprising carefully generated videos using alpha matte composition. On public benchmarks and the HQVI dataset, our method demonstrates significantly higher visual quality and metric scores compared to existing solutions. Furthermore, it can process high-resolution videos exceeding 2K resolution with ease, underscoring its superiority for real-world applications.
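
The abstract states that HQVI videos are built with alpha matte composition, but it does not describe the pipeline. As a rough sketch, the standard compositing equation `I = alpha * F + (1 - alpha) * B` blends a foreground `F` over a clean background `B`, so `B` can serve as ground truth when the foreground is removed. The function names, the threshold parameter, and the array layout below are assumptions for illustration, not the dataset's actual tooling.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Standard alpha compositing: alpha * F + (1 - alpha) * B.

    foreground, background: float arrays of shape (H, W, 3) in [0, 1].
    alpha: float array of shape (H, W) in [0, 1].
    """
    a = alpha[..., None]  # broadcast the matte over the colour channels
    return a * foreground + (1.0 - a) * background

def removal_mask(alpha, threshold=0.0):
    """Hypothetical helper: mark every pixel the foreground touches at all."""
    return alpha > threshold

# Tiny example: a 4x4 frame with a soft-edged foreground object.
background = np.full((4, 4, 3), 0.2)
foreground = np.full((4, 4, 3), 0.8)
alpha = np.zeros((4, 4))
alpha[1:3, 1:3] = 1.0   # opaque interior
alpha[0, 1] = 0.5       # one semi-transparent boundary pixel

frame = composite(foreground, background, alpha)
mask = removal_mask(alpha)
print(frame[..., 0])
print(mask.sum(), "pixels to inpaint")
```

Because the background is known exactly, an inpainted result inside `mask` can be compared against `background` with any full-reference metric.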

Paper Structure

This paper contains 15 sections, 5 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Qualitative comparison between RGVI and state-of-the-art methods.
  • Figure 2: Overall pipeline of RGVI.
  • Figure 3: RGVI outputs from the generation mode.
  • Figure 4: Example videos from the HQVI dataset. Negative masks are highlighted in green, while positive masks are highlighted in red.
  • Figure 5: RGVI outputs on the video restoration scenarios.
  • ...and 2 more figures