Raformer: Redundancy-Aware Transformer for Video Wire Inpainting

Zhong Ji, Yimu Su, Yan Zhang, Jiacheng Hou, Yanwei Pang, Jungong Han

TL;DR

This work targets Video Wire Inpainting (VWI), a challenging post-production task plagued by long, slender wires that interact with actors and scenes. It introduces WRV2, a large unedited dataset with authentic and pseudo wire masks, and proposes Pseudo Wire-Shaped (PWS) masks to better mimic real wire occlusions. The Raformer model combines a Redundancy-Aware Attention (RAA) module and a Soft Feature Alignment (SFA) module within a transformer framework to selectively discard redundant features and precisely align non-redundant content, achieving state-of-the-art results on WRV2 and strong performance on DAVIS. The results demonstrate improved visual fidelity, fewer artifacts, and better perceptual quality, underscoring the practical impact for efficient and reliable wire removal in film and television post-production.

Abstract

Video Wire Inpainting (VWI) is a prominent application in video inpainting, aimed at flawlessly removing wires in films or TV series, offering significant time and labor savings compared to manual frame-by-frame removal. However, wire removal is more challenging because wires are longer and slimmer than the objects typically targeted in general video inpainting tasks, and they often intersect people and background objects irregularly, which complicates the inpainting process. Recognizing the limitations of existing video wire datasets, which are characterized by their small size, poor quality, and limited variety of scenes, we introduce a new VWI dataset, Wire Removal Video Dataset 2 (WRV2), together with a novel mask generation strategy, Pseudo Wire-Shaped (PWS) Masks. The WRV2 dataset comprises over 4,000 videos with an average length of 80 frames, designed to facilitate the development and improve the efficacy of inpainting models. Building upon this, our research proposes the Redundancy-Aware Transformer (Raformer) method, which addresses the unique challenges of wire removal in video inpainting. Unlike conventional approaches that indiscriminately process all frame patches, Raformer employs a novel strategy to selectively bypass redundant parts, such as static background segments devoid of valuable information for inpainting. At the core of Raformer is the Redundancy-Aware Attention (RAA) module, which isolates and accentuates essential content through a coarse-grained, window-based attention mechanism. This is complemented by a Soft Feature Alignment (SFA) module, which refines these features and achieves end-to-end feature alignment. Extensive experiments on both traditional video inpainting datasets and our proposed WRV2 dataset demonstrate that Raformer outperforms other state-of-the-art methods.
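
The abstract describes RAA only at a high level, so the following is a minimal sketch of the general idea (score coarse windows, keep the relevant ones, bypass the rest), not the authors' implementation. The function names `window_pool` and `select_non_redundant`, the mask-weighted mean query, the dot-product score, and the parameters `win` and `keep` are all assumptions introduced for illustration.

```python
import numpy as np

def window_pool(feat, win):
    """Average-pool a (T, H, W, C) array into (T, H//win, W//win, C) window tokens."""
    t, h, w, c = feat.shape
    assert h % win == 0 and w % win == 0, "H and W must be divisible by win"
    return feat.reshape(t, h // win, win, w // win, win, c).mean(axis=(2, 4))

def select_non_redundant(feat, mask, win=4, keep=8):
    """Return sorted flat indices of the `keep` windows judged most useful.

    feat: (T, H, W, C) frame features.
    mask: (T, H, W), 1 where wire pixels must be filled in.
    Hypothetical scoring rule: windows that touch the mask come first; the
    remaining windows are ranked by similarity to a mask-weighted mean token.
    """
    tokens = window_pool(feat, win)                                    # (T, h, w, C)
    cover = window_pool(mask[..., None].astype(float), win)[..., 0]    # masked fraction per window
    flat_tokens = tokens.reshape(-1, tokens.shape[-1])
    flat_cover = cover.reshape(-1)

    # Coarse query: average of window tokens, weighted by how much mask each holds.
    query = (flat_tokens * flat_cover[:, None]).sum(axis=0) / max(flat_cover.sum(), 1e-8)
    scores = flat_tokens @ query / np.sqrt(flat_tokens.shape[-1])

    # Give windows containing masked pixels the highest possible score.
    scores = np.where(flat_cover > 0, np.inf, scores)
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:keep])

# Usage: 2 frames of 8x8 features, a horizontal "wire" in frame 0.
rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 8, 8, 4))
mask = np.zeros((2, 8, 8))
mask[0, 3, :] = 1
kept = select_non_redundant(feat, mask, win=4, keep=4)
print(kept)  # indices of the windows that a later fine-grained attention step would see
```

In this sketch the windows left out of `kept` are simply never passed on, which is one simple way to realize "bypassing" static background. Note that if `keep` is smaller than the number of windows touching the mask, some masked windows are dropped as well.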

Paper Structure (26 sections, 16 equations, 9 figures, 6 tables, 1 algorithm)

Figures (9)

  • Figure 1: Comparative Illustration: (a) DAVIS Dataset vs. (b) Our WRV2 Dataset. The upper panel displays typical object masks from the DAVIS dataset, while the lower panel highlights our dataset with a specific focus on wire masks, necessitating removal in film and television post-production.
  • Figure 2: The main idea of Raformer, where four frames are illustrated. The masked region includes the wires that require completion, the non-redundant patch contains valuable information for achieving video wire inpainting, and the redundant patch should be eliminated as it is unnecessary. It is worth noting that non-redundant patches are not fixedly limited to those that contain wires.
  • Figure 3: Overview of both the (a) WRV and (b) WRV2 datasets from three perspectives: View, Scenario, and ProdTech (an abbreviation for Production Technique).
  • Figure 4: Description of the dataset image. The first and third rows feature scene images, while the second and fourth rows present their corresponding mask annotations.
  • Figure 5: Illustration of different types of masks, including Polygonal Pseudo Masks, Authentic Wire Masks, and our proposed Pseudo Wire-Shaped Masks, with the latter showcasing masks produced under two different parameter settings.
  • ...and 4 more figures
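
The Figure 5 caption contrasts polygonal pseudo masks with thin, wire-like ones. As a rough illustration of what a wire-shaped training mask could look like, the sketch below rasterizes a random quadratic Bezier curve with a chosen thickness. This is an assumed stand-in, not the paper's PWS procedure: the function name `pseudo_wire_mask`, the Bezier parameterization, and the `thickness`, `n_points`, and `seed` parameters are all hypothetical.

```python
import numpy as np

def pseudo_wire_mask(height, width, thickness=2, n_points=400, seed=0):
    """Return a (height, width) uint8 mask with one thin curved stroke set to 1.

    Hypothetical generator: samples a quadratic Bezier curve between three
    random control points and stamps a small disk at each sample.
    """
    rng = np.random.default_rng(seed)
    # Three random control points as (x, y), all inside the image.
    p0, p1, p2 = rng.uniform([0, 0], [width - 1, height - 1], size=(3, 2))

    t = np.linspace(0.0, 1.0, n_points)[:, None]
    curve = (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2   # (n_points, 2)

    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=np.uint8)
    radius_sq = (thickness / 2) ** 2
    for cx, cy in curve:
        # Mark every pixel within thickness / 2 of this curve sample.
        mask[(xs - cx) ** 2 + (ys - cy) ** 2 <= radius_sq] = 1
    return mask

# Usage: the same curve under two parameter settings (thin vs. thick wire).
thin = pseudo_wire_mask(64, 64, thickness=2, seed=1)
thick = pseudo_wire_mask(64, 64, thickness=6, seed=1)
print(thin.sum(), thick.sum())
```

Because both calls share a seed, they trace the same curve, so the thicker mask covers every pixel of the thinner one. Varying `thickness` and the control points is one plausible way to obtain masks "under different parameter settings" in the sense of the caption.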