Raformer: Redundancy-Aware Transformer for Video Wire Inpainting
Zhong Ji, Yimu Su, Yan Zhang, Jiacheng Hou, Yanwei Pang, Jungong Han
TL;DR
This work targets Video Wire Inpainting (VWI), a challenging post-production task complicated by long, slender wires that interact with actors and scenes. It introduces WRV2, a large unedited dataset with authentic and pseudo wire masks, and proposes Pseudo Wire-Shaped (PWS) masks to better mimic real wire occlusions. The Raformer model combines a Redundancy-Aware Attention (RAA) module and a Soft Feature Alignment (SFA) module within a transformer framework to selectively discard redundant features and precisely align non-redundant content, achieving state-of-the-art results on WRV2 and strong performance on DAVIS. The results show improved visual fidelity, fewer artifacts, and better perceptual quality, underscoring the method's practical value for efficient and reliable wire removal in film and television post-production.
Abstract
Video Wire Inpainting (VWI) is a prominent application of video inpainting, aimed at flawlessly removing wires in films or TV series, and it offers significant time and labor savings compared to manual frame-by-frame removal. However, wire removal poses greater challenges than general video inpainting: wires are longer and slimmer than the objects typically targeted, and they often intersect people and background objects irregularly, which adds complexity to the inpainting process. Existing video wire datasets are limited by their small size, poor quality, and narrow variety of scenes. To address these limitations, we introduce a new VWI dataset, Wire Removal Video Dataset 2 (WRV2), together with a novel mask generation strategy, Pseudo Wire-Shaped (PWS) Masks. The WRV2 dataset comprises over 4,000 videos with an average length of 80 frames and is designed to facilitate the development of effective inpainting models. Building upon this, we propose the Redundancy-Aware Transformer (Raformer), a method that addresses the unique challenges of wire removal in video inpainting. Unlike conventional approaches that indiscriminately process all frame patches, Raformer selectively bypasses redundant parts, such as static background segments that carry no valuable information for inpainting. At the core of Raformer is the Redundancy-Aware Attention (RAA) module, which isolates and accentuates essential content through a coarse-grained, window-based attention mechanism. This is complemented by a Soft Feature Alignment (SFA) module, which refines these features and achieves end-to-end feature alignment. Extensive experiments on both traditional video inpainting datasets and our proposed WRV2 dataset demonstrate that Raformer outperforms other state-of-the-art methods.
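To make the idea of bypassing redundant windows concrete, the following is a minimal sketch and not the RAA module itself. The abstract does not say how redundancy is measured, so the sketch assumes a simple stand-in: a window counts as non-redundant if its mean absolute difference from the previous frame exceeds a threshold, or if it overlaps the wire mask. The names `window_scores`, `select_windows`, `win`, and `threshold` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Frame = List[List[float]]  # grayscale frame stored as rows of pixel values
Mask = List[List[int]]     # 1 marks a wire pixel, 0 marks background


def window_scores(prev: Frame, curr: Frame, win: int) -> Dict[Tuple[int, int], float]:
    """Mean absolute temporal difference for each win x win window.

    Assumption: temporal change is used as a proxy for "non-redundant";
    the paper's actual criterion is not given in the abstract.
    """
    h, w = len(curr), len(curr[0])
    scores = {}
    for top in range(0, h, win):
        for left in range(0, w, win):
            total, count = 0.0, 0
            for y in range(top, min(top + win, h)):
                for x in range(left, min(left + win, w)):
                    total += abs(curr[y][x] - prev[y][x])
                    count += 1
            scores[(top // win, left // win)] = total / count
    return scores


def select_windows(prev: Frame, curr: Frame, mask: Mask,
                   win: int, threshold: float) -> List[Tuple[int, int]]:
    """Return (row, col) indices of windows kept for further processing.

    A window is kept if it changed over time or overlaps the wire mask;
    all other windows are treated as redundant and skipped.
    """
    kept = []
    for (r, c), score in sorted(window_scores(prev, curr, win).items()):
        touches_mask = any(
            mask[y][x]
            for y in range(r * win, min((r + 1) * win, len(mask)))
            for x in range(c * win, min((c + 1) * win, len(mask[0])))
        )
        if score > threshold or touches_mask:
            kept.append((r, c))
    return kept


# Toy 4x4 example with 2x2 windows.
prev = [[0.0] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
curr[0][0] = 1.0                     # motion in the top-left window
mask = [[0] * 4 for _ in range(4)]
mask[3][3] = 1                       # wire pixel in the bottom-right window
print(select_windows(prev, curr, mask, win=2, threshold=0.1))
# -> [(0, 0), (1, 1)]
```

In this toy case the two static, unmasked windows are dropped, so any later attention step would only need to process half of the windows. A learned model would presumably score windows from features rather than raw pixel differences.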
