RNA: Video Editing with ROI-based Neural Atlas

Jaekyeong Lee, Geonung Kim, Sunghyun Cho

TL;DR

This work addresses the challenge of temporally consistent video editing by introducing ROI-based Neural Atlas (RNA), which edits a user-specified region using a single, soft, ROI-centric atlas and avoids foreground segmentation. RNA jointly learns MLP-based mappings for the ROI (mask), atlas, and geometry, guided by a composite self-supervised loss that enforces reconstruction, rigidity, and temporal consistency, while a mask refinement step resolves occlusion-related errors. A soft neural atlas model and matting-based blending enable high-quality edge transitions between edited ROI content and occluding objects, reducing artifacts such as ghosting and boundary darkening. The proposed approach delivers practical, efficient editing across videos with complex motion and multiple moving objects, demonstrated through quantitative and qualitative experiments on challenging scenes. The result is a flexible, user-friendly editing framework that scales with scene complexity and offers improved reconstruction and rendering quality compared with prior atlas- and propagation-based methods.

Abstract

With the recent growth of video-based Social Network Service (SNS) platforms, the demand for video editing among everyday users has increased. However, video editing can be challenging due to temporally varying factors such as camera movement and moving objects. While modern atlas-based video editing methods have addressed these issues, they often fail on videos with complex motion or multiple moving objects, and they demand excessive computational cost even for very simple edits. In this paper, we propose a novel region-of-interest (ROI)-based video editing framework: ROI-based Neural Atlas (RNA). Unlike prior work, RNA allows users to specify editing regions, simplifying the editing process by removing the need for foreground separation and atlas modeling for foreground objects. However, this simplification presents a unique challenge: acquiring a mask that effectively handles occlusions in the edited area caused by moving objects, without relying on an additional segmentation model. To tackle this, we propose a novel mask refinement approach designed for this specific challenge. Moreover, we introduce a soft neural atlas model for video reconstruction to ensure high-quality editing results. Extensive experiments show that RNA offers a more practical and efficient editing solution, applicable to a wider range of videos with superior quality compared to prior methods.

Paper Structure

This paper contains 31 sections, 15 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: (b) Our video editing achieves natural editing outcomes, successfully considering the occlusions from the thin chain (2nd row) and the toy horse of the carousel (3rd row). In contrast, (c) Hashing NVD results in ghosting artifacts, and (d) CoDeF neglects the occlusion from moving objects, failing to produce natural editing results.
  • Figure 2: Overall framework of RNA. For video editing, (a) a user selects a reference frame from an input video and specifies an ROI where they want to edit. (b) For the specified ROI, our method estimates a 2D atlas representing its temporally-invariant appearance. (c) Then, the user edits the 2D atlas. (d) Finally, an edited video is reconstructed from the edited atlas and the input video.
  • Figure 3: (b) Atlas estimation with $\mathcal{L}_{pos}$ provides a user-friendly interface for editing, (c) while, without $\mathcal{L}_{pos}$, the estimated 2D atlas can be severely distorted, leading to less intuitive editing.
  • Figure 4: Magnified masks and editing results with and without mask refinement. Without mask refinement, the mask is noticeably inaccurate at the boundaries of occluding objects.
  • Figure 5: A point $p$, which is outside of the ROI, is mapped into a plausible position within the ROI by $\mathbb{T}$.
  • ...and 10 more figures
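
Step (d) of Figure 2, reconstructing an edited frame from the edited atlas and the input video, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_atlas` and `composite_frame`, the per-pixel atlas coordinates `uv` in [0, 1], and the soft mask `alpha` are all assumptions made for illustration. In RNA these quantities come from learned MLP mappings; here they are plain arrays, and a nearest-neighbor lookup stands in for whatever sampling the method actually uses.

```python
import numpy as np


def sample_atlas(atlas, uv):
    """Look up atlas colors at per-pixel coordinates (nearest neighbor).

    atlas: (h, w, C) edited atlas image.
    uv:    (H, W, 2) hypothetical atlas coordinates in [0, 1], ordered (x, y).
    """
    h, w = atlas.shape[:2]
    x = np.clip(np.rint(uv[..., 0] * (w - 1)).astype(int), 0, w - 1)
    y = np.clip(np.rint(uv[..., 1] * (h - 1)).astype(int), 0, h - 1)
    return atlas[y, x]


def composite_frame(frame, edited_roi, alpha):
    """Blend edited ROI content over the original frame with a soft mask.

    alpha is 1 where the ROI is visible, 0 where it is occluded or outside
    the ROI, and fractional at soft boundaries.
    """
    frame = np.asarray(frame, dtype=np.float64)
    edited_roi = np.asarray(edited_roi, dtype=np.float64)
    alpha = np.clip(np.asarray(alpha, dtype=np.float64), 0.0, 1.0)
    if alpha.ndim == frame.ndim - 1:
        alpha = alpha[..., None]  # broadcast the mask over color channels
    return alpha * edited_roi + (1.0 - alpha) * frame


# Toy usage: a 2x2 frame, a 2x2 edited atlas, and a soft mask.
frame = np.zeros((2, 2, 3))
atlas = np.ones((2, 2, 3))
uv = np.array([[[0.0, 0.0], [1.0, 0.0]],
               [[0.0, 1.0], [1.0, 1.0]]])
alpha = np.array([[1.0, 0.0],
                  [0.5, 0.25]])
edited = composite_frame(frame, sample_atlas(atlas, uv), alpha)
```

Fractional mask values are what let edited content fade smoothly into an occluding object's boundary; a hard 0/1 mask would reduce the blend to a cut-and-paste.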