INVE: Interactive Neural Video Editing

Jiahui Huang; Leonid Sigal; Kwang Moo Yi; Oliver Wang; Joon-Young Lee

TL;DR

Interactive Neural Video Editing (INVE) addresses the challenge of propagating single-frame edits across a video consistently and in real time. It builds on Layered Neural Atlases (LNA) by adding a bidirectional atlas–frame mapping, vectorized sketching, and hash-grid encodings that accelerate learning and inference. The method learns an inverse mapping for each atlas layer to support robust point tracking, and it enables layered editing with sketch, texture, and metadata layers. Reported results show a roughly 5× reduction in learning and inference time relative to LNA and rendering speeds of around 25 FPS on an RTX 4090, with improved editability over LNA, demonstrated on DAVIS and custom videos.

Abstract

We present Interactive Neural Video Editing (INVE), a real-time video editing solution that assists the video editing process by consistently propagating sparse frame edits to the entire video clip. Our method is inspired by the recent work on Layered Neural Atlases (LNA). LNA, however, suffers from two major drawbacks: (1) the method is too slow for interactive editing, and (2) it offers insufficient support for some editing use cases, including direct frame editing and rigid texture tracking. To address these challenges, we adopt highly efficient network architectures, powered by hash-grid encoding, to substantially improve processing speed. In addition, we learn bi-directional functions between image and atlas and introduce vectorized editing, which together enable a much greater variety of edits, both in the atlas and directly in the frames. Compared to LNA, INVE reduces the learning and inference time by a factor of 5 and supports various video editing operations that LNA cannot. We showcase the superiority of INVE over LNA in interactive video editing through a comprehensive quantitative and qualitative analysis. For video results, please see https://gabriel-huang.github.io/inve/

Paper Structure (21 sections, 6 equations, 9 figures)


Introduction
Related Works
Video Effects Editing
Video Propagation
Implicit Neural Representation
Implicit Neural Representation
Interactive Neural Video Editing (INVE)
Review of Layered Neural Atlases
Boosted Training & Inference Speed
Inverse Mapping for point tracking on videos
Layered Editing
Vectorized Sketching
Implementation Details
Early Stopping.
Details.
...and 6 more sections

Figures (9)

Figure 1: INVE can propagate multiple types of image editing effects to the entire video in a consistent manner. In this case, the edits consist of (1) adding external graphics (dog picture) to the jeep; (2) applying local adjustments (Hue -20, Brightness +10) to the forest in the background; (3) sketching on the road using the brush tool. All these types of edits can be propagated instantly from one frame to all other frames using the proposed approach.
Figure 2: Our forward mapping pipeline (solid lines) closely follows LNA's approach. Each video pixel location $(x, y, t)$ is fed into two mapping networks, $\mathbb{M}_f$ and $\mathbb{M}_b$, to predict $(u, v)$ coordinates on each atlas. These coordinates are then fed into the atlas network $\mathbb{A}$ to predict the RGB color on that atlas. Finally, we use the opacity value $\alpha$ predicted by the alpha network $\mathbb{M}_a$ to compose the reconstructed color at location $(x,y,t)$. Our backward mapping pipeline (dotted lines) maps atlas coordinates to video coordinates: it takes a $(u, v)$ coordinate and the target frame index $t$ as input, and predicts the pixel location $(x,y,t)$. With the forward and backward pipelines combined, we can achieve long-range point tracking on videos.
Figure 3: Convergence Speed Comparison. Given the same number of training iterations, both the reconstruction quality (measured by the reconstruction loss) and the mapping accuracy (measured by the optical flow loss) of our model converge faster than LNA's.
Figure 4: Vectorized Sketching. The user sketches directly on the frame. The mouse tracks $\left\{(x_{i}, y_{i})\right\}$ that define these sketches are mapped to atlas coordinates $\left\{(u_{i}, v_{i})\right\}$, and the mapped tracks are then used to render polylines on the atlas edit layer.
Figure 5: Our vectorized sketching allows users to perform sketch editing directly on frames, free from resampling artifacts (left), whereas frame editing using LNA's pipeline results in either inconsistent color (middle) or discontinuous sketches (right).
...and 4 more figures
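
The forward and backward pipelines in the Figure 2 caption, and the track mapping in the Figure 4 caption, can be illustrated with a small sketch. This is not the authors' implementation: the learned mapping networks are replaced by hypothetical closed-form translations (`forward_map`, `backward_map`, with an illustrative `VELOCITY`), and `composite` assumes standard alpha blending of a foreground atlas color over a background atlas color. All function names here are assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical stand-in for a learned mapping network: the layer content
# drifts with constant velocity, so (u, v) = (x, y) - VELOCITY * t.
VELOCITY = np.array([2.0, -1.0])  # illustrative value, not from the paper


def forward_map(xy, t):
    """Map video pixel location(s) (x, y) at frame t to atlas coordinates (u, v)."""
    return np.asarray(xy, dtype=float) - VELOCITY * t


def backward_map(uv, t):
    """Map atlas coordinate(s) (u, v) to pixel location(s) (x, y) in frame t."""
    return np.asarray(uv, dtype=float) + VELOCITY * t


def composite(fg_rgb, bg_rgb, alpha):
    """Blend foreground and background atlas colors (assumed alpha-blending rule)."""
    fg = np.asarray(fg_rgb, dtype=float)
    bg = np.asarray(bg_rgb, dtype=float)
    return alpha * fg + (1.0 - alpha) * bg


def track_point(xy, t_src, t_dst):
    """Track a point from frame t_src to frame t_dst by going through the atlas."""
    uv = forward_map(xy, t_src)      # frame -> atlas
    return backward_map(uv, t_dst)   # atlas -> target frame


def sketch_to_atlas(points_xy, t):
    """Map an (N, 2) mouse track drawn on frame t to atlas polyline vertices."""
    return forward_map(np.asarray(points_xy, dtype=float), t)


tracked = track_point((10.0, 20.0), t_src=0, t_dst=5)
color = composite((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), alpha=0.25)
polyline_uv = sketch_to_atlas([(0.0, 0.0), (4.0, 4.0)], t=1)
```

Because the sketch is stored as atlas-space vertices rather than as rasterized pixels, it can be re-rendered as a polyline at any frame by mapping the vertices back with `backward_map`, which is the intuition behind avoiding resampling artifacts.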
