Neural Video Fields Editing

Shuzhou Yang, Chong Mou, Jiwen Yu, Yuhan Wang, Xiandong Meng, Jian Zhang

TL;DR

NVEdit tackles the GPU memory overhead and temporal inconsistency of long-video editing by learning a Neural Video Field (NVF), built on tri-plane encoding and sparse grids, that captures the video's temporal priors; the field is then edited with off-the-shelf diffusion-based T2I models guided by user prompts. The two-stage pipeline first fits the video from randomly sampled pixels to learn these priors, then edits frame by frame using pseudo-ground-truths, with a progressive optimization strategy to preserve temporal coherence; an auxiliary mask (yielding IP2P+) further improves local editing precision. Results show that NVEdit can edit hundreds of frames with high inter-frame consistency and supports frame interpolation without fine-tuning, while remaining memory-efficient relative to frame-based diffusion methods. The framework is modular: both the NVF components and the T2I model can be replaced or upgraded, so it adapts to diverse editing tasks and future research directions.

Abstract

Diffusion models have revolutionized text-driven video editing. However, applying these methods to real-world editing encounters two significant challenges: (1) the rapid increase in GPU memory demand as the number of frames grows, and (2) the inter-frame inconsistency in edited videos. To this end, we propose NVEdit, a novel text-driven video editing framework designed to mitigate memory overhead and improve editing consistency for real-world long videos. Specifically, we construct a neural video field, powered by a tri-plane and a sparse grid, to encode long videos with hundreds of frames in a memory-efficient manner. Next, we update the video field through off-the-shelf Text-to-Image (T2I) models to impart text-driven editing effects. A progressive optimization strategy is developed to preserve the original temporal priors. Importantly, both the neural video field and the T2I model are adaptable and replaceable, thus inspiring future research. Experiments demonstrate the ability of our approach to edit hundreds of frames with impressive inter-frame consistency. Our project is available at: https://nvedit.github.io/.
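
To make the memory argument in the abstract concrete, here is a minimal pure-Python sketch of a generic tri-plane lookup. It is not the authors' implementation: the plane names (`xy`, `xt`, `yt`), the bilinear interpolation, the concatenation of the three feature vectors, and the resolutions used below are all assumptions for illustration, and the sparse-grid component is omitted entirely.

```python
from typing import List, Tuple

Plane = List[List[List[float]]]  # plane[i][j] -> feature vector of length C


def make_plane(rows: int, cols: int, channels: int, fill: float) -> Plane:
    """Create a rows x cols feature plane with every feature set to `fill`."""
    return [[[fill] * channels for _ in range(cols)] for _ in range(rows)]


def bilinear(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly interpolate a feature vector at normalized coords u, v in [0, 1]."""
    rows, cols = len(plane), len(plane[0])
    u = min(max(u, 0.0), 1.0)  # clamp so indices stay inside the plane
    v = min(max(v, 0.0), 1.0)
    r, c = u * (rows - 1), v * (cols - 1)
    r0, c0 = int(r), int(c)
    r1, c1 = min(r0 + 1, rows - 1), min(c0 + 1, cols - 1)
    fr, fc = r - r0, c - c0
    out = []
    for k in range(len(plane[0][0])):
        top = plane[r0][c0][k] * (1 - fc) + plane[r0][c1][k] * fc
        bot = plane[r1][c0][k] * (1 - fc) + plane[r1][c1][k] * fc
        out.append(top * (1 - fr) + bot * fr)
    return out


def triplane_features(xy: Plane, xt: Plane, yt: Plane,
                      x: float, y: float, t: float) -> List[float]:
    """Look up one space-time point on three 2D planes and concatenate the results.

    Concatenation is an assumption; sums or products are also common choices.
    """
    return bilinear(xy, x, y) + bilinear(xt, x, t) + bilinear(yt, y, t)


def param_counts(X: int, Y: int, T: int, C: int) -> Tuple[int, int]:
    """Feature counts for a dense X*Y*T grid versus three 2D planes."""
    dense = X * Y * T * C
    tri = (X * Y + X * T + Y * T) * C
    return dense, tri


# Hypothetical resolutions: a 64 x 64 spatial grid, 100 time steps, 8 channels.
print(param_counts(64, 64, 100, 8))  # (3276800, 135168)
```

In this sketch the dense grid grows with the product X·Y·T, while the three planes grow only with XY + XT + YT, so adding frames enlarges just the two time-indexed planes. A small decoder network would normally map the concatenated features to a color; that part is left out here.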

Paper Structure (14 sections, 8 equations, 8 figures, 3 tables)

This paper contains 14 sections, 8 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: NVEdit enables various editing options, including shape variation, scene change, and style transfer, while preserving original motion and semantic layout. Due to its efficient encoding rate, long videos with hundreds of frames can even be edited well.
  • Figure 2: Workflow of our NVEdit. It contains two stages: video fitting and field editing. As shown in the left part, we first train a neural video field to fit a given video for temporal priors. Then, on the right, the rendered frame is edited and used to optimize the trained field to impart editing effects. As the video field has learned temporal priors, the optimized field consistently renders a video with edited content.
  • Figure 3: Visualization of the original image, the results of IP2P, the proposed auxiliary mask, and our edited results (IP2P+).
  • Figure 4: Visual comparison between our method and other SOTA approaches. IP2P fails to output consistent results, as shown by the differences in the regions indicated by arrows. Other methods either distort the shape, mistake editing regions, or fail to respond to varying viewpoints. Our approach not only generates temporally coherent content but also precisely controls the area to be edited.
  • Figure 5: Comparison of GPU memory overhead. As the number of frames increases, our method adds only minimal memory overhead, which allows it to edit long videos. Although CoDeF consumes less GPU memory, NVEdit achieves better performance, as illustrated in \ref{compare}, and it is still memory-efficient compared to most existing methods.
  • ...and 3 more figures
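
The two-stage workflow described for Figure 2 (video fitting, then field editing) can be illustrated with a deliberately tiny toy, sketched here under strong assumptions. `ToyField` is a hypothetical additive model standing in for the neural video field, `edit_frame` is an arbitrary per-frame function standing in for the T2I editor, and the progressive optimization strategy and auxiliary mask are not modelled at all.

```python
import random
from typing import Callable, List, Tuple

Video = List[List[float]]  # video[t][p]: intensity of pixel p in frame t


class ToyField:
    """Hypothetical stand-in for a video field: value(t, p) = base[p] + motion[t]."""

    def __init__(self, num_frames: int, num_pixels: int) -> None:
        self.base = [0.0] * num_pixels
        self.motion = [0.0] * num_frames

    def render(self, t: int) -> List[float]:
        """Render every pixel of frame t from the shared parameters."""
        return [b + self.motion[t] for b in self.base]

    def step(self, t: int, p: int, target: float, lr: float) -> None:
        """One SGD step on the squared error of a single pixel."""
        err = self.base[p] + self.motion[t] - target
        self.base[p] -= lr * 2.0 * err
        self.motion[t] -= lr * 2.0 * err


def fit(field: ToyField, targets: Video, steps: int, lr: float,
        rng: random.Random) -> None:
    """Optimize the field on randomly sampled pixels of `targets`."""
    for _ in range(steps):
        t = rng.randrange(len(targets))
        p = rng.randrange(len(targets[0]))
        field.step(t, p, targets[t][p], lr)


def mse(field: ToyField, targets: Video) -> float:
    """Mean squared error between the rendered video and `targets`."""
    errs = [(v - w) ** 2
            for t, frame in enumerate(targets)
            for v, w in zip(field.render(t), frame)]
    return sum(errs) / len(errs)


def edit_video(video: Video, edit_frame: Callable[[List[float]], List[float]],
               steps: int = 2000, lr: float = 0.1,
               seed: int = 0) -> Tuple[ToyField, Video]:
    """Stage 1: fit the field to the video. Stage 2: refit it to edited frames."""
    rng = random.Random(seed)
    field = ToyField(len(video), len(video[0]))
    fit(field, video, steps, lr, rng)  # stage 1: video fitting
    # Edit each rendered frame once to obtain pseudo-ground-truths.
    pseudo_gt = [edit_frame(field.render(t)) for t in range(len(video))]
    fit(field, pseudo_gt, steps, lr, rng)  # stage 2: field editing
    return field, pseudo_gt


# A 4-frame, 5-pixel "video" and a brightening edit (both made up for this demo).
video = [[0.1 * p + 0.05 * t for p in range(5)] for t in range(4)]
field, pseudo_gt = edit_video(video, lambda frame: [v + 0.5 for v in frame])
print(mse(field, pseudo_gt) < 1e-6)
```

Because every frame is rendered from the same shared parameters, the edit is absorbed into the field rather than stored per frame. That is the intuition behind editing a fitted field instead of individual frames; the real method relies on a far more expressive field and a diffusion-based editor.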