NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

Tianlin Pan, Jiayi Dai, Chenpu Yuan, Zhengyao Lv, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu, Caifeng Shan, Chenyang Si

TL;DR

This paper proposes NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing that outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.

Abstract

Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.

Paper Structure (18 sections, 6 equations, 15 figures, 2 tables, 2 algorithms)

This paper contains 18 sections, 6 equations, 15 figures, 2 tables, 2 algorithms.

Figures (15)

  • Figure 1: We propose Sparse Control, Dense Synthesis, a multi-task video editing framework using sparse user-provided keyframe edits and dense information derived from the original video.
  • Figure 2: Challenges in Local Editing. Existing general video editing methods (e.g., VACE) and datasets (e.g., Senorita-2M) perform well on global editing tasks but struggle with local editing, often producing artifacts and inconsistent edits in the targeted areas.
  • Figure 3: The Limitations of Existing Schemes. Previous methods are often limited by requiring either costly per-video finetuning (I2VEdit, LoRA-Edit) or pre-training on large-scale paired video data (Senorita-2M, VACE), which is difficult to acquire. Our method decouples control and synthesis signals, enabling a self-supervised framework that learns from unpaired data while maintaining high fidelity to the source video.
  • Figure 4: Inconsistent backgrounds in a naive multi-keyframe approach. We designate frames 0, 20, 40, 60, and 80 as edited keyframes (anchors). In non-keyframes, the reconstructed background exhibits inconsistent textures (e.g., on the building wall) and implausible motion (e.g., in the trees), as the model hallucinates content without access to the original video.
  • Figure 5: Training Pipeline. (1) Center (Model Architecture): The core model learns to denoise the source video by processing conditional inputs through a Sparse Branch $\mathcal{S}$ and a Dense Branch $\mathcal{D}$, which interact via cross-attention; (2) Left (Anchored Control Pipe): A degraded reference video is generated by linearly interpolating between sparsely selected keyframes, providing sparse temporal control; (3) Right (Source Fidelity Pipe): A synthetic edited video is created using a cut-and-paste method to simulate realistic artifacts, serving as a dense synthesis target.
  • ...and 10 more figures
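
The Anchored Control Pipe in the Figure 5 caption builds a degraded reference video by linearly interpolating between sparsely selected keyframes. The sketch below illustrates one plausible reading of that step in NumPy; it is not the authors' implementation. The function name `interpolate_keyframes`, the `(T, H, W, C)` array layout, and the requirement that the first and last frames are keyframes are all assumptions made for this illustration.

```python
from typing import Sequence

import numpy as np


def interpolate_keyframes(video: np.ndarray, keyframe_idx: Sequence[int]) -> np.ndarray:
    """Return a degraded copy of `video` in which every non-keyframe is a
    linear blend of its two nearest keyframes.

    video:        float array of shape (T, H, W, C) (assumed layout).
    keyframe_idx: sorted, unique frame indices kept as anchors; assumed to
                  include frame 0 and frame T - 1.
    """
    num_frames = video.shape[0]
    idx = np.asarray(keyframe_idx)
    if idx[0] != 0 or idx[-1] != num_frames - 1:
        raise ValueError("keyframes must include the first and last frame")

    degraded = np.empty_like(video)
    for t in range(num_frames):
        # Position of the first keyframe at or after frame t.
        hi = np.searchsorted(idx, t, side="left")
        if idx[hi] == t:
            # Keyframes are copied unchanged.
            degraded[t] = video[t]
            continue
        left, right = idx[hi - 1], idx[hi]
        w = (t - left) / (right - left)  # blend weight in (0, 1)
        degraded[t] = (1.0 - w) * video[left] + w * video[right]
    return degraded


# Toy example: 81 frames whose pixel value is t**2, with keyframes every
# 20 frames (the anchor spacing mentioned in the Figure 4 caption).
frames = np.arange(81, dtype=np.float64) ** 2
video = np.broadcast_to(frames[:, None, None, None], (81, 2, 2, 3)).copy()
reference = interpolate_keyframes(video, [0, 20, 40, 60, 80])
```

In this toy example the keyframes are reproduced exactly, while the frames in between lose their true motion and content and become plain cross-fades. Under the degradation-simulation idea described in the abstract, the model would then be trained to recover the original video from such a degraded signal, which is why no paired edited video is needed.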