LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization

Zhihan Xiao, Lin Liu, Yixin Gao, Xiaopeng Zhang, Haoxuan Che, Songping Mai, Qi Tian

TL;DR

LoVoRA addresses text-guided video editing without manual masks by introducing a learnable object-aware localization mechanism and a Diffusion Mask Predictor. The framework is trained on a purpose-built, temporally supervised dataset synthesized from NHR-Edit with optical-flow-guided mask propagation and VACE inpainting, enabling end-to-end, mask-free editing at inference. Empirical results show better spatial precision, temporal stability, and alignment with textual prompts than strong baselines, supported by ablations on both model components and dataset construction. The work offers a scalable path toward robust, instruction-driven video edits without auxiliary control signals, with an open-source dataset to foster further research.

Abstract

Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using a learnable object-aware localization mechanism. Our approach uses a dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA. Project page: https://cz-5f.github.io/LoVoRA.github.io
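
The optical flow-based mask propagation mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `propagate_mask` and `propagate_through_video`, the `(dx, dy)` flow layout, and the nearest-pixel forward warping are all assumptions chosen for illustration. A real pipeline would take the flow from an optical flow estimator and would likely clean up holes in the warped mask.

```python
import numpy as np

def propagate_mask(mask: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Forward-warp a binary mask from frame t to frame t+1 (illustrative only).

    mask: (H, W) boolean array for frame t.
    flow: (H, W, 2) array; flow[y, x] = (dx, dy) is the assumed displacement
          of pixel (x, y) from frame t to frame t+1.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)                       # coordinates of masked pixels
    new_xs = np.rint(xs + flow[ys, xs, 0]).astype(int)
    new_ys = np.rint(ys + flow[ys, xs, 1]).astype(int)
    # Drop pixels that move outside the frame.
    inside = (new_xs >= 0) & (new_xs < w) & (new_ys >= 0) & (new_ys < h)
    out = np.zeros((h, w), dtype=bool)
    out[new_ys[inside], new_xs[inside]] = True
    return out

def propagate_through_video(first_mask: np.ndarray, flows) -> np.ndarray:
    """Chain the warp over consecutive flows; returns a (T, H, W) mask stack."""
    masks = [first_mask.astype(bool)]
    for flow in flows:
        masks.append(propagate_mask(masks[-1], flow))
    return np.stack(masks)
```

With a single masked pixel and a uniform flow of one pixel to the right, each propagated frame shifts the mask by one column, and pixels pushed past the image border are discarded.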

Paper Structure

This paper contains 20 sections, 13 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Removal and addition samples from our LoVoRA dataset. The order from top to bottom is: source video, edited video, and editing instruction.
  • Figure 2: Overview of the LoVoRA dataset construction pipeline. Starting from high-quality image editing pairs, we synthesize instruction-based video editing data through five stages: I2V translation, mask generation, optical flow estimation, mask propagation, and video inpainting.
  • Figure 3: Overall architecture. The input video is encoded by a spatio-temporal VAE to produce latents. Encoded latents are channel-concatenated with noisy target latents and processed by a DiT backbone to predict the rectified-flow velocity field. A Diffusion Mask Predictor reads selected DiT token features and predicts a spatio-temporal diff mask used during training.
  • Figure 4: Qualitative comparison on object removal and addition tasks. Each row presents input videos and the corresponding editing results produced by different methods. Compared with the other methods, LoVoRA accurately localizes the target regions, cleanly removes or seamlessly inserts objects, and preserves the original background content with stable temporal coherence.
  • Figure 5: Ablation study on the Diffusion Mask Predictor (DMP). Compared to the model trained without DMP, LoVoRA with DMP produces more accurate localization.
  • ...and 6 more figures
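
The input construction described in the Figure 3 caption (source latents channel-concatenated with noisy target latents, with the backbone predicting a rectified-flow velocity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function name `build_training_example`, the `(C, T, H, W)` latent layout, the concatenation order, and the convention that `t = 0` is the clean target and `t = 1` is pure noise are all hypothetical choices.

```python
import numpy as np

def build_training_example(source_latents, target_latents, rng):
    """Build one rectified-flow training example (illustrative sketch).

    source_latents, target_latents: (C, T, H, W) arrays, standing in for
    VAE-encoded source and edited videos.
    Returns (model_input, velocity_target, t).
    """
    noise = rng.standard_normal(target_latents.shape)
    t = rng.uniform(0.0, 1.0)
    # Assumed convention: straight-line path from clean target (t=0) to noise (t=1).
    noisy_target = (1.0 - t) * target_latents + t * noise
    # The velocity along a straight-line path is constant: d(noisy_target)/dt.
    velocity_target = noise - target_latents
    # Channel-wise concatenation; the order is an assumption.
    model_input = np.concatenate([noisy_target, source_latents], axis=0)
    return model_input, velocity_target, t

rng = np.random.default_rng(0)
src = rng.standard_normal((4, 3, 8, 8))
tgt = rng.standard_normal((4, 3, 8, 8))
model_input, velocity_target, t = build_training_example(src, tgt, rng)
```

A backbone trained on such pairs would regress `velocity_target` from `model_input` and `t`; the mask predictor described in the caption would add a separate training-time loss on intermediate features, which is omitted here.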