Generative Video Propagation

Shaoteng Liu, Tianyu Wang, Jui-Hsien Wang, Qing Liu, Zhifei Zhang, Joon-Young Lee, Yijun Li, Bei Yu, Zhe Lin, Soo Ye Kim, Jiaya Jia

TL;DR

GenProp propagates a first-frame edit across an entire video with a generative propagation model built from a Selective Content Encoder (SCE) and an image-to-video (I2V) generator. During training, a Mask Prediction Decoder (MPD) and a region-aware loss guide the model to propagate edits while preserving unedited regions; the propagation is formalized as $v'_t = \mathcal{G}(\mathcal{E}(\hat{V}), v'_1, t)$ for $t \ge 2$. A synthetic data pipeline built from video instance segmentation datasets generates paired training sequences, enabling insertion, removal, editing, outpainting, and tracking without dense per-frame masks. Across tasks, GenProp reports state-of-the-art results on challenging editing and removal benchmarks and tracks edits and object effects robustly, suggesting practical utility for flexible video editing.
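To make the data flow of the formula concrete, here is a minimal Python sketch. It assumes $\hat{V}$ denotes the original video, as the abstract suggests. The names `propagate`, `toy_encoder`, and `toy_generator` are hypothetical, and the two toy functions are simple stand-ins for the learned encoder $\mathcal{E}$ and generator $\mathcal{G}$, not the paper's networks.

```python
from typing import Callable, List

Frame = List[List[float]]  # toy grayscale frame: rows of pixel values
Video = List[Frame]


def propagate(
    original_video: Video,
    edited_first_frame: Frame,
    encoder: Callable[[Video], Video],
    generator: Callable[[Video, Frame, int], Frame],
) -> Video:
    """Apply v'_t = G(E(V_hat), v'_1, t) for t >= 2 (t is 1-indexed)."""
    content = encoder(original_video)   # E(V_hat): features of the original video
    frames = [edited_first_frame]       # v'_1 is supplied by the user
    for t in range(2, len(original_video) + 1):
        frames.append(generator(content, edited_first_frame, t))
    return frames


def toy_encoder(video: Video) -> Video:
    # Assumption: identity "features"; the real encoder is a learned network.
    return video


def toy_generator(content: Video, first_frame: Frame, t: int) -> Frame:
    # Assumption: copy edited first-frame pixels forward and keep the
    # original frame t everywhere else. A real generator synthesizes motion.
    original_first, original_t = content[0], content[t - 1]
    return [
        [e if e != o1 else ot for e, o1, ot in zip(e_row, o1_row, ot_row)]
        for e_row, o1_row, ot_row in zip(first_frame, original_first, original_t)
    ]


video = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[4.0, 5.0], [6.0, 7.0]],
    [[8.0, 9.0], [10.0, 11.0]],
]
edited = [[99.0, 1.0], [2.0, 3.0]]  # only the top-left pixel is edited
result = propagate(video, edited, toy_encoder, toy_generator)
```

In this toy run the edited pixel is carried into every later frame while the remaining pixels follow the original video, which mirrors the intended split between propagated and preserved regions.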

Abstract

Large-scale video generation models have the inherent ability to realistically model natural scenes. In this paper, we demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models. Specifically, our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model. We propose a data generation scheme to cover multiple video tasks based on instance-level video segmentation datasets. Our model is trained by incorporating a mask prediction decoder head and optimizing a region-aware loss to help the encoder preserve the original content while the generation model propagates the modified region. This novel design opens up new possibilities: in editing scenarios, GenProp allows substantial changes to an object's shape; for insertion, the inserted objects can exhibit independent motion; for removal, GenProp effectively removes effects like shadows and reflections from the whole video; for tracking, GenProp is capable of tracking objects and their associated effects together. Experimental results demonstrate the leading performance of our model in various video tasks, and we further provide in-depth analyses of the proposed framework.

Paper Structure (33 sections, 9 equations, 18 figures, 3 tables)

Figures (18)

  • Figure 1: GenProp. We propose a generative video propagation framework (GenProp), which can seamlessly propagate any first-frame edit through the video. GenProp supports a wide range of video applications, including (a) complete object removal with effects such as shadows and reflections, (b) background replacement with realistic effects, (c) object insertion where inserted objects have physically plausible motion (e.g., blueberries falling while the spoon goes up), (d) tracking of objects and their associated effects, and (e) multiple edits (outpainting, insertion, removal) in a single inference run.
  • Figure 2: Model Overview. During inference, our framework takes in the original video as input through a selective content encoder (SCE) to retain content in unchanged regions. Changes applied to the first frame are propagated throughout the video using an I2V model while other regions remain intact.
  • Figure 3: Attention Map Visualization. We observe that the attention maps gradually focus on the regions to be removed and the I2V model is guided to generate new content in those regions.
  • Figure 4: Training Framework of GenProp. Our framework integrates a Selective Content Encoder and a Mask Prediction Decoder on top of the I2V generation model, forcing the model to propagate the edited region while preserving the content of the original video in all other regions. With synthetic data augmentations and task embeddings, our model is trained to propagate various changes in the first frame.
  • Figure 5: Region-Aware Loss. This loss helps the model to disentangle the edited region from the original content.
  • ...and 13 more figures
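
The exact form of the region-aware loss (Figure 5) is not given in this summary. As a rough illustration of the idea only, the sketch below splits a squared-error loss into an edited-region term and a preserved-region term using a binary mask. The function name `region_aware_loss`, the weights `w_edit` and `w_keep`, and the choice of mean squared error are all assumptions, not the paper's definition.

```python
from typing import Sequence


def region_aware_loss(
    pred: Sequence[float],
    target: Sequence[float],
    mask: Sequence[int],
    w_edit: float = 2.0,  # hypothetical weight for the edited region
    w_keep: float = 1.0,  # hypothetical weight for the preserved region
) -> float:
    """Weighted sum of per-region mean squared errors over flattened pixels.

    mask[i] == 1 marks a pixel inside the edited region, 0 outside it.
    """
    edit_sum = keep_sum = 0.0
    edit_n = keep_n = 0
    for p, y, m in zip(pred, target, mask):
        err = (p - y) ** 2
        if m:
            edit_sum += err
            edit_n += 1
        else:
            keep_sum += err
            keep_n += 1
    # Guard against empty regions so the loss stays finite.
    edit_term = edit_sum / max(edit_n, 1)
    keep_term = keep_sum / max(keep_n, 1)
    return w_edit * edit_term + w_keep * keep_term


loss = region_aware_loss(
    pred=[0.0, 0.0, 0.0, 0.0],
    target=[1.0, 0.0, 0.0, 2.0],
    mask=[1, 0, 0, 0],
)
```

Normalizing each term by its own pixel count keeps a small edited region from being drowned out by a large background, which is one plausible way a loss could separate the two regions.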