
PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models

Wonyong Seo, Jaeho Moon, Jaehyup Lee, Soo Ye Kim, Munchurl Kim

TL;DR

PropFly is a training pipeline for Propagation-based video editing that relies on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of off-the-shelf or precomputed paired video editing datasets.

Abstract

Propagation-based video editing enables precise user control by propagating a single edited frame to the following frames while maintaining the original context, such as motion and structure. However, training such models requires large-scale paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing that relies on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of off-the-shelf or precomputed paired video editing datasets. Specifically, PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on the fly. The source latent provides the structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. Our on-the-fly supervision ensures that the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms state-of-the-art methods on various video editing tasks, producing high-quality editing results.
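
The pair synthesis described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes the common rectified-flow convention $\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\boldsymbol{\epsilon}$ with velocity $\mathbf{v} = \boldsymbol{\epsilon} - \mathbf{x}_0$ (so the one-step clean estimate is $\mathbf{x}_t - t\,\mathbf{v}$), the standard CFG combination of unconditional and conditional predictions, and a hypothetical `toy_velocity` function standing in for the frozen VDM. The CFG values 1.0 and 5.0 are arbitrary illustration values.

```python
# Toy sketch of on-the-fly pair synthesis; all names and numbers are illustrative.

def cfg_velocity(v_uncond, v_cond, omega):
    """Standard classifier-free guidance: v_u + omega * (v_c - v_u)."""
    return [u + omega * (c - u) for u, c in zip(v_uncond, v_cond)]

def one_step_clean_estimate(x_t, v, t):
    """Assumed convention: x_t = (1 - t) * x0 + t * noise and v = noise - x0,
    so a single step back to t = 0 gives x0_hat = x_t - t * v."""
    return [x - t * vi for x, vi in zip(x_t, v)]

def synthesize_pair(x_t, t, velocity_fn, omega_low, omega_high):
    """Build a (source, edited) latent pair from one noised latent x_t."""
    v_uncond = velocity_fn(x_t, t, conditioned=False)
    v_cond = velocity_fn(x_t, t, conditioned=True)
    x0_low = one_step_clean_estimate(x_t, cfg_velocity(v_uncond, v_cond, omega_low), t)
    x0_high = one_step_clean_estimate(x_t, cfg_velocity(v_uncond, v_cond, omega_high), t)
    return x0_low, x0_high

# Hypothetical stand-in for the frozen VDM: the text condition shifts the
# predicted clean latent by a fixed "style" offset.
X0 = [0.0, 1.0, -1.0]
NOISE = [0.5, -0.5, 0.25]
STYLE_SHIFT = [0.2, 0.2, 0.2]

def toy_velocity(x_t, t, conditioned):
    shift = STYLE_SHIFT if conditioned else [0.0] * len(X0)
    return [n - (x + s) for n, x, s in zip(NOISE, X0, shift)]

t = 0.5
x_t = [(1 - t) * x + t * n for x, n in zip(X0, NOISE)]
source_latent, edited_latent = synthesize_pair(x_t, t, toy_velocity, omega_low=1.0, omega_high=5.0)
```

In this toy setting both estimates share the same underlying structure (`X0`) and differ only in how strongly the style offset is applied, which mirrors the aligned source/edited pair the abstract describes.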

Paper Structure (46 sections, 7 equations, 12 figures, 7 tables, 1 algorithm)


Figures (12)

  • Figure 1: Qualitative comparison of our PropFly against text-guided (STDF [yatim2024space_stdf], TokenFlow [tokenflow2023]) and propagation-based (AnyV2V [ku2024anyv2v], Señorita-2M [zi2025senorita]) video editing methods. Our PropFly demonstrates robust performance across a wide range of edits, from local editing to complex transformations. Note that all propagation-based methods were conditioned on the same edited frames (in red boxes).
  • Figure 2: An illustration of our on-the-fly data pair generation process based on one-step clean latent estimation. (a) Pre-trained VDM sampling process from intermediate noised latents $\mathbf{x}_t$ with an edited text prompt $\mathbf{c}_\text{aug}$, showing clean latent estimation after one-step sampling (Eq. \ref{eq:onestep}) and full sampling (an iterative ODE solve from $t$ to $0$). (b) Increasing the CFG scale ($\omega$) progressively strengthens the semantic edit (i.e., altering style, texture, and color). (c) Our method leverages this phenomenon efficiently: instead of performing computationally expensive full sampling, we utilize one-step clean latent predictions generated at a low CFG scale ($\omega_L$) and a high CFG scale ($\omega_H$). These on-the-fly predictions serve as the aligned source ($\hat{\mathbf{x}}_{0|t}^{\text{low}}$) and target ($\hat{\mathbf{x}}_{0|t}^{\text{high}}$) pair for training our PropFly.
  • Figure 3: Overview of our PropFly training pipeline. (a) A video $\mathbf{x}_0$ and its text prompt $\mathbf{c}_\text{text}$ are sampled from the video dataset, and an augmented text $\mathbf{c}_\text{aug}$ is synthesized by appending a random style prompt $\mathbf{c}_\text{style}$ to $\mathbf{c}_\text{text}$. (b) A frozen, pre-trained VDM $\theta$ synthesizes a data pair ($\hat{\mathbf{x}}_{0|t}^{\text{low}}, \hat{\mathbf{x}}_{0|t}^{\text{high}}$) on the fly from a single noised latent $\mathbf{x}_t$ using low and high CFG scales (guided by $\mathbf{c}_\text{aug}$). (c) A trainable adapter $\phi$ attached to the frozen VDM $\theta$ is then conditioned on the source video latent $\hat{\mathbf{x}}_{0|t}^{\text{low}}$ (for structure) and the edited first-frame latent of $\hat{\mathbf{x}}_{0|t}^{\text{high}}$. The adapter is trained via the GMFM loss to predict the VDM's text-guided, high-CFG velocity, effectively learning to edit the remaining video frames.
  • Figure 4: Qualitative comparison against propagation-based baselines AnyV2V [ku2024anyv2v] and Señorita-2M [zi2025senorita]. Our PropFly successfully propagates diverse edits (including object, background, and style changes) while preserving the motion of the source videos. In contrast, the baseline methods often fail to propagate the edits accurately or introduce severe visual artifacts. Zoom in for better visualization.
  • Figure 5: Visual results showing the effect of our key components. (a) The baseline trained with full sampling fails to align object motion, while the baseline trained with the conventional FM objective fails to propagate the edit. (b) Baselines trained without our RSPF or with the paired dataset lack generalization and fail to perform complex edits. In contrast, our PropFly achieves robust propagation performance and high-fidelity edits. Zoom in for details.
  • ...and 7 more figures
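
The training signal outlined in the Figure 3 caption can be sketched as follows. This excerpt does not give the exact form of the GMFM loss, so the sketch assumes a plain mean-squared error between the adapter's predicted velocity and the frozen VDM's high-CFG velocity. The style list, the helper names (`augment_prompt`, `guided_velocity`, `gmfm_loss_sketch`), and the prompt-joining format are hypothetical.

```python
import random

# Hypothetical style prompts; the actual style vocabulary is not given in this excerpt.
STYLE_PROMPTS = ["in watercolor style", "as a pencil sketch", "in pixel-art style"]

def augment_prompt(c_text, rng):
    """Form c_aug by appending a randomly chosen style prompt to c_text."""
    return f"{c_text}, {rng.choice(STYLE_PROMPTS)}"

def guided_velocity(v_uncond, v_cond, omega):
    """Standard classifier-free guidance combination of two velocity predictions."""
    return [u + omega * (c - u) for u, c in zip(v_uncond, v_cond)]

def gmfm_loss_sketch(v_adapter, v_uncond, v_cond, omega_high):
    """Assumed form: MSE between the adapter's velocity and the frozen VDM's
    high-CFG velocity, which serves as the regression target."""
    target = guided_velocity(v_uncond, v_cond, omega_high)
    return sum((a - b) ** 2 for a, b in zip(v_adapter, target)) / len(target)

rng = random.Random(0)
c_aug = augment_prompt("a dog running on the beach", rng)
loss = gmfm_loss_sketch(v_adapter=[1.0, 1.0], v_uncond=[0.0, 1.0], v_cond=[1.0, 1.0], omega_high=3.0)
```

Under this assumption the loss is zero exactly when the adapter reproduces the high-CFG velocity, which matches the caption's description of the adapter learning to predict that velocity without a text prompt of its own.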