DynVFX: Augmenting Real Videos with Dynamic Content

Danah Yatim, Rafail Fridman, Omer Bar-Tal, Tali Dekel

TL;DR

DynVFX augments real videos with newly generated dynamic content described by a text instruction, without requiring per-frame references or fine-tuning. It combines a pre-trained text-to-video diffusion transformer (DiT) with a vision-language model (VLM). Generation is guided by a novel Anchor Extended Attention (AnchorExtAttn) mechanism, which injects sparse anchors from the original scene to localize edits, and by an iterative refinement loop that ensures pixel-level harmonization. A VLM-based VFX assistant interprets the instruction and produces a scene prompt and object inventories, which steer generation and drive the segmentation-based masking used for blending. Across 57 video-text edits on 34 real videos, DynVFX achieves superior edit fidelity and content integration compared with strong baselines and handles camera motion, occlusions, and complex interactions robustly. Ablations confirm the critical roles of AnchorExtAttn and iterative refinement.

Abstract

We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained vision-language model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.

Paper Structure

This paper contains 25 sections, 3 equations, 18 figures, 2 tables, 1 algorithm.

Figures (18)

  • Figure 1: Pipeline. Top (pre-processing): Given an input video $\mathcal{V}_{\text{orig}}$ and an instruction $\mathcal{P}_{\text{VFX}}$, we (i) apply our VLM protocol for instruction interpretation, yielding a comprehensive scene description $\mathcal{P}_{\text{comp}}$ and descriptions of the original objects $\mathcal{O}_{\text{orig}}$ and the target object $\mathcal{O}_{\text{edit}}$; (ii) DDIM-invert $\mathcal{V}_{\text{orig}}$ to extract spatiotemporal keys/values $[\mathbf{K}_\text{orig}, \mathbf{V}_\text{orig}]$. Bottom (editing): We initialize the composed latent with $\boldsymbol{x}_{\text{comp}}=\boldsymbol{x}_{\text{orig}}$ and iterate over a list of descending noise levels $t=\tilde{T}\rightarrow T_{\min}$ used for noising $\boldsymbol{x}_{\text{comp}}$. At each iteration $t$ we (i) noise $\boldsymbol{x}_{\text{comp}}$ to noise level $t$ and sample with Anchor Extended Attention to output $\hat{\boldsymbol{x}}_{\text{comp}}$; (ii) update $\boldsymbol{x}_{\text{comp}}$ within the new content's masked regions $\boldsymbol{M}_{\text{VFX}}$ by adding the residual $\boldsymbol{x}_{\text{res}}=\boldsymbol{M}_{\text{VFX}}\cdot(\hat{\boldsymbol{x}}_{\text{comp}}-\boldsymbol{x}_{\text{orig}})$ to $\boldsymbol{x}_{\text{orig}}$. Repeating this loop gradually integrates the new content, yielding the edited video $\mathcal{V}_{\text{comp}}$.
  • Figure 2: Controlling Fidelity to the Original Scene Using Different Extended Attention Mechanisms. (a-b) SDEdit suffers from a trade-off between preserving the original scene and edit fidelity. (c-e) Three Extended Attention variants applied during sampling offer different levels of control: Full Extended Attention closely reconstructs the input scene; Masked Extended Attention allows new content to emerge but proves too constrained in overlapping regions; and our Anchor Extended Attention achieves the best results by applying dropout, extending attention only at sparse points within selected regions.
  • Figure 3: Ablations. (b) Excluding both AnchorExtAttn and the Iterative refinement process results in significant misalignment with the original scene and poor harmonization (e.g., the size of the puppy relative to the scene and boundary artifacts). (c) Omitting AnchorExtAttn leads to incorrect positioning of the new content. (d) Removing iterative refinement results in poor harmonization. Our full method (e) exhibits good localization and harmonization of the edit.
  • Figure 4: Sample Results of DynVFX. Our method supports a wide range of scene augmentations across diverse scenarios while maintaining realistic interaction, occlusion, lighting, and camera motion, for example: a golden retriever consistent with camera movement, transparent wings revealing the woman’s silhouette at sunset, and a tsunami flooding the city yet realistically respecting the car dashboard. See SM for full videos.
  • Figure 5: Qualitative Comparison of Text-Based Methods. Sample results comparing our method to SDEdit [meng2022sdedit], DDIM inversion [song2020_ddim], LoRA fine-tuning [lora], Gen-3 [gen3], and FlowEdit [kulikov2024flowedit]. As can be seen, our method better augments the original scene with new dynamic content that interacts naturally with existing elements in the scene. See SM for full video comparison.
  • ...and 13 more figures
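
The editing loop described in the Figure 1 caption and the sparse anchoring described in the Figure 2 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: attention is single-head over flattened tokens, `region_mask` and `keep_prob` are hypothetical names for the selected regions and the dropout rate, and `sample_fn` and `mask_fn` are hypothetical stand-ins for the "noise to level t and sample" step and for the segmentation-based extraction of the new content's mask.

```python
import numpy as np


def anchor_extended_attention(q, k_gen, v_gen, k_orig, v_orig,
                              region_mask, keep_prob, rng):
    """Attention whose keys/values are extended with sparse original anchors.

    q: (Nq, d) queries of the video being generated.
    k_gen, v_gen: (Ng, d) keys/values of the video being generated.
    k_orig, v_orig: (No, d) keys/values extracted from the original video.
    region_mask: (No,) boolean, True for original tokens in selected regions.
    keep_prob: probability of keeping each in-region token as an anchor.
    """
    # Dropout inside the selected regions: only a sparse subset survives.
    keep = region_mask & (rng.random(region_mask.shape[0]) < keep_prob)

    # Extend the generated keys/values with the surviving original anchors.
    k_ext = np.concatenate([k_gen, k_orig[keep]], axis=0)
    v_ext = np.concatenate([v_gen, v_orig[keep]], axis=0)

    # Standard scaled dot-product attention with a stable softmax.
    scores = q @ k_ext.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_ext, keep


def residual_update(x_orig, x_hat_comp, m_vfx):
    """x_comp = x_orig + M_VFX * (x_hat_comp - x_orig), as in Figure 1."""
    x_res = m_vfx * (x_hat_comp - x_orig)
    return x_orig + x_res


def iterative_refinement(x_orig, noise_levels, sample_fn, mask_fn):
    """Loop over descending noise levels, blending new content via the mask.

    sample_fn(x_comp, t) -> x_hat_comp  (hypothetical: noise, then sample)
    mask_fn(x_hat_comp)  -> m_vfx       (hypothetical: mask of new content)
    """
    x_comp = x_orig.copy()
    for t in noise_levels:
        x_hat_comp = sample_fn(x_comp, t)
        m_vfx = mask_fn(x_hat_comp)
        x_comp = residual_update(x_orig, x_hat_comp, m_vfx)
    return x_comp
```

With `keep_prob` set to 0 the attention reduces to ordinary self-attention over the generated tokens, and with `keep_prob` set to 1 every token in the selected regions is attended to, which corresponds to the Masked Extended Attention variant in Figure 2. Because the residual is always added to the original latent rather than to the previous iterate, pixels outside the mask stay identical to the original video at every iteration.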