ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer

Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang

Abstract

Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to their image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, we propose ViFeEdit, a video-free tuning framework for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results in controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.

Paper Structure (29 sections, 5 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: The visualization results of our proposed method ViFeEdit. Our proposed method can adapt text-to-video DiTs to various video editing tasks without any video data. Here, we demonstrate our proposed method on six fine-grained video editing tasks, including style transfer, rigid replacement, non-rigid replacement, color alternation, object addition and object removal.
  • Figure 2: The architecture of DiT blocks of (a) the original text-to-video Wan2.1 model and (b) our proposed video-free tuner ViFeEdit for video editing and control tasks. Here, we enable text-to-video DiTs to handle diverse video editing and control tasks without any video data. Specifically, the source video $C_V$ is jointly fed into the model and interacts with the noisy video latent $Z$ in the 2D spatial attention branch, providing explicit reference guidance.
  • Figure 3: The visualization results of baselines and our proposed method on consistent style transfer tasks. $^*$ means that the pretrained model VACE is further finetuned on the paired image data for style transfer learning.
  • Figure 4: The visualization results of baselines and our proposed method on rigid and non-rigid replacement tasks. $^*$ means that the pretrained VACE is further finetuned on the paired image data for editing tasks.
  • Figure 5: The visualization results of baselines and our proposed method on color alternation, object addition and object removal tasks. $^*$ means that the pretrained VACE is further finetuned on the paired image data for editing tasks.
  • ...and 5 more figures
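
The Figure 2 caption describes the source video $C_V$ interacting with the noisy latent $Z$ inside a 2D spatial attention branch, as opposed to the full 3D attention of the base model. The NumPy sketch below is one possible reading of that contrast, not the authors' implementation: the function names, the tensor layout `(frames, tokens_per_frame, dim)`, the use of identity projections in place of learned query/key/value weights, and the choice to concatenate source tokens as extra keys and values are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    # q: (..., Nq, D), k and v: (..., Nk, D)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def full_3d_attention(z):
    # Every token attends to all tokens of all frames.
    # z: (T, S, D) with T frames and S spatial tokens per frame.
    T, S, D = z.shape
    flat = z.reshape(1, T * S, D)
    return attention(flat, flat, flat).reshape(T, S, D)

def spatial_2d_attention(z, c_v):
    # Hypothetical spatial branch: frame t of the noisy latent attends only
    # to its own tokens and to the tokens of source frame t (assumption).
    # z, c_v: (T, S, D); the leading axis T acts as a batch axis.
    kv = np.concatenate([z, c_v], axis=1)  # (T, 2S, D)
    return attention(z, kv, kv)            # (T, S, D)
```

Because the frame axis is treated as a batch axis in `spatial_2d_attention`, a single image is simply the case `T = 1`, which is one way such a branch could be tuned on 2D images alone. Temporal mixing would then have to come from the 3D attention path, which this sketch leaves untouched.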