MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

Haoyu Ma, Shahin Mahdizadehaghdam, Bichen Wu, Zhipeng Fan, Yuchao Gu, Wenliang Zhao, Lior Shapira, Xiaohui Xie

TL;DR

MaskINT addresses the high computational cost of diffusion-based, text-driven video editing by decoupling the task into two stages: (1) zero-shot joint editing of two keyframes using a pre-trained text-to-image diffusion model, and (2) structure-aware interpolation with a non-autoregressive masked transformer, which generates the intermediate frames guided by structural cues. The interpolation stage uses a dual-token representation (color and structure) with window-restricted attention and is trained with masked token modeling (MTM) on video-only data, enabling fast, parallel frame generation. The approach achieves temporal coherence and text alignment comparable to diffusion-based methods while running roughly 5–7× faster, and it scales to longer videos via segment-wise processing. This work highlights the potential of masked transformers for efficient, structure-guided video editing and offers a practical path toward real-time text-based video manipulation without large paired datasets.
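The segment-wise processing mentioned above can be illustrated with a small sketch. It only shows one plausible way to cut a long video into keyframe-bounded segments; the function names, the `segment_len` parameter, and the choice to let adjacent segments share a keyframe are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Tuple


def split_into_segments(num_frames: int, segment_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) keyframe index pairs covering frames 0..num_frames-1.

    Assumption: adjacent segments share a keyframe, so the end keyframe of one
    segment is reused as the start keyframe of the next.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to form a segment")
    if segment_len < 2:
        raise ValueError("segment_len must be at least 2")
    segments = []
    start = 0
    last = num_frames - 1
    while start < last:
        # Each segment spans at most segment_len frames, keyframes included.
        end = min(start + segment_len - 1, last)
        segments.append((start, end))
        start = end
    return segments


def intermediate_frames(segment: Tuple[int, int]) -> List[int]:
    """Indices strictly between the two keyframes, i.e. the frames to interpolate."""
    start, end = segment
    return list(range(start + 1, end))


if __name__ == "__main__":
    for seg in split_into_segments(num_frames=16, segment_len=8):
        print(seg, intermediate_frames(seg))
```

In this sketch, each segment's two boundary frames would be edited by the diffusion model in stage one, and the indices returned by `intermediate_frames` would be filled in by the interpolation model in stage two.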

Abstract

Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, and they often require large-scale paired datasets for training, which makes deployment in real applications challenging. To address these issues, this paper breaks down the text-based video editing task into two stages. First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few keyframes in a zero-shot way. Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the edited keyframes, using structural guidance from the intermediate frames. Experimental results suggest that MaskINT achieves performance comparable to diffusion-based methods while significantly improving inference time. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.
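To make "non-autoregressive masked generative transformer" more concrete, the following is a minimal sketch of iterative parallel decoding: all tokens start masked, the model predicts every position at once, and the most confident predictions are committed at each step. The cosine schedule and confidence-based selection follow the general MaskGIT-style recipe and are assumptions here, since the abstract does not specify the decoding schedule. `MASK`, `toy_predictor`, and `parallel_decode` are hypothetical names, and the random predictor merely stands in for a trained transformer that would also be conditioned on keyframe and structure tokens.

```python
import math
import random
from typing import Callable, List, Tuple

MASK = -1  # hypothetical id marking a position that is still masked

Prediction = Tuple[int, float]  # (token id, confidence)


def toy_predictor(tokens: List[int], rng: random.Random) -> List[Prediction]:
    """Stand-in for the transformer: one (token, confidence) guess per position."""
    return [(rng.randrange(16), rng.random()) for _ in tokens]


def parallel_decode(
    num_tokens: int,
    steps: int,
    predictor: Callable[[List[int], random.Random], List[Prediction]],
    seed: int = 0,
) -> List[int]:
    """Fill all tokens in at most `steps` parallel passes instead of one by one."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predictor(tokens, rng)
        # Cosine schedule: how many tokens should remain masked after this step.
        keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * step / steps))
        num_to_fill = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(parallel_decode(num_tokens=64, steps=8, predictor=toy_predictor))
```

Because every pass predicts all positions at once, the number of model calls is bounded by `steps` rather than by the number of tokens, which is the source of the speed advantage over token-by-token autoregressive decoding.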

Paper Structure (38 sections, 4 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Examples of video editing with MaskINT.
  • Figure 1: Examples of failure cases.
  • Figure 2: Overview of MaskINT. MaskINT disentangles the video editing task into two separate stages: joint keyframe editing and structure-aware frame interpolation.
  • Figure 2: Additional qualitative comparisons with diffusion-based methods. Frames with red bounding boxes are jointly edited keyframes.
  • Figure 3: Examples of video editing with MaskINT. Frames with red bounding boxes are jointly edited keyframes.
  • ...and 4 more figures