SketchVideo: Sketch-based Video Generation and Editing

Feng-Lin Liu, Hongbo Fu, Xintao Wang, Weicai Ye, Pengfei Wan, Di Zhang, Lin Gao

TL;DR

SketchVideo addresses the challenge of fine-grained geometry and motion control in video generation and editing using sparse sketches. It introduces a sketch-conditioned DiT-based backbone with five distributed sketch control blocks and an inter-frame attention mechanism to propagate keyframe sketches across frames, plus a video insertion module and latent fusion to preserve unedited content. The approach demonstrates superior generation and editing performance against strong baselines, with improved sketch fidelity, temporal coherence, and seamless integration of edited regions. This enables interactive, geometry-aware video creation and editing with practical implications for content creation and video editing pipelines.

Abstract

Video generation and editing conditioned on text prompts or images have undergone significant advancements. However, challenges remain in accurately controlling global layout and geometric details through text alone, and in supporting motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and to support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks. Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate such temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism to analyze the relationship between the keyframes and each video frame. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial features and dynamic motion. During inference, we use latent fusion to accurately preserve unedited regions. Extensive experiments demonstrate that our SketchVideo achieves superior performance in controllable video generation and editing.
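
As a rough illustration of the latent fusion step mentioned in the abstract, the sketch below blends edited and original latents with a binary mask. The abstract does not spell out the fusion rule, so the mask-weighted blend, the function name `fuse_latents`, the tensor layout `(frames, channels, height, width)`, and the convention that 1 marks the edited region are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def fuse_latents(edited, original, mask):
    """Mask-weighted blend of two latent videos (hypothetical helper).

    edited, original: arrays of shape (frames, channels, height, width).
    mask: array of shape (frames, 1, height, width); 1 marks the edited
          region and 0 marks the region to preserve (assumed convention).
    """
    mask = mask.astype(edited.dtype)
    # Inside the mask keep the edited latent, outside keep the original.
    return mask * edited + (1.0 - mask) * original


# Toy example: 4 latent frames, 16 channels, 8x8 spatial grid.
rng = np.random.default_rng(0)
original = rng.normal(size=(4, 16, 8, 8))
edited = rng.normal(size=(4, 16, 8, 8))
mask = np.zeros((4, 1, 8, 8))
mask[:, :, 2:6, 2:6] = 1.0  # a box-shaped editing region

fused = fuse_latents(edited, original, mask)
```

With a hard 0/1 mask, positions outside the box are copied from the original latent unchanged, which is the property that lets unedited regions be preserved exactly.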

Paper Structure

This paper contains 14 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Our method enables high-quality video generation (a) and editing (b) based on sketch and text inputs. (a) Top: With the same text prompt, different keyframe sketches lead to results with similar semantics but diverse sketch-faithful geometry. (a) Bottom: With the same sketches, varied text prompts yield diverse appearances. (b) Users can also edit real videos by drawing on keyframe sketches, with edits automatically propagated even when edited objects in original videos have translation and rotation.
  • Figure 2: Our framework for sketch-based video generation and editing. (a) Our sketch condition network for the DiT-based video generation architecture has a skip structure and five sketch control blocks that predict residual features. (b) For generation, features are extracted from temporally sparse input sketches and propagated through inter-frame attention. The input sketches are provided for one or two keyframes (the second sketch is shown as a dotted line). The prompt and timestep inputs are shown in the top left corner of (b). (c) For editing, the same sketch control block as in (b) is used, with an additional video insertion module and video masks $M^{1:N}$ to analyze the relationship between edited and unedited regions. The 3D causal VAE is omitted to save space.
  • Figure 3: The sketch-based video generation results. Left: The input text prompts and sketches. Right: The video generation results. The generated results are of high quality and faithful to the input sketches. Our method can handle one or two keyframe sketches at arbitrary user-specified time points (the frames corresponding to the input time points are highlighted in orange).
  • Figure 4: The sketch-based video editing results. For each example, the text prompts and sketches are shown on the left. On the right, the input real videos are shown at the top, while the edited results with the control keyframe highlighted in orange are shown at the bottom. The editing region masks are manually provided by users, highlighted as orange boxes. Our method generates realistic local editing results.
  • Figure 5: The comparison results of sketch-based video generation. Text prompts are shown on the top. On the left, we show the input sketches and sketch-based image generation results by ControlNet. On the right, we show the results of the compared approaches, including AMT, SparseCtrl, Ctrl-CogVideo, and ours. Our method produces better results, especially for the intermediate frames.
  • ...and 3 more figures
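
The inter-frame attention described in the Figure 2(b) caption, where sparse keyframe sketch features are propagated to every frame, can be sketched roughly as cross-attention from each frame's tokens to the keyframe tokens. This is a minimal illustration under stated assumptions: the function name `inter_frame_attention`, the token layout, and the absence of learned query/key/value projections are all simplifications introduced here and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inter_frame_attention(frame_tokens, keyframe_tokens):
    """Propagate keyframe features to all frames (hypothetical helper).

    frame_tokens:    (N, T, D) tokens for N video frames (queries).
    keyframe_tokens: (K, T, D) sketch tokens for K keyframes (keys and values).
    Returns an (N, T, D) array in which each frame token is a weighted
    average of the keyframe tokens.
    """
    n, t, d = frame_tokens.shape
    kv = keyframe_tokens.reshape(-1, d)           # (K*T, D)
    scores = frame_tokens @ kv.T / np.sqrt(d)     # (N, T, K*T)
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ kv                           # (N, T, D)


# Toy example: 6 frames, 2 keyframes, 5 tokens per frame, 8-dim features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 5, 8))
keyframes = rng.normal(size=(2, 5, 8))
propagated = inter_frame_attention(frames, keyframes)
```

Because the keys and values come only from the one or two keyframes, the cost of this step grows with the number of keyframes rather than with the square of the number of frames, which is one plausible reason to prefer it over full temporal attention on the sketch branch.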