
TrackGo: A Flexible and Efficient Method for Controllable Video Generation

Haitao Zhou, Chuang Wang, Rui Nie, Jinlin Liu, Dongdong Yu, Qian Yu, Changhu Wang

TL;DR

TrackGo introduces a flexible, efficient framework for controllable video generation. It converts user-drawn free-form masks and arrows into point trajectories and injects them into the temporal self-attention layers through a lightweight TrackAdapter. The TrackAdapter uses a dual-branch attention mechanism and an attention-mask strategy to isolate the motion of the target while preserving background coherence, and an attention loss accelerates convergence. Empirical results on a high-quality internal dataset show state-of-the-art performance on FVD, FID, and ObjMC, with faster inference and fewer added parameters than baselines such as DragAnything and DragNUWA. The approach supports complex scenes with multiple objects and intricate motions and points toward practical diffusion-based controllable video synthesis, while acknowledging limitations in handling large motions and 3D rotations.

Abstract

Recent years have seen substantial progress in diffusion-based controllable video generation. However, achieving precise control in complex scenarios, including fine-grained object parts, sophisticated motion trajectories, and coherent background movement, remains a challenge. In this paper, we introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation. This method offers users a flexible and precise mechanism for manipulating video content. We also propose the TrackAdapter for control implementation, an efficient and lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. This design leverages our observation that the attention map of these layers can accurately activate regions corresponding to motion in videos. Our experimental results demonstrate that our new approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics such as FVD, FID, and ObjMC scores.

Paper Structure (33 sections, 11 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Attention map visualization of the last temporal self-attention layer in the Stable Video Diffusion model. The highlighted areas in the attention map correspond to the moving areas in the video. The video has a total of 25 frames, and we selected frames 1, 12, and 23 at equal intervals for visualization. $Attn(i,j)$ denotes the temporal attention map between frame $i$ and frame $j$.
  • Figure 2: Example videos generated by our proposed TrackGo. Given an initial frame, users specify the target moving object(s) or part(s) using free-form masks and indicate the desired movement trajectory with arrows. TrackGo is capable of generating subsequent video frames with precise control. It can handle complex scenarios that involve multiple objects, fine-grained object parts, and sophisticated movement trajectories.
  • Figure 3: Top: Pipeline of point trajectory generation. The user's inputs are divided into masks and trajectory vectors for processing. Each mask corresponds to a trajectory vector. For each mask area, $K*s$ points are randomly selected. The trajectory vector is then subdivided by the frame number to obtain the relative displacement $\mathcal{T}$ of each point between adjacent frames. The final step is to combine these data to construct the point trajectories. Bottom: Overview of TrackGo. TrackGo is built on an image-to-video diffusion model and generates videos from the user input $\bm{I}$ and the latent input $\bm{z}_t$. Through the point trajectory generation pipeline, point trajectories $\bm{P}$ are obtained from $\bm{I}$. The point trajectories $\bm{P}$ are then passed through the Encoder $\mathcal{E}$ and injected into the model via the TrackAdapter. The panel labeled "Architecture of TrackAdapter" describes the calculation process of the TrackAdapter.
  • Figure 4: Qualitative comparisons between our method and baseline methods, DragAnything and DragNUWA. We use colorful symbols to highlight the undesired parts of the results generated by the other two approaches.
  • Figure 5: Comparison of results for different values of the unspecified-area suppression intensity $\tau$. The top row shows the last generated frame for each value of $\tau$. The bottom row provides a magnified view, with red boxes highlighting the differences.
  • ...and 6 more figures
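
The point trajectory pipeline described in the Figure 3 caption (sample points inside each mask, split the trajectory vector into per-frame displacements, combine the two) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `build_point_trajectories`, its parameters, the straight-line arrow, and the choice to divide by `num_frames - 1` so that the last frame lands on the arrow tip are all assumptions made for this example. The number of sampled points, written $K*s$ in the caption, is passed in as a plain `num_points` argument.

```python
import numpy as np


def build_point_trajectories(mask, vector, num_frames, num_points, seed=0):
    """Hypothetical helper: build straight-line point trajectories.

    mask:       (H, W) boolean array marking the user-selected region.
    vector:     (dx, dy) total displacement of the arrow, in pixels.
    num_frames: number of video frames.
    num_points: number of points to sample inside the mask.
    Returns an array of shape (num_frames, num_points, 2) holding (x, y).
    """
    rng = np.random.default_rng(seed)

    # Candidate pixel coordinates inside the mask.
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("mask is empty")

    # Randomly pick starting points; sample with replacement only if the
    # mask has fewer pixels than requested points.
    idx = rng.choice(len(xs), size=num_points, replace=len(xs) < num_points)
    start = np.stack([xs[idx], ys[idx]], axis=-1).astype(np.float64)

    # Assumption: the arrow is split evenly over the frame transitions,
    # giving the same relative displacement between adjacent frames.
    step = np.asarray(vector, dtype=np.float64) / (num_frames - 1)

    # Frame f is offset by f * step from the starting position.
    offsets = np.arange(num_frames)[:, None, None] * step[None, None, :]
    return start[None, :, :] + offsets


if __name__ == "__main__":
    mask = np.zeros((64, 64), dtype=bool)
    mask[10:20, 5:15] = True  # a small rectangular "free-form" mask
    traj = build_point_trajectories(mask, vector=(30, -6),
                                    num_frames=25, num_points=8)
    print(traj.shape)
```

With several masks, the same function would be called once per mask and arrow pair and the results concatenated along the point axis. How the resulting trajectories are encoded before reaching the TrackAdapter is not specified in this section and is left out of the sketch.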