TrackGo: A Flexible and Efficient Method for Controllable Video Generation
Haitao Zhou, Chuang Wang, Rui Nie, Jinlin Liu, Dongdong Yu, Qian Yu, Changhu Wang
TL;DR
TrackGo introduces a flexible, efficient framework for controllable video generation by converting user-drawn free-form masks and arrows into point trajectories and injecting them into temporal self-attention via a lightweight TrackAdapter. The TrackAdapter employs a dual-branch attention mechanism and an attention-mask strategy to isolate target motion while preserving background coherence, and an attention loss accelerates convergence. Empirical results on a high-quality internal dataset show state-of-the-art performance in FVD, FID, and ObjMC, with faster inference and fewer added parameters than baselines such as DragAnything and DragNUWA. The approach supports complex scenes with multiple objects and intricate motions and points toward practical diffusion-based controllable video synthesis. The authors acknowledge limitations in handling large motions and 3D rotations.
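To make the first step concrete, the following is a minimal sketch of how a free-form mask and a single arrow could be turned into point trajectories. It is not the authors' procedure: the function name `mask_arrow_to_trajectories`, the random sampling of points inside the mask, and the linear interpolation along a straight arrow are all assumptions chosen for illustration.

```python
import numpy as np

def mask_arrow_to_trajectories(mask, arrow_start, arrow_end,
                               num_frames, num_points=8, seed=0):
    """Hypothetical helper: sample points inside a binary mask and move
    them linearly along a user-drawn arrow.

    mask:        (H, W) boolean or 0/1 array marking the target region.
    arrow_start: (x, y) tail of the arrow.
    arrow_end:   (x, y) head of the arrow.
    Returns an array of shape (num_frames, P, 2) holding (x, y) per frame,
    where P = min(num_points, number of mask pixels).
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("mask is empty")

    # Assumption: pick a random subset of mask pixels as control points.
    idx = rng.choice(len(xs), size=min(num_points, len(xs)), replace=False)
    start_pts = np.stack([xs[idx], ys[idx]], axis=-1).astype(np.float64)

    # Assumption: every point shares the arrow's displacement, applied
    # linearly over time (frame 0 = no motion, last frame = full motion).
    displacement = (np.asarray(arrow_end, dtype=np.float64)
                    - np.asarray(arrow_start, dtype=np.float64))
    t = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    traj = start_pts[None] + t * displacement[None, None]

    # Keep trajectories inside the image bounds.
    h, w = mask.shape
    traj[..., 0] = np.clip(traj[..., 0], 0, w - 1)
    traj[..., 1] = np.clip(traj[..., 1], 0, h - 1)
    return traj
```

A curved arrow would replace the single displacement vector with a polyline sampled at `num_frames` positions; the rest of the sketch would stay the same.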
Abstract
Recent years have seen substantial progress in diffusion-based controllable video generation. However, achieving precise control in complex scenarios, including fine-grained object parts, sophisticated motion trajectories, and coherent background movement, remains a challenge. In this paper, we introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation. This method offers users a flexible and precise mechanism for manipulating video content. We also propose the TrackAdapter for control implementation, an efficient and lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. This design leverages our observation that the attention maps of these layers can accurately activate regions corresponding to motion in videos. Our experimental results demonstrate that TrackGo, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics such as FVD, FID, and ObjMC.
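The idea of attaching an adapter to temporal self-attention can be illustrated with a small NumPy sketch. This is a simplified reading, not the TrackAdapter as implemented by the authors: the function names, the use of separate projection weights for the second branch, and the hard per-location blending with a binary `motion_mask` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x, wq, wk, wv):
    """Self-attention across frames, computed independently per location.

    x: (N, F, C) features for N spatial locations, F frames, C channels.
    Returns the attended features (N, F, C) and the attention map (N, F, F).
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

def dual_branch_temporal_attention(x, base_w, adapter_w, motion_mask):
    """Hypothetical dual-branch layer.

    base_w, adapter_w: tuples (wq, wk, wv) of (C, C) matrices for the
        pretrained branch and the adapter branch respectively.
    motion_mask: (N, F) array, 1 where a location belongs to the target
        region in that frame, 0 for background.
    Assumption: the adapter branch drives the target region and the
    pretrained branch drives everything else.
    """
    base_out, _ = temporal_attention(x, *base_w)
    adapter_out, _ = temporal_attention(x, *adapter_w)
    m = np.asarray(motion_mask, dtype=x.dtype)[..., None]  # (N, F, 1)
    return m * adapter_out + (1.0 - m) * base_out
```

With an all-zero mask the layer reduces to the pretrained branch, which is one way to see why background content can stay coherent while only the masked region follows the adapter.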
