ATI: Any Trajectory Instruction for Controllable Video Generation

Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, Chongyang Ma

TL;DR

ATI presents a unified, trajectory-based approach to motion control in video generation by injecting user-defined point trajectories into the latent space of pretrained diffusion-based video models. A Gaussian feature model and tail dropout regularization enable fine-grained control over local, object-level, and camera motions without retraining base backbones, demonstrated on Seaweed-7B and Wan2.1-14B. Extensive experiments show improved controllability and visual quality over prior modular methods and commercial systems, with practical training and inference times and an interactive trajectory editor for user-friendly design. The work highlights the versatility and compatibility of trajectory-based latent conditioning for integrated motion control in video synthesis.

Abstract

We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion using trajectory-based inputs. In contrast to prior methods that address these motion types through separate modules or task-specific designs, our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models via a lightweight motion injector. Users can specify keypoints and their motion paths to control localized deformations, entire object motion, virtual camera dynamics, or combinations of these. The injected trajectory signals guide the generative process to produce temporally consistent and semantically aligned motion sequences. Our framework demonstrates superior performance across multiple video motion control tasks, including stylized motion effects (e.g., motion brushes), dynamic viewpoint changes, and precise local motion manipulation. Experiments show that our method provides significantly better controllability and visual quality compared to prior approaches and commercial solutions, while remaining broadly compatible with various state-of-the-art video generation backbones. Project page: https://anytraj.github.io/.

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: ATI is able to generate a video given an initial frame (left) and a set of user-specified trajectories. Green dots denote the starting points, and red dots indicate the ending points of each trajectory. On the right, we show uniformly sampled frames from the generated video, with colored dots tracking the position of each trajectory point over time.
  • Figure 2: ATI takes an image and user-specified trajectories as inputs. The point-wise trajectories are injected into the latent condition for generation. Videos are decoded from the latents denoised by the DiT.
  • Figure 3: The Trajectory Instruction module computes a latent feature from a point's trajectory. During inference, given the point's location in the first frame (i.e., the input image), we sample the feature at that location using bilinear interpolation. We then compute a spatial Gaussian distribution for each visible point, centered at its corresponding location in every subsequent frame.
  • Figure 4: Object Motion Control. Left: the input image overlaid with user‑specified trajectories—green dots mark each trajectory's start point, and arrows mark each end point. Endpoint color encodes trajectory length, indicating that some trajectories span only part of the generated video. Right: five frames uniformly sampled from the generated video. Dot colors serve only to distinguish between trajectories.
  • Figure 5: Video generation results with camera control. Left: Input image superimposed with user-specified trajectories. Right: Five frames uniformly sampled from the generated video.
  • ...and 2 more figures
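
The Figure 3 caption describes two steps: bilinearly sampling a feature at the point's first-frame location, then spreading it with a spatial Gaussian at the point's location in each later frame where it is visible. The NumPy sketch below illustrates that idea for a single trajectory only. It is not the authors' implementation: the function names, the array layout `(H, W, C)`, the `sigma` value, the multiplication of the Gaussian weights by the sampled feature, and leaving frame 0 and invisible frames at zero are all assumptions made for illustration.

```python
import numpy as np


def bilinear_sample(feat, x, y):
    """Sample a (C,) vector from feat of shape (H, W, C) at continuous (x, y)."""
    H, W, _ = feat.shape
    # Clamp to the valid grid so neighbours are always in range.
    x = float(np.clip(x, 0, W - 1))
    y = float(np.clip(y, 0, H - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom


def gaussian_map(H, W, cx, cy, sigma):
    """Unnormalized 2D Gaussian of shape (H, W) peaking (value 1) at (cx, cy)."""
    ys, xs = np.mgrid[0:H, 0:W]
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


def trajectory_condition(first_feat, track, visible, sigma=1.0):
    """Build a (T, H, W, C) conditioning tensor for one trajectory.

    first_feat: (H, W, C) latent feature of the input image (assumed layout).
    track:      (T, 2) array of (x, y) positions in latent coordinates.
    visible:    (T,) boolean mask; invisible frames receive no signal.
    sigma:      Gaussian width in latent pixels (hypothetical default).
    """
    H, W, C = first_feat.shape
    T = track.shape[0]
    # Feature carried by the point, taken from its first-frame location.
    f = bilinear_sample(first_feat, track[0, 0], track[0, 1])
    out = np.zeros((T, H, W, C), dtype=first_feat.dtype)
    for t in range(1, T):  # subsequent frames only
        if not visible[t]:
            continue
        w = gaussian_map(H, W, track[t, 0], track[t, 1], sigma)
        out[t] = w[..., None] * f
    return out


# Toy usage: a 4x5 latent grid with 2 channels and a 3-frame trajectory.
first_feat = np.arange(4 * 5 * 2, dtype=float).reshape(4, 5, 2)
track = np.array([[2.0, 1.0], [3.0, 1.0], [3.0, 2.0]])
visible = np.array([True, True, False])
cond = trajectory_condition(first_feat, track, visible)
```

In this toy setup the point is marked invisible in the last frame, so that frame carries no signal. This mirrors the Figure 4 caption's note that some trajectories span only part of the generated video. How several trajectories would be combined or normalized is not covered by this sketch.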