FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control

Zhiyuan Zhang, Can Wang, Dongdong Chen, Jing Liao

TL;DR

FlexTraj addresses controllability in diffusion-based image-to-video generation by introducing a unified point-trajectory representation that encodes each point as $p_i^t = (x_i^t, y_i^t, z_i^t, s_i, u_i, a_i)$. It projects trajectories into two conditioning videos, $V_{ID}$ and $V_{Color}$, processed by a pretrained video VAE to produce conditioning tokens, which are injected into a diffusion backbone via an efficient sequence-concatenation strategy with LoRA adaptation and a causal mask. A density and alignment annealing curriculum trains the model from complete to incomplete and finally unaligned supervision, enabling robust performance across dense, sparse, and unaligned inputs. Experiments on DAVIS and FlexBench demonstrate superior trajectory control (low TrajErr, high TrajSIM) while maintaining competitive video quality, enabling practical applications in motion cloning, interpolation, camera redirection, and mesh animation.
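To make the representation concrete, the following is a minimal sketch of how points of the form $p_i^t = (x_i^t, y_i^t, z_i^t, s_i, u_i, a_i)$ might be stored and projected into an ID-coded video and a color-cue video. It rests on several assumptions that the summary does not confirm: that $s_i$ is the segmentation ID, $u_i$ the trajectory ID, and $a_i$ the optional color; that $z$ is a depth where larger means farther away; and that each point covers a single pixel. The names `TrackPoint` and `rasterize` are hypothetical, and the VAE encoding step is not shown.

```python
from typing import List, NamedTuple, Optional, Tuple

BACKGROUND_ID = (0, 0)
BACKGROUND_COLOR = (0, 0, 0)


class TrackPoint(NamedTuple):
    """One point p_i^t at a single frame (field meanings are assumptions)."""
    x: float
    y: float
    z: float                  # assumed: depth, larger = farther from camera
    seg_id: int               # assumed to correspond to s_i
    track_id: int             # assumed to correspond to u_i
    color: Optional[Tuple[int, int, int]] = None  # assumed to correspond to a_i


def rasterize(frames: List[List[TrackPoint]], height: int, width: int):
    """Project per-frame points into two videos indexed as [t][y][x].

    Returns (v_id, v_color): v_id stores (seg_id, track_id) per pixel,
    v_color stores an RGB triple per pixel.
    """
    v_id, v_color = [], []
    for points in frames:
        id_img = [[BACKGROUND_ID] * width for _ in range(height)]
        color_img = [[BACKGROUND_COLOR] * width for _ in range(height)]
        # Draw far points first so that nearer points overwrite them.
        for p in sorted(points, key=lambda q: q.z, reverse=True):
            px, py = int(round(p.x)), int(round(p.y))
            if 0 <= px < width and 0 <= py < height:
                id_img[py][px] = (p.seg_id, p.track_id)
                color_img[py][px] = (
                    p.color if p.color is not None else BACKGROUND_COLOR
                )
        v_id.append(id_img)
        v_color.append(color_img)
    return v_id, v_color
```

Because the IDs are attached to the point rather than to the pixel, the same `track_id` reappears wherever the point moves, which is what makes the ID-coded video temporally consistent in this sketch.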

Abstract

We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned conditions. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control, and mesh animation.
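The sequence-concatenation scheme can be illustrated with a small attention-mask sketch. It assumes the tokens are ordered as text, then video, then condition, and that the mask lets condition tokens attend only to other condition tokens while text and video tokens attend to everything. This is one plausible reading of a mask that keeps condition features independent of the noisy video tokens; the paper's exact mask layout may differ, and `build_concat_mask` is a hypothetical name.

```python
from typing import List


def build_concat_mask(n_text: int, n_video: int, n_cond: int) -> List[List[bool]]:
    """Return mask[query][key]; True means the query may attend to the key.

    Assumed token order: [text | video | condition].
    """
    n = n_text + n_video + n_cond
    cond_start = n_text + n_video
    mask = [[True] * n for _ in range(n)]
    # Assumption: condition queries never look at text or video keys,
    # so their features do not depend on the current denoising state.
    for q in range(cond_start, n):
        for k in range(cond_start):
            mask[q][k] = False
    return mask
```

Under this assumption the condition-token features are the same at every denoising step, so they could in principle be computed once and reused, which would be consistent with the efficiency claim above.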

Paper Structure

This paper contains 22 sections, 7 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: FlexTraj supports multi-granularity trajectory control, including dense (e.g., motion clone, camera redirection, mesh-to-video), spatially sparse (e.g., drag-to-video, partial mesh-to-video), and temporally sparse (e.g., motion interpolation, where motion is provided only on temporally sparse frames) settings. It also allows unaligned control (e.g., flexible action control, coarse mesh-to-video). See project page: https://bestzzhang.github.io/FlexTraj.
  • Figure 2: Overview of the FlexTraj framework. Given 3D-tracking points annotated with TrackID, SegID, and optional Color, users can sparsify or shift trajectories to define spatially sparse, temporally sparse, or unaligned controls. These modified trajectories are projected into condition videos (ID-coded and color-cue) and combined with the first frame and text prompt as inputs to a video diffusion model via efficient sequence-concatenation.
  • Figure 3: Comparison of condition-injection frameworks. (a) ControlNet-Style condition injection. (b) Sequence-Concatenation condition injection. (c) Our Efficient Sequence-Concatenation with LoRA and masked attention. (d) Causal mask.
  • Figure 4: Qualitative comparison on dense control. MagicMotion [li2025magicmotion] and Go-with-the-Flow [burgert2025go] struggle with fine-grained details; DAS [gu2025diffusion] fails to handle newly emerging points, whereas our method closely follows the source motion.
  • Figure 5: Qualitative comparison on spatially sparse control. The subject outlined in green is occluded by the subject outlined in blue. 2D-based methods (MagicMotion [li2025magicmotion], ToRA [zhang2025tora]) fail to handle occlusion, the U-Net-based method LeviTor [wang2025levitor] introduces artifacts, while ours accurately captures occlusion with high visual fidelity.
  • ...and 5 more figures
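
The trajectory edits described in the Figure 2 caption (sparsifying or shifting trajectories) can be sketched as simple operations on a track dictionary. This is an illustrative sketch only: the data layout (a mapping from track ID to a per-frame list of `(x, y, z)` or `None`), the function names, and the use of a constant 2D translation as the "shift" are all assumptions, not the authors' implementation.

```python
from typing import Dict, List, Optional, Set, Tuple

Point = Optional[Tuple[float, float, float]]  # None = no control at that frame
Tracks = Dict[int, List[Point]]               # track_id -> per-frame points


def spatially_sparsify(tracks: Tracks, keep_ids: Set[int]) -> Tracks:
    """Keep only the selected trajectories (spatially sparse control)."""
    return {tid: traj for tid, traj in tracks.items() if tid in keep_ids}


def temporally_sparsify(tracks: Tracks, keep_frames: Set[int]) -> Tracks:
    """Keep points only on the selected frames (temporally sparse control)."""
    return {
        tid: [pt if t in keep_frames else None for t, pt in enumerate(traj)]
        for tid, traj in tracks.items()
    }


def shift(tracks: Tracks, dx: float, dy: float) -> Tracks:
    """Translate every point in the image plane (a simple unaligned control)."""
    return {
        tid: [None if pt is None else (pt[0] + dx, pt[1] + dy, pt[2]) for pt in traj]
        for tid, traj in tracks.items()
    }
```

For example, `temporally_sparsify(tracks, {0, len_frames - 1})` with a hypothetical `len_frames` would keep motion only on the first and last frames, which corresponds to the motion-interpolation setting.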