FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
Zhiyuan Zhang, Can Wang, Dongdong Chen, Jing Liao
TL;DR
FlexTraj addresses controllability in diffusion-based image-to-video generation by introducing a unified point-trajectory representation that encodes each point as $p_i^t = (x_i^t, y_i^t, z_i^t, s_i, u_i, a_i)$. It projects trajectories into two conditioning videos, $V_{ID}$ and $V_{Color}$, processed by a pretrained video VAE to produce conditioning tokens, which are injected into a diffusion backbone via an efficient sequence-concatenation strategy with LoRA adaptation and a causal mask. A density and alignment annealing curriculum trains the model from complete to incomplete and finally unaligned supervision, enabling robust performance across dense, sparse, and unaligned inputs. Experiments on DAVIS and FlexBench demonstrate superior trajectory control (low TrajErr, high TrajSIM) while maintaining competitive video quality, enabling practical applications in motion cloning, interpolation, camera redirection, and mesh animation.
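The sequence-concatenation injection with a causal mask can be illustrated with a minimal NumPy sketch. It rests on assumptions that the summary above does not confirm: the conditioning tokens are appended after the video tokens along the sequence axis, and "causal" is read as a block mask in which video tokens attend to everything while conditioning tokens attend only to each other. The names `build_block_mask` and `masked_attention` are hypothetical and are not taken from the FlexTraj code. LoRA adaptation is omitted.

```python
import numpy as np


def build_block_mask(n_video: int, n_cond: int) -> np.ndarray:
    """Boolean mask over the concatenated sequence [video | condition].

    mask[q, k] is True when query token q may attend to key token k.
    """
    n = n_video + n_cond
    mask = np.ones((n, n), dtype=bool)
    # Assumption: condition queries are blocked from video keys, so the
    # condition branch never depends on the (noisy) video tokens.
    mask[n_video:, :n_video] = False
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; every row keeps at
    # least one allowed key, so the maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def attend(video_tokens, cond_tokens):
    """Concatenate along the sequence axis and run masked self-attention."""
    tokens = np.concatenate([video_tokens, cond_tokens], axis=0)
    mask = build_block_mask(len(video_tokens), len(cond_tokens))
    return masked_attention(tokens, tokens, tokens, mask)
```

Under this reading, the outputs at the conditioning positions do not change when the video tokens change, which is one way such a mask could allow the conditioning branch to be computed once and reused across denoising steps.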
Abstract
We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while remaining robust under unaligned conditions. To train this unified, point-trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned conditions. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting applications such as motion cloning, drag-based image-to-video generation, motion interpolation, camera redirection, flexible action control, and mesh animation.
