PoseTraj: Pose-Aware Trajectory Control in Video Diffusion
Longbin Ji, Lei Zhong, Pengfei Wei, Changjian Li
TL;DR
PoseTraj tackles the challenge of generating videos in which objects follow rotational trajectories that change their 6D pose. It introduces a two-stage pose-aware pretraining regime and a synthetic dataset, PoseTraj-10K, with precise 3D bounding boxes to teach 3D pose understanding, followed by camera-disentangled finetuning to adapt to real-world videos. The approach uses a Traj-ControlNet built on a latent diffusion model, with an injection-by-reconstruction strategy that uses 3D bounding-box supervision as an intermediate signal, which can be removed at inference. Experiments on VIPSeg and DAVIS show state-of-the-art trajectory-following accuracy and video quality, with strong robustness to camera motion and rotational dynamics, highlighting the method's practical potential for controllable video generation.
Abstract
Trajectory-guided video generation has made notable progress in recent years. However, existing models still struggle to generate object motions whose 6D poses change under wide-range rotations, owing to their limited 3D understanding. To address this problem, we introduce PoseTraj, a pose-aware video dragging model that generates 3D-aligned motion from 2D trajectories. Our method adopts a novel two-stage pose-aware pretraining framework that improves 3D understanding across diverse trajectories. Specifically, we introduce PoseTraj-10K, a large-scale synthetic dataset containing 10k videos of objects following rotational trajectories, and enhance the model's perception of object pose changes by incorporating 3D bounding boxes as intermediate supervision signals. We then fine-tune the trajectory-controlling module on real-world videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark datasets demonstrate that our method not only excels in 3D pose-aligned dragging for rotational trajectories but also outperforms existing baselines in trajectory accuracy and video quality.
