PoseTraj: Pose-Aware Trajectory Control in Video Diffusion

Longbin Ji, Lei Zhong, Pengfei Wei, Changjian Li

TL;DR

PoseTraj tackles the challenge of generating videos where objects follow rotational trajectories that induce changes in 6D pose. It introduces a two-stage pose-aware pretraining regime and a synthetic PoseTraj-10K dataset with precise 3D bounding boxes to teach 3D pose understanding, followed by camera-disentangled finetuning to adapt to real-world videos. The approach leverages a Traj-ControlNet built on a latent diffusion model, with an injection-by-reconstruction strategy that uses 3D bbox supervision as an intermediate signal and can remove this signal during inference. Experiments on VIPSeg and DAVIS show state-of-the-art trajectory-following accuracy and video quality, with strong robustness to camera motion and rotational dynamics, highlighting the method's practical potential for controllable video generation.

Abstract

Recent advancements in trajectory-guided video generation have achieved notable progress. However, existing models still face challenges in generating object motions with potentially changing 6D poses under wide-range rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, a pose-aware video dragging model for generating 3D-aligned motion from 2D trajectories. Our method adopts a novel two-stage pose-aware pretraining framework, improving 3D understanding across diverse trajectories. Specifically, we propose a large-scale synthetic dataset, PoseTraj-10K, containing 10k videos of objects following rotational trajectories, and enhance the model's perception of object pose changes by incorporating 3D bounding boxes as intermediate supervision signals. Following this, we fine-tune the trajectory-controlling module on real-world videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark datasets demonstrate that our method not only excels in 3D pose-aligned dragging for rotational trajectories but also outperforms existing baselines in trajectory accuracy and video quality.
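
To make the notion of a rotational 2D trajectory concrete, the following is a minimal illustrative sketch and not the authors' data pipeline. It samples points along a circular arc and computes the tangent heading at each point, which shows why an object that follows such a path is expected to change orientation as it moves. The function name `sample_arc_trajectory`, the 14-frame length, and the pixel coordinates are all assumptions chosen for illustration.

```python
import numpy as np

def sample_arc_trajectory(center, radius, start_angle, sweep, num_frames):
    """Sample 2D points on a circular arc and the tangent heading at each point.

    Hypothetical helper for illustration only; not part of PoseTraj.
    Angles are in radians; `sweep` is the total signed angle covered.
    """
    angles = start_angle + np.linspace(0.0, sweep, num_frames)
    points = np.stack(
        [center[0] + radius * np.cos(angles),
         center[1] + radius * np.sin(angles)],
        axis=1,
    )
    # The tangent of a circle is perpendicular to the radius:
    # angle + 90 degrees for a positive sweep, angle - 90 degrees for a negative one.
    headings = angles + np.sign(sweep) * np.pi / 2.0
    return points, headings

# Assumed example values: a quarter-circle arc over 14 frames.
points, headings = sample_arc_trajectory(
    center=(128.0, 128.0), radius=64.0,
    start_angle=0.0, sweep=np.pi / 2.0, num_frames=14,
)
total_turn = headings[-1] - headings[0]  # heading change implied by the path
```

In this sketch the heading turns by exactly the swept angle, so a purely 2D drag along the arc already implies a rotation of the object that a pose-unaware model has no explicit signal for.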

Paper Structure

This paper contains 24 sections, 4 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: PoseTraj produces plausible dragging videos, where objects follow a rotational trajectory with awareness of changing poses.
  • Figure 2: The video generation performance of DragAnything vs. ours under complex trajectories on static and dynamic objects. The yellow dashed line indicates the boat's orientation and helps perceive object rotation, while the orange circle highlights visual defects.
  • Figure 3: Data construction pipeline of our synthetic dataset PoseTraj-10K. The whole configuration, including environment setup, object sampling, and trajectory sampling, is displayed.
  • Figure 4: Method overview. PoseTraj first applies two-stage pose-aware pretraining on our synthetic dataset to gain the 3D awareness needed to follow rotational trajectories, and then uses camera-disentangled finetuning to adapt this ability to open-domain videos. The dashed colored arrows indicate the data flows specific to each of the three training stages, while the black arrows are shared across all stages. All yellow blocks are trainable, while other blocks are frozen.
  • Figure 5: Visual comparison. Given the same trajectory and initial image, our method produces plausible video frames in which the moving object follows the given trajectory. In contrast, DragNUWA and DragAnything either introduce unexpected camera motions or fail to preserve the object's identity, causing severe collapse.
  • ...and 8 more figures