FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models

Haonan Qiu, Zhaoxi Chen, Zhouxia Wang, Yingqing He, Menghan Xia, Ziwei Liu

TL;DR

Diffusion-based video generation enables rich motion, but explicit trajectory control typically requires training. This work reveals that both initial noises and internal attention computations encode trajectory information, and introduces FreeTraj, a tuning-free framework that guides noise construction and attention to achieve trajectory control without retraining. By coupling noise guidance with carefully designed attention edits and extending to longer and larger videos via integration with FreeNoise, FreeTraj delivers improved trajectory controllability while maintaining competitive video quality. Quantitative and user studies demonstrate stronger trajectory alignment and favorable perception, establishing a practical, training-free path to fine-grained motion control in open-domain video synthesis.

Abstract

Diffusion models have demonstrated remarkable capability in video generation, which further sparks interest in introducing trajectory control into the generation process. While existing works mainly focus on training-based methods (e.g., conditional adapters), we argue that the diffusion model itself allows decent control over the generated content without requiring any training. In this study, we introduce a tuning-free framework to achieve trajectory-controllable video generation, by imposing guidance on both noise construction and attention computation. Specifically, 1) we first show several instructive phenomena and analyze how initial noises influence the motion trajectory of generated content. 2) Subsequently, we propose FreeTraj, a tuning-free approach that enables trajectory control by modifying noise sampling and attention mechanisms. 3) Furthermore, we extend FreeTraj to facilitate longer and larger video generation with controllable trajectories. Equipped with these designs, users have the flexibility to provide trajectories manually or opt for trajectories automatically generated by an LLM trajectory planner. Extensive experiments validate the efficacy of our approach in enhancing the trajectory controllability of video diffusion models.
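The noise-construction guidance above can be illustrated with a small sketch: per-frame initial noise whose low-frequency band is shared across frames while the high frequencies are resampled independently. This is a minimal NumPy illustration of the general idea, not the paper's implementation; the function name `shared_low_freq_noise` and the `cutoff` parameter (the fraction of the spectrum treated as low frequency) are hypothetical.

```python
import numpy as np

def shared_low_freq_noise(num_frames, height, width, cutoff=0.25, seed=0):
    """Sketch (not the paper's exact method): per-frame Gaussian noise
    whose low-frequency components are shared across all frames."""
    rng = np.random.default_rng(seed)
    # Frequency grid; the mask is symmetric under frequency negation,
    # so the mixed spectrum stays Hermitian and the inverse FFT is real.
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    low = (np.abs(fy) < cutoff / 2) & (np.abs(fx) < cutoff / 2)

    # One noise map whose low-frequency band is reused by every frame.
    shared_f = np.fft.fft2(rng.standard_normal((height, width)))

    frames = []
    for _ in range(num_frames):
        # Fresh noise per frame supplies the (resampled) high frequencies.
        indep_f = np.fft.fft2(rng.standard_normal((height, width)))
        mixed_f = np.where(low, shared_f, indep_f)
        frames.append(np.fft.ifft2(mixed_f).real)
    return np.stack(frames)

noise = shared_low_freq_noise(num_frames=4, height=32, width=32)
print(noise.shape)  # (4, 32, 32)
```

To encode a trajectory, the shared low-frequency patch would be shifted frame by frame along the target bounding boxes; the resampling percentage discussed in Figure 1 corresponds to how much of the high-frequency band is drawn fresh per frame.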

Paper Structure

This paper contains 23 sections, 15 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Noise resampling of initial high-frequency components. Gradually increasing the proportion of resampled high-frequency information in the frame-wise shared noises can significantly reduce artifacts in the generated video. However, this also leads to a gradual loss of trajectory control ability. A resampling percentage of 75% strikes a better balance between maintaining control and improving the quality of the generated video.
  • Figure 2: Trajectory control via frame-wise shared low-frequency noise. The success cases on the left demonstrate that the moving objects in the generated videos can be roughly controlled by sharing low-frequency noise across the bounding boxes of the given trajectory. However, the precision of control and the success rate remain somewhat constrained, as evidenced by the failure instances on the right.
  • Figure 3: An overview of FreeTraj. Our framework mainly contains two parts: guidance in noise and guidance in attention. For noise, we inject the target trajectory into the low-frequency part. For attention, we design different reweighting strategies according to the expected behaviors in different attention layers. Here $\mathcal{S}$, $M_{C A}$, $M_{S A}$, and $M_{T A}$ are different attention masks.
  • Figure 4: Qualitative comparison of trajectory control. We compare our proposed FreeTraj with direct inference (Direct), Peekaboo (Peek), and TrailBlazer (TraB). FreeTraj successfully generates high-fidelity results and is more accurate for trajectory control.
  • Figure 5: Ablation results. (a) No noise guidance, (b) no high-frequency noise resampling, (c) hard attention mask, and (d) our whole method.
  • ...and 4 more figures
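The attention guidance sketched in Figure 3, and the hard-versus-soft mask ablation in Figure 5, can be illustrated with a toy single-head attention that reweights logits toward key positions inside the target bounding box. This is a hedged sketch of the general masking idea, not FreeTraj's exact formulation; the function name `masked_attention` and the additive `weight` bias are hypothetical.

```python
import numpy as np

def masked_attention(q, k, v, inside, weight=2.0):
    """Sketch: single-head attention with a soft additive bias that
    boosts keys inside a target region. `inside` is a boolean mask over
    key positions. A hard mask would instead set outside logits to -inf,
    which Figure 5 suggests harms quality compared to soft reweighting."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)              # scaled dot-product scores
    logits = logits + np.where(inside, weight, 0.0)  # soft region bias
    # Numerically stable softmax over keys.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

# Usage: queries drift toward values at "inside" positions as weight grows.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
inside = np.array([True, False, False, False])
out = masked_attention(q, k, v, inside, weight=2.0)
print(out.shape)  # (4, 8)
```

In a video diffusion U-Net, separate masks of this kind would be applied in the cross-, spatial-, and temporal-attention layers ($M_{CA}$, $M_{SA}$, $M_{TA}$ in Figure 3), each biasing attention toward the trajectory's bounding box in that frame.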