Zo3T: Zero-Shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training

Ruicheng Zhang, Jun Zhou, Zunnan Xu, Zihao Liu, Jiehui Huang, Mingyang Zhang, Yu Sun, Xiu Li

TL;DR

Zo3T tackles zero-shot trajectory-guided image-to-video generation by combining a 3D-aware trajectory projection with a test-time training regime that co-adapts the latent state and a lightweight LoRA adapter. It further refines the denoising path through Guidance Field Rectification, a one-step lookahead optimization, and preserves high-frequency details via Fourier Orthogonal Recomposition, applied selectively during early denoising steps. The method reports better motion fidelity and visual quality than both training-based and training-free baselines, while enabling flexible object and camera trajectory control at high resolution without fine-tuning. Together, these contributions make zero-shot I2V generation more practical and physically plausible, with stronger 3D realism.

Abstract

Trajectory-guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network's noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations. First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects ephemeral LoRA adapters into the denoising network and optimizes them alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising evolutionary path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.
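
To make the idea of a perspective-correct affine transformation concrete, the following is a minimal sketch and not the paper's actual formulation, which this summary does not give. It assumes a simple pinhole camera, under which apparent size is inversely proportional to depth, so a region moved from depth `depth_src` to depth `depth_dst` is scaled by `depth_src / depth_dst` about its center and then translated to the target point. The function names `perspective_affine` and `apply_affine` and all parameters are hypothetical.

```python
import numpy as np

def perspective_affine(box, target_center, depth_src, depth_dst):
    """Return a 2x3 affine matrix that moves a box to a new center with depth-aware scaling.

    box: (x0, y0, x1, y1) in pixels; target_center: (cx, cy) in pixels.
    depth_src / depth_dst: positive scene depths at the source and target positions
    (assumed to come from some monocular depth estimate; hypothetical inputs).
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Pinhole assumption: apparent size is inversely proportional to depth.
    s = depth_src / depth_dst
    tx, ty = target_center
    # p' = s * (p - c) + t, written as a 2x3 matrix acting on [x, y, 1].
    return np.array([[s, 0.0, tx - s * cx],
                     [0.0, s, ty - s * cy]])

def apply_affine(A, points):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    pts = np.asarray(points, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]

# A box centered at (20, 30) moves to (50, 60) while receding from depth 2 to depth 4,
# so it should shrink to half its size around the new center.
A = perspective_affine((10, 20, 30, 40), (50, 60), depth_src=2.0, depth_dst=4.0)
corners = apply_affine(A, [(10, 20), (30, 40)])
```

A purely 2D approach would translate the box without the scale factor `s`, which is the kind of perspective-neglecting motion the abstract attributes to earlier zero-shot methods.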

Paper Structure

This paper contains 26 sections, 7 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Versatile Trajectory Control with Our Method. Given a set of bounding boxes with corresponding trajectories, our framework enables precise control over diverse object and camera motions. By leveraging the inherent knowledge of a pre-trained video diffusion model, we achieve zero-shot trajectory guidance without any fine-tuning.
  • Figure 2: Advantages of our Method over prior works.
  • Figure 3: An overview of our zero-shot trajectory-guided video generation framework. Our method optimizes a pre-trained video diffusion model at specific denoising timesteps via two key stages. First, Test-Time Training (TTT) adapts the latent state and an ephemeral adapter to maintain semantic consistency along the trajectory. Second, Guidance Field Rectification refines the denoising direction using a one-step lookahead optimization to ensure precise path execution.
  • Figure 4: Qualitative Comparison with SOTA Methods.
  • Figure 5: User Study. The majority of participants preferred the results obtained by our method over both training-free and training-based methods, attributing this preference to its better trajectory alignment and more natural motion generation.
  • ...and 5 more figures
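
The "ephemeral adapter" in the Figure 3 caption refers to a LoRA adapter that is injected at test time. As a rough illustration of the general LoRA mechanism only, the sketch below wraps a frozen weight matrix with a low-rank update `B @ A`. It is not the paper's implementation: the class name `LoRALinear`, the rank, the scaling, and the initialization are assumptions, and the summary does not say which layers are adapted or how the regional feature consistency loss is computed.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank residual (generic LoRA sketch)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                      # pre-trained weights, never modified
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))        # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        # Low-rank update: x -> scale * B (A x); only A and B would be optimized.
        delta = (x @ self.A.T) @ self.B.T
        return base + self.scale * delta

W = np.eye(3)
x = np.array([[1.0, 2.0, 3.0]])
layer = LoRALinear(W, rank=2)
before = layer(x)                 # identical to the frozen layer's output
layer.B[:] = 0.5                  # stand-in for a test-time optimization step
after = layer(x)                  # now differs from the base output
restored = x @ W.T                # discarding the adapter recovers the base model
```

Because the base weights are never written to, dropping `A` and `B` after a video is generated returns the model to its pre-trained state, which is one plausible reading of "ephemeral" here.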