SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Koichi Namekata, Sherwin Bahmani, Ziyi Wu, Yash Kant, Igor Gilitschenski, David B. Lindell

TL;DR

SG-I2V tackles the challenge of controllable image-to-video generation without fine-tuning by introducing a self-guided framework that relies on semantically aligned features within a pre-trained diffusion model. It aligns cross-frame feature representations via a modified self-attention mechanism, then optimizes the latent input to enforce trajectory-consistent motion inside user-defined bounding boxes, accompanied by a high-frequency-preserving post-processing step. The approach achieves zero-shot object and camera motion control with competitive visual quality and motion fidelity on VIPSeg, narrowing the gap to supervised baselines. This work highlights how internal representations of image-to-video diffusion models can be exploited for intuitive, annotation-free motion control in video synthesis.

Abstract

Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided, offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while significantly narrowing down the performance gap with supervised models in terms of visual quality and motion fidelity. Additional details and video results are available on our project page: https://kmcode1.github.io/Projects/SG-I2V

Paper Structure

This paper contains 32 sections, 2 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Image-to-video generation based on self-guided trajectory control. Given a set of bounding boxes with associated trajectories, we achieve object and camera motion control in image-to-video generation by leveraging the knowledge present in a pre-trained image-to-video diffusion model. Our method is self-guided, offering zero-shot trajectory control without fine-tuning or relying on external knowledge.
  • Figure 2: Semantic correspondences in video diffusion models. We analyze feature maps in the image-to-video diffusion model SVD (Blattmann et al., 2023) for three generated video sequences (row 1). We use PCA to visualize the features at diffusion timestep 30 (out of 50) at the output of an upsampling block (row 2), a self-attention layer (row 3), and the same self-attention layer after our alignment procedure (row 4). Although output feature maps of upsampling blocks in image diffusion models are known to encode semantic information (Tang et al., 2023), we only observe weak semantic correspondences across frames in SVD. Thus, we focus on the self-attention layer and modify it to produce feature maps that are semantically aligned across frames.
  • Figure 3: Overview of the controllable image-to-video generation framework. To control trajectories of scene elements, we optimize the latent $\bm{z}_t$ at specific denoising timesteps $t$ of a pre-trained video diffusion model. First, we extract semantically aligned feature maps from the denoising U-Net to estimate the video layout. Next, we enforce cross-frame feature similarity along the bounding box trajectory to drive the motion of each region. To preserve the visual quality of the generated video, a frequency-based post-processing method is applied to retain high-frequency noise of the original latent $\bm{z}_t$. The updated latent $\tilde{\bm{z}}_t$ is input to the next denoising step.
  • Figure 4: Failure cases in supervised baselines. We observe that DragNUWA tends to distort objects rather than move them, and DragAnything is weak at part-level control as it is designed for entity-level control. In contrast, our method can generate videos with natural motion for diverse object and camera trajectories. Please see https://kmcode1.github.io/Projects/SG-I2V#baseline-comparison for additional comparisons.
  • Figure 5: Performance across U-Net feature maps used to compute the loss in the feature-optimization equation (eq:feature-optimization). For all metrics, lower values are better. Temporal and spatial refer to the temporal and spatial self-attention layers. We find that features extracted from self-attention layers generally perform better than those from upsampling blocks and temporal attention layers. In addition, using the feature maps of our modified self-attention layer achieves the best results, since they are semantically aligned across frames. Corresponding qualitative visuals are presented in the figure labeled fig:results_attn and at https://kmcode1.github.io/Projects/SG-I2V#ablation-unet-featuremap
  • ...and 11 more figures
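
The two steps named in the Figure 3 caption (cross-frame feature similarity along a bounding box trajectory, and frequency-based post-processing that retains the high-frequency part of the original latent) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the mean-squared-difference loss against the first frame, the hard radial low-pass mask, and the `cutoff` value are all assumptions made for illustration. The real method optimizes the latent through the denoising U-Net with gradients, which is omitted here.

```python
import numpy as np


def box_similarity_loss(features, boxes):
    """Mean squared difference between each frame's box features and frame 0's.

    features: array of shape (frames, height, width, channels).
    boxes: one (top, left, box_height, box_width) tuple per frame (at least
           two frames); all boxes share one size so crops compare element-wise.
    Assumption: frame 0 is the reference and the distance is squared error.
    """
    top0, left0, box_h, box_w = boxes[0]
    reference = features[0, top0:top0 + box_h, left0:left0 + box_w]
    total = 0.0
    for frame in range(1, len(boxes)):
        top, left, h, w = boxes[frame]
        if (h, w) != (box_h, box_w):
            raise ValueError("all boxes must have the same size")
        crop = features[frame, top:top + h, left:left + w]
        total += float(np.mean((crop - reference) ** 2))
    return total / (len(boxes) - 1)


def keep_original_high_freq(z_original, z_optimized, cutoff=0.25):
    """Take low frequencies from z_optimized and high ones from z_original.

    Both arrays share a shape whose last two axes are spatial. `cutoff` is a
    hypothetical radius in cycles per sample (0.5 is the Nyquist limit).
    """
    height, width = z_original.shape[-2:]
    freq_y = np.fft.fftfreq(height)[:, None]
    freq_x = np.fft.fftfreq(width)[None, :]
    low_pass = np.sqrt(freq_y ** 2 + freq_x ** 2) <= cutoff
    spec_original = np.fft.fft2(z_original, axes=(-2, -1))
    spec_optimized = np.fft.fft2(z_optimized, axes=(-2, -1))
    mixed = np.where(low_pass, spec_optimized, spec_original)
    # The mask is symmetric under frequency negation, so the inverse
    # transform of real inputs is real up to rounding error.
    return np.fft.ifft2(mixed, axes=(-2, -1)).real
```

In this sketch the loss is zero when the same feature patch appears at every box position along the trajectory, and it grows when the content inside the boxes diverges from the first frame. The frequency step leaves a latent unchanged if no optimization was applied, since both spectra are then identical.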