Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance

June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, Kimin Lee

Abstract

Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve visual controllability, recent works have fine-tuned pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses the motion dynamics of generated outputs, resulting in more static videos than their T2V counterparts produce. In this work, we analyze this phenomenon and identify that it stems from premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple training-free fix to the I2V model sampling procedure that generates more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying a low-pass filter at the early stage of denoising. Extensive experiments show that ALG significantly improves the temporal dynamics of generated videos while preserving or even improving image fidelity and text alignment. For instance, on the VBench test suite, ALG achieves a 33% average improvement across models in dynamic degree while maintaining the original video quality. For additional visualizations and source code, see the project page.
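The core idea in the abstract, low-pass filtering the conditioning image only during early denoising steps, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the box-blur filter, the hard on/off switch at `t_trans`, and the default values are all assumptions chosen for illustration, and the actual schedule and filter used by ALG may differ.

```python
def box_blur(image, radius):
    """Low-pass filter a 2D grayscale image (list of rows) with a box kernel."""
    if radius <= 0:
        # No filtering: return a copy of the original image.
        return [row[:] for row in image]
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Average over the neighbourhood, clipped at the image border.
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [image[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def conditioning_for_step(image, step, num_steps, t_trans=0.1, max_radius=2):
    """Hypothetical schedule: blurred image for early steps, original afterwards.

    `t_trans` is the fraction of the sampling process that sees the filtered
    image; `max_radius` is an assumed stand-in for the filter strength.
    """
    progress = step / num_steps  # 0.0 at the first denoising step
    radius = max_radius if progress < t_trans else 0
    return box_blur(image, radius)


image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
early = conditioning_for_step(image, step=0, num_steps=50, max_radius=1)
late = conditioning_for_step(image, step=25, num_steps=50, max_radius=1)
```

In this toy example, `early` has the bright center pixel smeared across its neighbours, while `late` equals the original image, so later denoising steps still see the full detail of the reference frame.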

Paper Structure

This paper contains 24 sections, 6 equations, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overcoming suppressed motion dynamics of I2V models with ALG. I2V models achieve high image fidelity to the conditioning image, but they often fail to generate dynamic videos (first row). We refer to this issue as suppressed motion dynamics, which is caused by the high-frequency details present in the reference image. As a simple fix, applying a low-pass filter to the input image improves the motion dynamics, yet degrades the per-frame image quality and fidelity (second row). Our method, ALG, applies a low-pass filter to the conditioning image only at earlier steps, significantly enhancing the dynamic degree while preserving the image quality (third row).
  • Figure 2: Visualization of the shortcut effect in I2V generation. Intermediate feature map visualization from Wan 2.1 reveals that default I2V generation (top) exhibits a "shortcut" completion where fine-grained details in the image appear quickly (yellow dashed box), which confines the trajectory and prevents coarse structure from forming, ending up with a static video. Applying a low-pass filter (bottom) suppresses this shortcut and allows details to emerge gradually; this more flexible trajectory helps generate dynamic motion.
  • Figure 3: Low-pass filtering improves motion dynamics. (a) We plot the dynamic degree of an I2V model (Wan 2.1) when applying a low-pass filter (e.g., downsampling) to the input image. We observe that dynamic degree (the VBench metric that quantifies dynamicness) increases and aesthetic quality (the VBench metric that measures per-frame image quality) decreases as we use stronger low-pass filtering. (b) We visualize the frames when applying low-pass filtering to the input image. While the videos become more dynamic with stronger low-pass filters, this sacrifices video quality because the model receives a blurry image as input (highlighted in red).
  • Figure 4: Qualitative comparison between ALG and CFG. We provide a visual comparison between videos generated using the default image-to-video generation method (CFG) and our method (ALG). The input conditioning frames are marked with a red outline. We observe that the videos generated with ALG show more dynamic motion (e.g., larger object movement, animal movement, human action, and more complex background movement). The list of prompts and models used for each video is included in the supplementary material.
  • Figure 5: Component analysis with VBench-I2V. (a) As $t_\textrm{trans}$ increases from 0, dynamic degree increases rapidly, while quality metrics remain stable or drop slightly. This indicates that high-frequency signals prevent dynamic motions from forming in early generation steps. (b) Increasing the initial low-pass filter strength $\kappa_\ast$ shows that ALG can enhance dynamicness without significantly sacrificing video quality. (c) Both bilinear downsampling and Gaussian blur show enhanced dynamics over default I2V.
  • ...and 5 more figures
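
The captions for Figures 3 and 5 treat downsampling as one form of low-pass filtering. The following sketch shows why, under simplifying assumptions: it uses block averaging followed by nearest-neighbour upsampling as a stand-in for the bilinear downsampling named in the caption, and the function name `downsample_upsample` is hypothetical.

```python
def downsample_upsample(image, factor):
    """Average-pool a 2D image by `factor`, then upsample back by repetition."""
    h, w = len(image), len(image[0])
    assert h % factor == 0 and w % factor == 0, "size must be divisible by factor"
    # Average each factor x factor block; this discards fine detail.
    pooled = [
        [
            sum(image[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            / (factor * factor)
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]
    # Nearest-neighbour upsampling back to the original size.
    return [
        [pooled[y // factor][x // factor] for x in range(w)]
        for y in range(h)
    ]


# A checkerboard is the highest-frequency pattern a pixel grid can hold.
checker = [[(x + y) % 2 for x in range(4)] for y in range(4)]
# A constant image contains only the lowest (zero) frequency.
flat = [[3.0] * 4 for _ in range(4)]

smoothed = downsample_upsample(checker, 2)
unchanged = downsample_upsample(flat, 2)
```

The checkerboard collapses to a uniform 0.5 image, while the flat image passes through unchanged: high-frequency content is removed and low-frequency content is kept, which is the behaviour a low-pass filter is expected to have.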