FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation

Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, Li Yuan

TL;DR

FlashI2V addresses conditional image leakage in Image-to-Video generation with latent shifting: instead of concatenating the full conditional image with the noisy latents, it shifts the flow-matching source and target distributions by a learnable projection of the conditional image latents, so the condition enters implicitly. It also uses high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and to adjust the level of detail in the generated video. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P, while reducing color inconsistency and slow motion. Among the I2V paradigms compared, it reports the best generalization and performance on out-of-domain data.

Abstract

In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Project page: https://pku-yuangroup.github.io/FlashI2V/
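
The latent shifting step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a linear flow-matching path with t = 0 at the data and t = 1 at the noise, and it uses a plain array `cond_feat` as a stand-in for the output of the learnable projection. The function names `shift_latents` and `euler_sample` are hypothetical.

```python
import numpy as np

def shift_latents(x0, noise, cond_feat, t):
    """Linear flow-matching interpolation with both endpoints shifted.

    Assumption: the condition is subtracted from both the data latents (target)
    and the noise (source); cond_feat stands in for the learnable projection.
    """
    target = x0 - cond_feat       # shifted target distribution sample
    source = noise - cond_feat    # shifted source distribution sample
    x_t = (1.0 - t) * target + t * source
    velocity = source - target    # equals noise - x0, the shift cancels
    return x_t, velocity

def euler_sample(velocity_fn, noise, cond_feat, steps=10):
    """Start from the shifted noise, integrate the ODE from t=1 to t=0, undo the shift."""
    x = noise - cond_feat
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x + cond_feat

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 5, 8, 8))     # hypothetical video latents (C, T, H, W)
noise = rng.normal(size=x0.shape)
cond_feat = rng.normal(size=(4, 1, 8, 8))  # one conditional frame, broadcast over T

# An oracle velocity replaces the trained denoiser for this toy check.
_, v = shift_latents(x0, noise, cond_feat, t=0.5)
recovered = euler_sample(lambda x, t: v, noise, cond_feat)
```

Under these assumptions the regression target is the same as in unconditional flow matching, and the condition is visible to the model only through the shifted intermediate state.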

Paper Structure

This paper contains 25 sections, 20 equations, 9 figures, 4 tables, 7 algorithms.

Figures (9)

  • Figure 1: Conditional image leakage. (a) Conditional image leakage causes performance degradation issues, where the videos are sampled from Wan2.1-I2V-14B-480P with Vbench-I2V text-image pairs. (b) In the existing I2V paradigm, we observe that chunk-wise FVD on in-domain data increases over time, while chunk-wise FVD on out-of-domain data remains consistently high, indicating that the law learned on in-domain data by the existing paradigm fails to generalize to out-of-domain data.
  • Figure 2: Method overview. We extract features from the conditional image latents using a learnable projection, followed by the latent shifting to obtain a renewed intermediate state that implicitly contains the condition. Simultaneously, the conditional image latents undergo the Fourier Transform to extract high-frequency magnitude features as guidance, which are concatenated with noisy latents and injected into DiT. During inference, we begin with the shifted noise and progressively denoise following the ODE, ultimately decoding the video.
  • Figure 3: Method Comparison. We compare the quantitative performance of FlashI2V (1.3B) with CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. We observe that CogVideoX1.5 and Wan2.1 exhibit color inconsistency. Additionally, Wan2.1 tends to produce extremely slow-motion or even static videos. Thanks to the avoidance of conditional image leakage, FlashI2V effectively resolves these performance degradation issues.
  • Figure 4: Ablation Study. (a) Comparing the chunk-wise FVD variation patterns of different I2V paradigms on both the training and validation sets, it is observed that only FlashI2V exhibits the same time-increasing FVD variation pattern in both sets. This suggests that only FlashI2V is capable of applying the generation law learned from in-domain data to out-of-domain data. Additionally, FlashI2V has the lowest out-of-domain FVD, demonstrating its performance advantage. (b) From the training loss, we can observe that Fourier guidance accelerates the convergence of latent shifting. (c) Fourier guidance alone causes color deviation, while latent shifting alone leads to mismatched details. FlashI2V achieves consistency in both color and details.
  • Figure 5: Analysis of latent shifting and Fourier guidance. (a) As training progresses, $\boldsymbol{\phi}(\cdot)$ gradually emphasizes the detailed information in the conditional image. (b) When a lower cutoff frequency percentile is used, more high-frequency information is injected. When the cutoff frequency percentile is set to 0.1, the graphical text at the end of the video remains unchanged, while with the cutoff frequency percentile set to 0.9, the graphical text becomes unrecognizable.
  • ...and 4 more figures
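
The cutoff frequency percentile discussed in Figure 5 can be illustrated with a small NumPy sketch. It makes several assumptions that the text above does not confirm: a 2D FFT over the spatial axes of a single latent frame, a radial frequency ordering, and a feature defined as the spectrum magnitude with all bins below the cutoff set to zero. The function name `high_freq_magnitude` is hypothetical.

```python
import numpy as np

def high_freq_magnitude(latent, cutoff_percentile=0.5):
    """Magnitude spectrum of a (C, H, W) latent with low frequencies zeroed.

    Assumption: bins whose radial frequency falls below the given quantile of
    all radii are discarded, so a lower percentile keeps more of the spectrum.
    """
    _, h, w = latent.shape
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    fy = np.fft.fftfreq(h)[:, None]        # vertical frequencies, shape (H, 1)
    fx = np.fft.fftfreq(w)[None, :]        # horizontal frequencies, shape (1, W)
    radius = np.sqrt(fy ** 2 + fx ** 2)    # radial frequency per bin, shape (H, W)
    cutoff = np.quantile(radius, cutoff_percentile)
    mask = radius >= cutoff                # keep only bins at or above the cutoff
    return np.abs(spectrum) * mask

rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 8, 8))        # hypothetical conditional image latent
more_detail = high_freq_magnitude(latent, cutoff_percentile=0.1)
less_detail = high_freq_magnitude(latent, cutoff_percentile=0.9)
```

With this definition, a percentile of 0.1 retains most frequency bins and 0.9 retains only the highest ones, which matches the direction described in the caption: a lower percentile injects more high-frequency information.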