AtomoVideo: High Fidelity Image-to-Video Generation

Litong Gong, Yiran Zhu, Weijie Li, Xiaoyang Kang, Biao Wang, Tiezheng Ge, Bo Zheng

TL;DR

This work introduces AtomoVideo, a high-fidelity image-to-video framework that preserves fidelity to a reference image while enabling expressive motion. It achieves this by injecting image information at both low-level and high-level channels, using a fixed text-to-image backbone augmented with temporal modules, and training only these added components. The approach supports long-sequence generation via iterative frame prediction and remains compatible with personalized models through adapter-based conditioning. Quantitative and qualitative evaluations show strong image fidelity and superior motion, with notable stability across diverse scenarios, highlighting its potential for controllable, high-quality I2V generation.

Abstract

Video generation has recently advanced rapidly, building on strong text-to-image generation techniques. In this work, we propose a high-fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. In addition, thanks to high-quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long-sequence prediction through iterative generation. Furthermore, owing to the adapter-training design, our approach combines well with existing personalized models and controllable modules. In quantitative and qualitative evaluations, AtomoVideo achieves superior results compared to popular methods. More examples can be found on our project website: https://atomo-video.github.io/.

Paper Structure (12 sections, 1 equation, 8 figures, 1 table)

This paper contains 12 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: Given a reference image and prompt, AtomoVideo can generate vivid videos while maintaining high-fidelity detail consistent with the given image.
  • Figure 2: The framework of our image-to-video method. During training, only the temporal and input layers are trained. During testing, the noise latent is sampled from a Gaussian distribution without any reference image prior.
  • Figure 3: Illustration of video prediction. Given a sequence of $L$ video frames, the subsequent $T-L$ frames are predicted by adapting only the input layer, with no other adjustment to the model. Here $T$ denotes the maximum number of frames supported by the model.
  • Figure 4: Sample comparison with other methods. We compare against SVD [blattmann2023stable], Pika [pika], and Gen-2 [gen2]; AtomoVideo maintains better stability and greater motion intensity.
  • Figure 5: More samples with $512\times512$ size.
  • ...and 3 more figures
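
The iterative prediction described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: `predict_window` is a hypothetical stand-in for the model (it takes $L$ context frames and returns $T-L$ new ones), frames are represented as plain integers, and the choice to reuse the last $L$ generated frames as the next context is an assumption.

```python
from typing import Callable, List

Frame = int  # hypothetical stand-in for a real frame or latent


def predict_window(context: List[Frame], T: int) -> List[Frame]:
    """Toy stand-in for the model: given L context frames, return T - L new frames.

    A real model would denoise T frames conditioned on the context; here each
    new "frame" is simply the previous value plus one.
    """
    L = len(context)
    last = context[-1]
    return [last + i for i in range(1, T - L + 1)]


def generate_long(
    first_frames: List[Frame],
    total: int,
    T: int,
    L: int,
    predict: Callable[[List[Frame], int], List[Frame]],
) -> List[Frame]:
    """Extend a sequence to `total` frames by repeatedly predicting T - L frames.

    Assumption: each round conditions on the last L frames produced so far.
    """
    if not 0 < L < T:
        raise ValueError("need 0 < L < T so that every round adds frames")
    if len(first_frames) < L:
        raise ValueError("need at least L starting frames")

    frames = list(first_frames)
    while len(frames) < total:
        context = frames[-L:]               # last L frames act as the condition
        frames.extend(predict(context, T))  # append the T - L predicted frames
    return frames[:total]                   # trim any overshoot from the last round


long_video = generate_long([0, 1], total=10, T=4, L=2, predict=predict_window)
```

Each round adds $T-L$ frames, so reaching `total` frames from $L$ starting frames takes about $\lceil (\text{total} - L)/(T - L) \rceil$ calls to the model. A smaller $L$ yields more new frames per call but gives the model less context to keep motion consistent.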