AtomoVideo: High Fidelity Image-to-Video Generation
Litong Gong, Yiran Zhu, Weijie Li, Xiaoyang Kang, Biao Wang, Tiezheng Ge, Bo Zheng
TL;DR
This work introduces AtomoVideo, a high-fidelity image-to-video (I2V) framework that preserves fidelity to a reference image while enabling expressive motion. It achieves this by injecting image information at both low and high levels, using a fixed text-to-image backbone augmented with temporal modules, and training only these added components. The approach supports long-sequence generation via iterative frame prediction and remains compatible with personalized models through adapter-based conditioning. In quantitative and qualitative evaluations, it shows strong image fidelity, greater motion intensity, and stable results across diverse scenarios, highlighting its potential for controllable, high-quality I2V generation.
Abstract
Recently, video generation has developed rapidly, building on superior text-to-image generation techniques. In this work, we propose a high-fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. In addition, thanks to high-quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long-sequence prediction through iterative generation. Furthermore, due to the design of adapter training, our approach can be combined well with existing personalized models and controllable modules. In quantitative and qualitative evaluations, AtomoVideo achieves superior results compared to popular methods. More examples can be found on our project website: https://atomo-video.github.io/.
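
The iterative generation mentioned above can be illustrated with a minimal sketch. It is not the AtomoVideo implementation: the predictor interface (`predict_next`), the context window size (`context_len`), and the toy integer "frames" are all assumptions made for illustration. The only idea it shows is that a long sequence can be built by repeatedly conditioning a short-clip predictor on the most recently generated frames.

```python
from typing import Callable, List, Sequence, TypeVar

F = TypeVar("F")  # stand-in for a frame (e.g. an image or latent tensor)


def generate_long_video(
    first_frame: F,
    predict_next: Callable[[Sequence[F]], List[F]],  # hypothetical predictor
    total_frames: int,
    context_len: int = 8,  # hypothetical number of conditioning frames
) -> List[F]:
    """Build a long sequence by repeatedly predicting from recent frames."""
    if total_frames < 1 or context_len < 1:
        raise ValueError("total_frames and context_len must be positive")

    video: List[F] = [first_frame]
    while len(video) < total_frames:
        # Condition each round on the most recent frames only.
        context = video[-context_len:]
        new_frames = predict_next(context)
        if not new_frames:
            raise RuntimeError("predictor returned no frames")
        video.extend(new_frames)

    # The last round may overshoot; trim to the requested length.
    return video[:total_frames]


def toy_predictor(context: Sequence[int]) -> List[int]:
    """Toy stand-in for a video model: continues an integer sequence."""
    last = context[-1]
    return [last + 1, last + 2, last + 3]


frames = generate_long_video(0, toy_predictor, total_frames=10, context_len=2)
```

In a real system, `predict_next` would wrap a call to the video model and frames would be tensors; how many frames to condition on, and how they are injected, is a design choice this sketch leaves open.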
