DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance

Cong Wang; Jiaxi Gu; Panwen Hu; Songcen Xu; Hang Xu; Xiaodan Liang

TL;DR

DreamVideo tackles the fidelity–flicker trade-off in image-to-video generation by introducing a frame-retention branch that injects image-derived signals into a pre-trained video diffusion backbone. By extracting image features with convolutional layers and fusing them with the latent diffusion process, the method preserves the input image details while enabling motion control via text, further enhanced by double-condition classifier-free guidance for image-and-text conditioning. Empirical results on UCF101 and MSR-VTT demonstrate strong image retention and state-of-the-art or competitive video quality (low FVD, high IS and FFF metrics), with the ability to lengthen videos through Two-Stage Inference and to produce varied outputs from the same initial frame by changing prompts. DreamVideo thus offers a scalable, production-friendly pipeline that integrates image fidelity with flexible text-driven motion, holding promise for controllable video generation in real-world applications.
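The fusion step can be illustrated with a minimal NumPy sketch. The abstract describes it as passing the reference image through convolution layers and concatenating the features with the noisy latents. Everything concrete here is an assumption made for illustration, not the authors' implementation: the tensor shapes, the single naive 3x3 convolution, the use of an image already in latent resolution, and the choice to repeat the image features across all frames before a channel-wise concatenation. The function names `conv2d_same` and `build_model_input` are hypothetical.

```python
import numpy as np


def conv2d_same(x, weight):
    """Naive 'same'-padded convolution (illustrative stand-in for a conv block).

    x: (C_in, H, W); weight: (C_out, C_in, kH, kW) with odd kH, kW.
    """
    c_out, _, kh, kw = weight.shape
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((c_out, h, w))
    for i in range(kh):
        for j in range(kw):
            # Mix input channels for one kernel offset and accumulate.
            patch = padded[:, i:i + h, j:j + w]
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j], patch)
    return out


def build_model_input(noisy_latents, image_latent, weight):
    """Concatenate image features with noisy latents along the channel axis.

    noisy_latents: (F, C, H, W); image_latent: (C_in, H, W).
    Assumption: the same image features are repeated for every frame.
    """
    features = conv2d_same(image_latent, weight)  # (C_feat, H, W)
    num_frames = noisy_latents.shape[0]
    tiled = np.broadcast_to(features, (num_frames,) + features.shape)
    return np.concatenate([noisy_latents, tiled], axis=1)


rng = np.random.default_rng(0)
noisy_latents = rng.standard_normal((16, 4, 32, 32))  # 16 frames, 4 channels
image_latent = rng.standard_normal((4, 32, 32))       # reference image
weight = rng.standard_normal((4, 4, 3, 3)) * 0.1      # hypothetical conv weights

model_input = build_model_input(noisy_latents, image_latent, weight)
print(model_input.shape)
```

Because the image features enter as extra input channels rather than as a semantic embedding, the network sees spatially aligned detail from the reference image at every denoising step, which is the intuition behind the retention claim.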

Abstract

Image-to-video generation, which aims to generate a video starting from a given reference image, has drawn great attention. Existing methods try to extend pre-trained text-guided image diffusion models to image-guided video generation. Nevertheless, these methods often produce either low-fidelity or flickering videos because they rely on shallow image guidance and have poor temporal consistency. To tackle these problems, we propose DreamVideo, a high-fidelity image-to-video generation method that adds a frame retention branch to a pre-trained video diffusion model. Instead of integrating the reference image into the diffusion process at a semantic level, DreamVideo perceives the reference image via convolution layers and concatenates the resulting features with the noisy latents as model input. In this way, the details of the reference image are preserved to the greatest extent. In addition, by incorporating double-condition classifier-free guidance, a single image can be directed to videos of different actions by providing varying text prompts. This has significant implications for controllable video generation and holds broad application prospects. We conduct comprehensive experiments on public datasets, and both quantitative and qualitative results indicate that our method outperforms state-of-the-art methods. In particular, our model has a strong image retention ability and, to the best of our knowledge, delivers the best fidelity results on UCF101 among image-to-video models. Precise control can also be achieved by giving different text prompts. Further details and comprehensive results are available at https://anonymous0769.github.io/DreamVideo/.
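To make the double-condition classifier-free guidance concrete, here is a minimal sketch of one common two-condition formulation, the one popularized by InstructPix2Pix. Whether DreamVideo uses exactly this combination is an assumption; this page does not reproduce the paper's equation. The function name `dual_condition_cfg`, the scale values, and the random arrays standing in for the three noise predictions are hypothetical.

```python
import numpy as np


def dual_condition_cfg(eps_uncond, eps_image, eps_image_text,
                       image_scale, text_scale):
    """Combine three noise predictions with separate guidance scales.

    eps_uncond:      prediction with neither image nor text condition
    eps_image:       prediction with the image condition only
    eps_image_text:  prediction with both image and text conditions
    Assumed formulation (InstructPix2Pix style), not taken from the paper.
    """
    image_direction = eps_image - eps_uncond     # effect of adding the image
    text_direction = eps_image_text - eps_image  # effect of adding the text
    return (eps_uncond
            + image_scale * image_direction
            + text_scale * text_direction)


rng = np.random.default_rng(0)
shape = (16, 4, 32, 32)  # frames, channels, height, width (illustrative)
eps_uncond = rng.standard_normal(shape)
eps_image = rng.standard_normal(shape)
eps_image_text = rng.standard_normal(shape)

# Hypothetical scales: a larger image scale pulls harder toward the
# reference image, a larger text scale pulls harder toward the prompt.
guided = dual_condition_cfg(eps_uncond, eps_image, eps_image_text,
                            image_scale=1.5, text_scale=7.5)
print(guided.shape)
```

With both scales set to 1 the expression reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one, so the two scales act as independent knobs for image fidelity and prompt adherence.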

Paper Structure (19 sections, 5 equations, 7 figures, 3 tables)

Introduction
Related work
Video diffusion models
Image-to-video generation
Preliminary
Method
Model Structure
Image Retention
Classifier-free Guidance for Two Conditionings
Inference
Experiments
Experimental setup
Qualitative evaluation
Quantitative evaluation
Varied textual inputs
...and 4 more sections

Figures (7)

Figure 1: With DreamVideo, high fidelity is obtained between the input image and the first frame of the generated video, e.g., "a young black child". Moreover, text guidance helps to control the motion of the video content, including "dance" and "walk". The showcase images are from MidJourney.
Figure 2: The architecture of DreamVideo. A reference image is processed by a convolution block and concatenated with the representation of noisy latents. The Image Retention Module, as a side branch copying from the downsample blocks of U-Net, plays a role in maintaining the visual details from the input image and meanwhile also accepting text prompts for motion control.
Figure 3: Generated first frames under different image guidance scales (GS) for classifier-free guidance.
Figure 4: Comparative analysis across methods: I2VGen-XL, VideoCrafter1, and Ours (top to bottom arrangement). Leftmost is the original image, columns 1-3 indicate frames 0-3, and the final column presents the last frame.
Figure 5: Given the same image, different text prompts can lead to different output videos. This can be seen as evidence of our text-guidance capability for controlling video content motion.
...and 2 more figures
