Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

Xin Yuan; Jinoo Baek; Keyang Xu; Omer Tov; Hongliang Fei

TL;DR

This work tackles text-to-video spatial super-resolution under limited high-resolution video data by reusing off-the-shelf image diffusion models. It inflates image diffusion weights into a video UNet and attaches a frame-wise temporal adapter to enforce temporal coherence while keeping the bulk of weights frozen. Across experiments on the Shutterstock dataset, the temporal adapter provides a favorable balance between SR quality, temporal consistency, and computational efficiency compared with full fine-tuning or zero-shot baselines. The results demonstrate data efficiency and practical scalability, with future directions including higher resolutions and longer video sequences.
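The inflation step can be illustrated with a minimal NumPy sketch. It assumes that "inflating" an image layer means applying the same pretrained 2D weights to every frame by folding the time axis into the batch axis; the 1x1 convolution, the tensor sizes, and the names `image_layer` and `inflated_layer` are hypothetical stand-ins, not the paper's UNet.

```python
import numpy as np

def image_layer(x, weight):
    """Stand-in for a pretrained 2D image layer (a 1x1 convolution).

    x: (N, C_in, H, W), weight: (C_out, C_in) -> (N, C_out, H, W).
    """
    return np.einsum("oc,nchw->nohw", weight, x)

def inflated_layer(video, weight):
    """Apply the image layer to a video of shape (B, T, C, H, W).

    The time axis is folded into the batch axis, so the unchanged image
    weights process every frame independently.
    """
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)   # (B*T, C, H, W)
    out = image_layer(frames, weight)        # (B*T, C_out, H, W)
    return out.reshape(b, t, -1, h, w)       # back to (B, T, C_out, H, W)

rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 3))          # hypothetical pretrained weights
video = rng.standard_normal((2, 4, 3, 16, 16))

out = inflated_layer(video, weight)

# Reference: run the image layer on each frame separately and restack.
per_frame = np.stack(
    [image_layer(video[:, i], weight) for i in range(video.shape[1])], axis=1
)
```

Under this assumption the inflated layer reproduces the image model's per-frame output exactly, which is also why inflation alone gives no temporal coupling: frames never exchange information until a temporal module is added.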

Abstract

We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach that leverages the already-learned capacity of pixel-level image diffusion models to capture spatial information for video generation. To accomplish this goal, we design an efficient architecture by inflating the weights of the text-to-image SR model into our video generation framework. Additionally, we incorporate a temporal adapter to ensure temporal coherence across video frames. We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational cost and super-resolution quality. Empirical evaluation, both quantitative and qualitative, on the Shutterstock video dataset demonstrates that our approach can perform text-to-video SR generation with good visual quality and temporal consistency. To evaluate temporal coherence, we also present visualizations in video format at https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing .
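A temporal adapter of the kind described above might look like the following NumPy sketch. It assumes single-head self-attention across the frame axis at each spatial location, added as a residual to the frozen per-frame features. The projection matrices `wq`, `wk`, `wv`, `wo`, the zero initialization of `wo`, and all shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_adapter(feats, wq, wk, wv, wo):
    """Self-attention over frames, applied as a residual.

    feats: (B, T, C, H, W) features from the frozen, inflated image layers.
    wq, wk, wv, wo: (C, C) adapter projections (the only trainable weights
    in this sketch).
    """
    b, t, c, h, w = feats.shape
    # Treat each spatial location as a sequence of T frame tokens.
    tokens = feats.transpose(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c))  # (B*H*W, T, T)
    out = tokens + (attn @ v) @ wo                          # residual update
    return out.reshape(b, h, w, t, c).transpose(0, 3, 4, 1, 2)

rng = np.random.default_rng(0)
c = 8
feats = rng.standard_normal((2, 4, c, 6, 6))
wq, wk, wv = (rng.standard_normal((c, c)) * 0.1 for _ in range(3))

# Assumption: a zero-initialized output projection makes the adapter an
# identity map at the start of tuning, preserving the image model's output.
out_init = temporal_adapter(feats, wq, wk, wv, np.zeros((c, c)))

# After tuning (simulated here with random weights) frames exchange information.
out_tuned = temporal_adapter(feats, wq, wk, wv, rng.standard_normal((c, c)) * 0.1)

adapter_params = 4 * c * c  # small relative to a full UNet
```

The design choice sketched here is that only the four small projections would be tuned while the inflated UNet weights stay frozen, which is one way to read the trade-off between computational cost and SR quality that the abstract reports.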

Paper Structure (10 sections, 2 equations, 9 figures, 1 table)

Introduction
Related Work
Approach
Inflation with Image Weights
Temporal Adapter with Frame-wise Attention
Experiments
Quantitative Results
Qualitative Results
Inflation is Data Efficient
Conclusion

Figures (9)

Figure 1: Overall architecture of our approach. Top: we inflate the UNet weights from a text-to-image model into a text-to-video model to perform a diffusion-based super-resolution task. Bottom: we inject and tune a temporal adapter in the inflated architecture while keeping the UNet weights frozen.
Figure 2: Weights inflation from a text-to-image SR UNet to a text-to-video SR UNet.
Figure 3: Temporal adapter with attention that ensures temporal coherence across a video clip.
Figure 4: Visualization of different tuning methods after image model inflation, conditioned on text prompt "Dog dachshund on chromakey".
Figure 5: Text prompt: Camera follows cooking mezze machine rigate pasta in tomato sauce.
...and 4 more figures
