RAIN: Real-time Animation of Infinite Video Stream

Zhilei Shu; Ruili Feng; Yang Cao; Zheng-Jun Zha

TL;DR

RAIN tackles the challenge of real-time, infinite video stream animation with diffusion models on consumer hardware. By introducing Temporal Adaptive Attention and expanding the frame-token stream with a factor $p$, it enables long-range temporal dependencies while maintaining low latency. The approach combines a Reference Mechanism, LCM Distillation, and a tailored architecture built on Stable Diffusion to preserve identity and motion across frames. Empirical results across human movement, cross-domain face morphing, and style transfer demonstrate real-time performance with improved continuity and robustness. This work enables practical live-video applications on standard GPUs and provides a blueprint for scalable, real-time diffusion-based video synthesis.

Abstract

Live animation has gained immense popularity for enhancing online engagement, yet achieving high-quality, real-time, and stable animation with diffusion models remains challenging, especially on consumer-grade GPUs. Existing methods struggle to generate long, consistent video streams efficiently and are often limited by latency and by visual quality that degrades over extended periods. In this paper, we introduce RAIN, a pipeline solution capable of animating infinite video streams in real time with low latency using a single RTX 4090 GPU. The core idea of RAIN is to efficiently compute frame-token attention across different noise levels and long time intervals while simultaneously denoising a significantly larger number of frame tokens than previous stream-based methods. This design allows RAIN to generate video frames with much lower latency and at higher speed while maintaining long-range attention over extended video streams, resulting in enhanced continuity and consistency. Consequently, a Stable Diffusion model fine-tuned with RAIN for just a few epochs can produce video streams of unbounded length in real time and with low latency, without much compromise in quality or consistency. Despite these capabilities, RAIN introduces only a few additional 1D attention blocks, imposing minimal additional burden. Experiments on benchmark datasets and on super-long video generation demonstrate that RAIN can animate characters in real time with much better quality, accuracy, and consistency than competitors while incurring lower latency. All code and models will be made publicly available.

Paper Structure (41 sections, 7 equations, 11 figures, 3 tables)

Introduction
Related Work
Motion Transfer
Stream Video Processing
Video Style Transfer
Preliminaries
Consistency Model
Stream Diffusion
Reference Mechanism
Method
Temporal Adaptive Attention
Train and Inference
LCM Distillation
Architecture
Experiments
...and 26 more sections

Figures (11)

Figure 1: Cross-domain face morphing generation results. We achieve real-time animation of anime characters. Expressions of real humans can be successfully transferred to anime characters, and the generation is stable, consistent, and can run indefinitely.
Figure 2: Animation clips for cross-domain face morphing. Best viewed with Acrobat Reader. Click the images to play the animation clips.
Figure 3: Overview of the RAIN pipeline. We first feed the reference image into the Reference UNet and the CLIP Text Encoder; the spatial attention features from the Reference UNet and the CLIP embeddings are then fed into the Denoising UNet. The pose sequence is mapped through the pose guider and added to the intermediate feature after the post convolutional layer of the Denoising UNet. After every $N$ UNet function calls, the noise level of each frame is reduced by $T/p$ steps, so the first $K/p$ frames are clean. We pop the first $K/p$ frames and push $K/p$ frames of standard noise onto the latent pile. Each clean latent is then decoded by the VAE Decoder into a video frame.
Figure 4: Generation results from UBC-Fashion test dataset.
Figure 5: Results of cross-domain face morphing. The two leftmost columns are the original DWPose sequence and the transformed landmarks. The characters' expressions follow the input exactly. However, the transformation parameters need to be adjusted for each pairing of character and human, because features such as face length and eye size vary across characters.
...and 6 more figures
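
The pop-and-push bookkeeping described in the Figure 3 caption can be illustrated with a small sketch. It tracks only the noise level of each latent, not the latents themselves, and collapses the $N$ UNet calls of one cycle into a single update. The function names `init_pile` and `advance` are hypothetical, and the staircase initialization (group $g$ starting at noise level $(g+1)T/p$) is an assumption inferred from the caption's steady-state behaviour, not a detail confirmed by the source.

```python
from collections import deque


def init_pile(K, p, T):
    """Return noise levels for K latents split into p groups of K/p frames.

    Assumption: group g (0-based) starts at level (g + 1) * T / p, so the
    oldest group is one cycle away from clean and the newest is pure noise.
    """
    assert K % p == 0 and T % p == 0, "K and T must be divisible by p"
    group = K // p
    return deque((g + 1) * (T // p) for g in range(p) for _ in range(group))


def advance(pile, K, p, T):
    """Run one cycle (standing in for N UNet calls).

    Every frame's noise level drops by T/p, the first K/p frames become
    clean and are popped, and K/p frames of fresh noise (level T) are pushed.
    """
    group, step = K // p, T // p
    lowered = [level - step for level in pile]
    clean = lowered[:group]              # these frames would go to the VAE decoder
    assert all(level == 0 for level in clean)
    new_pile = deque(lowered[group:])    # remaining frames, still partially noisy
    new_pile.extend([T] * group)         # fresh standard-noise frames
    return clean, new_pile


if __name__ == "__main__":
    pile = init_pile(K=8, p=4, T=1000)
    clean, pile = advance(pile, K=8, p=4, T=1000)
    print(clean, list(pile))
```

Under these assumptions the pile returns to the same staircase of noise levels after every cycle, which is what lets the stream continue indefinitely while each cycle emits $K/p$ clean frames.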
