Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models
Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen
TL;DR
Live2Diff enables live streaming video translation by introducing uni-directional temporal self-attention with a warmup region, paired with a KV-cache and pipelined denoising to achieve interactive framerates. The method replaces bidirectional temporal attention in prior video diffusion models with a masked, autoregressive design, while preserving temporal consistency through warmup frames that contribute to all future outputs. Depth conditioning provides structural fidelity to the input video, and a lightweight conditioning module stabilizes spatial details during style transfer. Empirical results on DAVIS-2017 demonstrate improved structure consistency and competitive temporal smoothness, with 16 FPS achieved on an RTX 4090, highlighting practical potential for real-time live-stream editing and virtual-video applications, alongside acknowledged limitations and ethical considerations.
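As a rough illustration of the masked, autoregressive design described above, the following sketch builds a temporal attention mask in which every frame may attend to a few initial warmup frames and to a short window of its predecessors, but never to a future non-warmup frame. The function name `warmup_causal_mask`, the `window` parameter, and the choice to let warmup frames attend to each other in both directions are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def warmup_causal_mask(num_frames: int, num_warmup: int, window: int) -> np.ndarray:
    """Return a boolean mask where entry [i, j] is True if frame i may attend to frame j.

    Illustrative assumptions (not from the paper):
    - the first `num_warmup` frames are visible to every frame, including each other;
    - every frame additionally sees itself and its `window - 1` direct predecessors.
    """
    i = np.arange(num_frames)[:, None]  # query frame index (rows)
    j = np.arange(num_frames)[None, :]  # key frame index (columns)
    to_warmup = np.broadcast_to(j < num_warmup, (num_frames, num_frames))
    causal_window = (j <= i) & (j > i - window)
    return to_warmup | causal_window

# Example: 6 frames, 2 warmup frames, window of 2 frames.
mask = warmup_causal_mask(num_frames=6, num_warmup=2, window=2)
print(mask.astype(int))
```

In an attention layer, positions where the mask is False would typically have their scores set to negative infinity before the softmax, so that they receive zero weight.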
Abstract
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio, thanks to their temporally uni-directional attention mechanism, which models correlations between the current token and previous tokens. However, video streaming remains much less explored, despite a growing need for live video processing. State-of-the-art video diffusion models leverage bi-directional temporal attention to model the correlations between the current frame and all the surrounding (i.e. including future) frames, which hinders them from processing streaming videos. To address this problem, we present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation. Compared to previous works, our approach ensures temporal consistency and smoothness by correlating the current frame with its predecessors and a few initial warmup frames, without any future frames. Additionally, we use a highly efficient denoising scheme featuring a KV-cache mechanism and pipelining, to facilitate streaming video translation at interactive framerates. Extensive experiments demonstrate the effectiveness of the proposed attention mechanism and pipeline, outperforming previous methods in terms of temporal smoothness and/or efficiency.
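To make the role of the KV-cache more concrete, here is a minimal sketch of how keys and values from earlier frames could be stored and reused when a new frame arrives, so that past frames need not be recomputed. It is a simplified, single-token-per-frame toy: the class `StreamingKVCache`, the function `attend`, and the fixed-size rolling window are hypothetical names and design choices for illustration only, and the sketch ignores the per-denoising-step caching and pipelining that a full diffusion pipeline would need.

```python
from collections import deque
import numpy as np

class StreamingKVCache:
    """Toy cache: warmup entries are kept permanently, recent entries in a rolling window."""

    def __init__(self, num_warmup: int, window: int):
        self.num_warmup = num_warmup
        self.warmup = []                     # (key, value) pairs of the warmup frames
        self.recent = deque(maxlen=window)   # oldest non-warmup entry is evicted automatically

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        if len(self.warmup) < self.num_warmup:
            self.warmup.append((k, v))
        else:
            self.recent.append((k, v))

    def keys_values(self):
        entries = self.warmup + list(self.recent)
        ks = np.stack([k for k, _ in entries])
        vs = np.stack([v for _, v in entries])
        return ks, vs

def attend(q: np.ndarray, ks: np.ndarray, vs: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention of one query vector over the cached keys and values."""
    scores = ks @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ vs

# Usage: stream 10 frames with feature dimension 4.
rng = np.random.default_rng(0)
cache = StreamingKVCache(num_warmup=2, window=3)
for _ in range(10):
    q, k, v = rng.normal(size=(3, 4))
    cache.append(k, v)                       # the current frame attends to itself as well
    out = attend(q, *cache.keys_values())
```

Because the cache never holds more than `num_warmup + window` entries, the cost of processing each new frame stays constant regardless of how long the stream runs.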
