Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen

TL;DR

Live2Diff enables live streaming video translation by introducing uni-directional temporal self-attention with a warmup region, paired with a KV-cache and pipelined denoising to achieve interactive framerates. The method replaces bidirectional temporal attention in prior video diffusion models with a masked, autoregressive design, while preserving temporal consistency through warmup frames that contribute to all future outputs. Depth conditioning provides structural fidelity to the input video, and a lightweight conditioning module stabilizes spatial details during style transfer. Empirical results on DAVIS-2017 demonstrate improved structure consistency and competitive temporal smoothness, with 16 FPS achieved on an RTX 4090, highlighting practical potential for real-time live-stream editing and virtual-video applications, alongside acknowledged limitations and ethical considerations.

Abstract

Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio, thanks to their temporally uni-directional attention mechanism, which models correlations between the current token and previous tokens. However, video streaming remains much less explored, despite a growing need for live video processing. State-of-the-art video diffusion models leverage bi-directional temporal attention to model the correlations between the current frame and all the surrounding (i.e. including future) frames, which hinders them from processing streaming videos. To address this problem, we present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation. Compared to previous works, our approach ensures temporal consistency and smoothness by correlating the current frame with its predecessors and a few initial warmup frames, without any future frames. Additionally, we use a highly efficient denoising scheme featuring a KV-cache mechanism and pipelining, to facilitate streaming video translation at interactive framerates. Extensive experiments demonstrate the effectiveness of the proposed attention mechanism and pipeline, outperforming previous methods in terms of temporal smoothness and/or efficiency.
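
The attention pattern described in the abstract (each frame sees its predecessors and a few initial warmup frames, never future frames) can be illustrated with a small mask builder. This is a minimal sketch, not the authors' implementation: the function name `warmup_causal_mask`, the parameters `warmup` and `recent`, and the choice to limit each non-warmup frame to a fixed number of most recent frames are assumptions made for illustration.

```python
import numpy as np


def warmup_causal_mask(num_frames: int, warmup: int, recent: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if frame i may attend to frame j.

    Assumptions (hypothetical, for illustration only):
      - the first `warmup` frames attend to each other bi-directionally;
      - every later frame attends to all warmup frames plus itself and its
        `recent - 1` immediate predecessors, and never to future frames.
    """
    i = np.arange(num_frames)[:, None]  # query frame index (rows)
    j = np.arange(num_frames)[None, :]  # key frame index (columns)

    # Warmup frames are visible to every frame, including each other.
    sees_warmup = j < warmup

    # Non-warmup frames additionally see a causal window of recent frames.
    causal_recent = (i >= warmup) & (j <= i) & (j > i - recent)

    return sees_warmup | causal_recent


if __name__ == "__main__":
    # 8 frames, 2 warmup frames, 2 recent frames per query.
    mask = warmup_causal_mask(num_frames=8, warmup=2, recent=2)
    print(mask.astype(int))
```

In an attention layer, positions where the mask is False would typically be set to negative infinity before the softmax, so that they receive zero weight.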

Paper Structure

This paper contains 20 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: We visualize different types of temporal self-attention when the number of frames ($F = 8$) exceeds the length of the context window ($L = 4$). The $j$-th cell of the $i$-th row is highlighted if the output for frame $i$ may contain information from frame $j$. The red square delineates the attention mask used during training. (a) shows temporal self-attention in current video diffusion models, which is bi-directional within the context window without overlap between chunks. (b) uses a sliding window with overlap $L_s$ (three subsequent positions of which are highlighted in different colors, for clarity) and fuses the output of overlap regions. (c) denotes the uni-directional attention widely used in LLMs. (d) shows the attention proposed by our method. We set the initial $L_w$ frames as warmup frames and apply bi-directional attention to them, while using uni-directional attention for the subsequent frames. The initial warmup frames also contribute to the output for all future frames.
  • Figure 2: The overview of Live2Diff. (a) During training, our model takes as inputs $L$ frames of noisy latents $z_t^{f:f+L}$ and depth conditioning $y^{f:f+L}$, where $f:f+L$ delimits the frame interval in a video stream, $t$ is the denoising timestep, and $\oplus$ denotes point-wise addition. (b) During inference, frame $z^{f+1}$ is incorporated into the processing batch as it streams in, often before earlier frames are fully denoised. This results in a batch that includes frames at various denoising timesteps (e.g., $t_1$ and $t_2$). We employ a KV-cache mechanism to effectively reuse $K$ and $V$ maps from previous frames, significantly improving inference efficiency while ensuring temporal consistency.
  • Figure 3: The X-T slice shows how the pixel values at the same X-coordinate change over time T. The position of the horizontal lines in the video corresponds to the X-coordinate positions visualized in the X-T slice. The color of each line represents the time in the X-T plot. Red dashed boxes denote regions suffering from flickering and structural inconsistency, while blue boxes indicate areas where these issues are resolved. Flickering and gradual change in the background region can be observed in (b), (c) and (d), which use the first three attention modes illustrated in Figure 1, respectively. In case (e), with the last attention mode from Figure 1 (see also \ref{meth:tsa}), background flickering is reduced. The depth conditioning in (f) improves structure consistency further.
  • Figure 4: (a) - (c) depict the usage of our KV-cache for the first steps of a stream, with $L_w = T = 2$. The colors of the squares indicate which frame they belong to. $Q, K, V$ are the matrices used in the attention equation (\ref{equ:attn}), with subscripts indicating which denoising step they are used in and superscripts indicating which frame they belong to. Each row belongs to one of the two denoising steps. Red arrows are overwrite operations. For a step-by-step walkthrough, see \ref{meth:pipeline}.
  • Figure 5: We compare the output quality of our method to a number of previous approaches: (a) shows temporally adjacent frames, while (b) shows frames temporally further apart. While our method preserves the spatial structure of the input well, producing the desired output styles, previous methods tend to change even the semantic content of the frames. See more discussion in \ref{sec:comparison}.
  • ...and 3 more figures
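
The KV-cache usage described in the captions of Figures 2 and 4 (reusing $K$ and $V$ from previous frames, with one row per denoising step and overwrite operations) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the class name `StepKVCache`, the ring-buffer overwrite policy, the single-vector-per-frame simplification, and the absence of positional encoding are all hypothetical choices. One such cache would be kept per denoising step.

```python
import numpy as np


class StepKVCache:
    """K/V cache for ONE denoising step (hypothetical sketch).

    Holds fixed warmup keys/values plus a ring buffer for recent frames,
    where the oldest recent entry is overwritten by each new frame.
    """

    def __init__(self, warmup_k: np.ndarray, warmup_v: np.ndarray, recent: int):
        self.warmup_k = warmup_k            # shape (L_w, d), never overwritten
        self.warmup_v = warmup_v            # shape (L_w, d)
        dim = warmup_k.shape[1]
        self.recent_k = np.zeros((recent, dim))
        self.recent_v = np.zeros((recent, dim))
        self.count = 0                      # number of frames seen so far

    def attend(self, q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
        """Store the new frame's K/V, then attend over warmup + cached frames."""
        slot = self.count % len(self.recent_k)   # overwrite the oldest slot
        self.recent_k[slot] = k
        self.recent_v[slot] = v
        self.count += 1

        filled = min(self.count, len(self.recent_k))
        keys = np.concatenate([self.warmup_k, self.recent_k[:filled]])
        values = np.concatenate([self.warmup_v, self.recent_v[:filled]])

        # Scaled dot-product attention for a single query vector.
        scores = keys @ q / np.sqrt(q.shape[0])
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()
        return weights @ values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cache = StepKVCache(rng.normal(size=(2, 4)), rng.normal(size=(2, 4)), recent=2)
    for _ in range(3):  # stream three frames through the same denoising step
        out = cache.attend(rng.normal(size=4), rng.normal(size=4), rng.normal(size=4))
    print(out.shape)
```

Because no future frame is ever needed, each incoming frame only computes its own $Q$, $K$, $V$ and reads the rest from the cache, which is what makes the uni-directional design compatible with streaming.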