Efficient Video Diffusion with Sparse Information Transmission for Video Compression

Mingde Zhou; Zheng Chen; Yulun Zhang

Abstract

Video compression aims to maximize reconstruction quality at minimal bitrate. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry frames with poor perceptual quality. Moreover, existing generative compression methods often treat video frames independently, which limits both their temporal coherence and their efficiency. To address these challenges, we propose Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. The ODFTE then processes this intermediate sequence as a whole, thereby exploiting temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to reconstruct each frame adaptively according to its frame type, optimizing the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state of the art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at https://github.com/MingdeZhou/Diff-SIT.

Paper Structure (14 sections, 10 equations, 7 figures, 5 tables)


Introduction
Related Works
Learned and Generative Video Compression
Diffusion Models in Video Compression
Optical Flow in Video Compression
Methodology
Sparse Temporal Encoding Module (STEM)
One-Step Video Diffusion with Frame Type Embedder (ODFTE)
Training and Inferencing
Experiments
Experimental Settings
Main Results
Ablation Study
Conclusion

Figures (7)

Figure 1: (1) LPIPS-bitrate-coherence comparison on MCL-JCV. Temporal coherence is measured by $E_{warp}$ [lai2018learning]. Our method achieves the best perceptual quality together with much higher temporal coherence. Because no diffusion-based video compression method has released its source code, we compare against a diffusion-based image compression method, SODEC [chen2025steering]. (2) Qualitative comparison on BQTerrace from HEVC Class B. Our method recovers fine-grained textures with high perceptual quality.
Figure 2: The overall architecture of our proposed Diff-SIT model. It consists of two main parts: the STEM and the ODFTE. An input frame sequence $\mathbf{x}$ is first divided into backbone frames and MV frames. The STEM then proceeds in two steps: (1) Backbone compression: the backbone frames are compressed and reconstructed. (2) MV compression: each reconstructed backbone frame serves as a reference for compressing the MV frames. After the STEM, the intermediate reconstructed sequence $\tilde{\mathbf{x}}$ is fed into the ODFTE and restored into the final frame sequence $\hat{\mathbf{x}}$. Image flow refers to the coding order of video frames.
Figure 3: Pipeline of P-frame compression and MV compression. In P-frame compression, the reference frame $\tilde{x}_{3t+2}$ is used to compress the target frame $x_{3t+5}$ via conditional encoding. In MV compression, the reference frame $\tilde{x}_{3t+2}$ is used to compress frames $x_{3t+1}$ and $x_{3t+3}$. The reconstructed flow fields $\hat{\mathbf{F}}_{\tilde{x}_{3t+2} \to x_{3t+1}}$ and $\hat{\mathbf{F}}_{\tilde{x}_{3t+2} \to x_{3t+3}}$ are used to warp the reference frame $\tilde{x}_{3t+2}$ and reconstruct the MV frames $\tilde{x}_{3t+1}$ and $\tilde{x}_{3t+3}$, respectively. The figure uses $\tilde{x}_{3t+3}$ as an example.
Figure 4: Analysis of reconstruction quality and bitrate cost versus the length of the continuous optical flow prediction chain. Reconstruction quality degrades rapidly as the prediction chain grows, while the required bitrate also increases significantly. The image on the right shows the corresponding frames 1-9, ordered from top-left to bottom-right.
Figure 5: Quantitative comparison with state-of-the-art methods on the HEVC Class B, MCL-JCV and UVG datasets. $\downarrow$ means lower is better.
...and 2 more figures
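
The frame indexing in the Figure 3 caption suggests a simple partition of the sequence: frame $3t+2$ is a backbone frame, and its neighbours $3t+1$ and $3t+3$ are MV frames that use it as the reference. The following is a minimal Python sketch of that partition only, not the authors' implementation. The function name `split_frame_types`, the 1-based indexing, and the requirement that the sequence length be a multiple of 3 are assumptions made for illustration; how the actual method handles other lengths is not stated here.

```python
from typing import Dict, List, Tuple


def split_frame_types(num_frames: int) -> Tuple[List[int], Dict[int, int]]:
    """Partition 1-indexed frames into backbone frames and MV frames.

    Assumes the indexing of the Figure 3 caption: frame 3t+2 is a backbone
    frame, and frames 3t+1 and 3t+3 are MV frames referencing frame 3t+2.
    """
    # Assumption for this sketch: only complete groups of three frames.
    if num_frames <= 0 or num_frames % 3 != 0:
        raise ValueError("this sketch expects a positive multiple of 3 frames")

    backbone: List[int] = []
    mv_reference: Dict[int, int] = {}  # MV frame index -> backbone frame index

    for idx in range(1, num_frames + 1):
        if idx % 3 == 2:
            # Frame 3t+2: compressed directly as a backbone frame.
            backbone.append(idx)
        elif idx % 3 == 1:
            # Frame 3t+1: reconstructed from the following backbone frame.
            mv_reference[idx] = idx + 1
        else:
            # Frame 3t+3: reconstructed from the preceding backbone frame.
            mv_reference[idx] = idx - 1

    return backbone, mv_reference


if __name__ == "__main__":
    backbone, mv_reference = split_frame_types(9)
    print(backbone)      # [2, 5, 8]
    print(mv_reference)  # {1: 2, 3: 2, 4: 5, 6: 5, 7: 8, 9: 8}
```

Under this layout every MV frame sits directly next to its reference, so no flow prediction is chained through another predicted frame, which is the situation Figure 4 reports as costly.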
