Extreme Video Compression with Pre-trained Diffusion Models

Bohan Li; Yiming Liu; Xueyan Niu; Bo Bai; Lei Deng; Deniz Gündüz

TL;DR

This work tackles extreme video compression by leveraging a pre-trained diffusion model at the decoder to generate future frames from a small set of encoded frames. It combines neural image compression for a subset of frames with autoregressive diffusion-based frame generation, using a sequential encoding strategy driven by a perceptual quality threshold. The approach achieves ultra-low bitrates (as low as 0.02 bpp) while delivering perceptually convincing reconstructions, outperforming traditional codecs in the low-bpp regime and demonstrating the value of exploiting temporal dependencies through generative models. The results highlight the potential of diffusion-based prediction for efficient video compression and suggest directions for reducing encoder complexity and extending to alternative generative models.

Abstract

Diffusion models have achieved remarkable success in generating high-quality image and video data. More recently, they have also been used for image compression with high perceptual quality. In this paper, we present a novel approach to extreme video compression that leverages the predictive power of diffusion-based generative models at the decoder. The conditional diffusion model takes several neurally compressed frames and generates subsequent frames. When the reconstruction quality drops below the desired level, new frames are encoded to restart prediction. The entire video is sequentially encoded to achieve a visually pleasing reconstruction, as measured by perceptual quality metrics such as the learned perceptual image patch similarity (LPIPS) and the Fréchet video distance (FVD), at bit rates as low as 0.02 bits per pixel (bpp). Experimental results demonstrate the effectiveness of the proposed scheme compared to standard codecs such as H.264 and H.265 in the low-bpp regime. The results showcase the potential of exploiting the temporal relations in video data using generative models. Code is available at: https://github.com/ElesionKyrie/Extreme-Video-Compression-With-Prediction-Using-Pre-trainded-Diffusion-Models-
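The restart-on-threshold loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `plan_encoding`, `predict`, and `distance` are hypothetical names, the toy predictor is a linear extrapolation standing in for the pre-trained diffusion model, the absolute difference stands in for LPIPS, each "frame" is a single number, and encoded frames are assumed to be recovered without loss (a neural image codec would introduce some distortion).

```python
from typing import Callable, List, Sequence, Tuple

Frame = float  # toy stand-in: each "frame" is a single number


def plan_encoding(
    frames: Sequence[Frame],
    predict: Callable[[Sequence[Frame]], Frame],
    distance: Callable[[Frame, Frame], float],
    threshold: float,
    num_cond: int = 2,
) -> Tuple[List[int], List[Frame]]:
    """Return (indices of frames to encode, decoder-side reconstruction)."""
    encoded: List[int] = []
    recon: List[Frame] = []
    t = 0
    while t < len(frames):
        # Encode a block of conditioning frames to (re)start prediction.
        block_end = min(t + num_cond, len(frames))
        for i in range(t, block_end):
            encoded.append(i)
            recon.append(frames[i])  # assumption: lossless reconstruction
        t = block_end
        # Generate frames autoregressively while quality stays acceptable.
        while t < len(frames):
            candidate = predict(recon[-num_cond:])
            if distance(candidate, frames[t]) > threshold:
                break  # quality dropped: go back and encode new frames
            recon.append(candidate)
            t += 1
    return encoded, recon


def linear_extrapolation(context: Sequence[Frame]) -> Frame:
    """Hypothetical predictor: continue the trend of the last two frames."""
    return 2 * context[-1] - context[-2]


def abs_distance(a: Frame, b: Frame) -> float:
    """Hypothetical quality measure (lower is better), in place of LPIPS."""
    return abs(a - b)


video = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
encoded_idx, reconstruction = plan_encoding(
    video, linear_extrapolation, abs_distance, threshold=0.5
)
print(encoded_idx)  # [0, 1, 4, 5]
```

In this toy run only four of the eight frames are encoded; the jump at index 4 breaks the prediction, so a new pair of frames is encoded there and generation resumes. Note that the encoder has to run the same predictor as the decoder in order to know where the threshold is crossed.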

Paper Structure (13 sections, 4 equations, 4 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Video Compression Codecs
Video Prediction and Generation
Generative models for video compression
Methods
Preprocessing
Frame generation with diffusion-based models
The sequential encoding process
Experimental Results
Experimental Setup
Results
Conclusion and Future Direction

Figures (4)

Figure 1: Method overview. The first few frames are compressed by the encoder, while the following frames are generated using a pre-trained generative model at the decoder. When the generation quality drops below the desired threshold, new frames are encoded to sustain the overall visual quality.
Figure 2: Rate-distortion (perception) performance on the Cityscapes dataset.
Figure 3: Visual comparison with a state-of-the-art codec on the Cityscapes dataset. We show the first 6 and last 3 frames of the original video (top row), the H.264 reconstruction (second row), and the reconstruction from the proposed compression scheme (bottom row). Both the H.264 and the proposed reconstructions are controlled to have a bpp of 0.07. Our model uses the first two frames to autoregressively generate subsequent frames, based on an LPIPS threshold of 0.16.
Figure 4: Visual comparison with a state-of-the-art codec on the SMMNIST dataset. We show the first 6 and last 3 frames of the original video (top row), the H.264 reconstruction (second row), and the reconstruction from the proposed compression scheme (bottom row). Both the H.264 and the proposed reconstructions are controlled to have a bpp of 0.04, and the LPIPS threshold is 0.16. Frames are generated conditioned on 5 frames.
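To make the bpp figures in the captions concrete, the sketch below works through the arithmetic. It assumes bpp is averaged over every pixel of every frame, including generated frames that cost no bits; the clip size and the per-frame rate are hypothetical values chosen for illustration, not numbers from the paper, and the function names are illustrative.

```python
def bits_per_pixel(total_bits: float, num_frames: int, height: int, width: int) -> float:
    """Average bits per pixel over all frames of the clip."""
    return total_bits / (num_frames * height * width)


def bit_budget(target_bpp: float, num_frames: int, height: int, width: int) -> float:
    """Total number of bits a clip may use to meet a target bpp."""
    return target_bpp * num_frames * height * width


def average_bpp(encoded_frames: int, total_frames: int, bpp_per_encoded_frame: float) -> float:
    """Clip-level bpp when only some frames are encoded and the rest are generated."""
    return bpp_per_encoded_frame * encoded_frames / total_frames


# Hypothetical clip: 30 frames of 128 x 128 pixels.
budget_bits = bit_budget(0.07, num_frames=30, height=128, width=128)
print(f"{budget_bits / 8:.0f} bytes for the whole clip")

# Hypothetical split: 4 of 8 frames encoded at 0.14 bpp each, 4 generated for free.
print(f"{average_bpp(4, 8, 0.14):.2f} bpp on average")
```

The second helper shows why generation lowers the rate: every frame the decoder can generate within the quality threshold contributes pixels to the denominator without adding bits.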