Diffusion-aided Extreme Video Compression with Lightweight Semantics Guidance

Maojun Zhang, Haotian Wu, Richeng Jin, Deniz Gunduz, Krystian Mikolajczyk

TL;DR

This work tackles extreme video compression by shifting from pixel-level motion to semantics-guided diffusion-based reconstruction. It decomposes motion into background camera pose and foreground segmentation maps, encoding compact representations while guiding a diffusion model to reconstruct frames with semantic consistency. Key contributions include FlowMap-based camera pose estimation, an LLM-assisted, SAM2-driven foreground mask pipeline, and adapters for a diffusion model that fuse pose and segmentation cues via Plücker embeddings and ControlNet. Empirical results on DAVIS and RealEstate10K show CPSGD reaching on the order of $0.003$ BPP with superior perceptual and distortion metrics at ultra-low bitrates, highlighting the practical potential of semantic-driven diffusion for video compression.
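The Plücker embedding mentioned above turns a camera pose into a per-pixel map that a diffusion adapter can consume. The sketch below follows one common formulation, in which each pixel's ray with origin o and unit direction d is encoded as the 6-vector (o × d, d). It is an illustration, not the paper's implementation: the function name `plucker_embedding`, the camera-to-world pose convention, and the ordering of the six channels are all assumptions.

```python
import numpy as np

def plucker_embedding(K, R, t, height, width):
    """Per-pixel Plücker ray embedding of shape (height, width, 6).

    Assumed conventions (not taken from the paper):
      K    : 3x3 pinhole intrinsics
      R, t : camera-to-world rotation (3x3) and camera center (3,)
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project to camera-frame ray directions, then rotate to world frame.
    dirs = (pix @ np.linalg.inv(K).T) @ R.T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Moment vector o x d, with the ray origin o at the camera center t.
    origin = np.broadcast_to(np.asarray(t, dtype=float), dirs.shape)
    moment = np.cross(origin, dirs)

    return np.concatenate([moment, dirs], axis=-1)
```

By construction the moment is orthogonal to the direction at every pixel, so the two halves of the embedding jointly identify the ray regardless of which point on it is chosen as the origin.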

Abstract

Modern video codecs and learning-based approaches struggle to preserve semantics at extremely low bit-rates because they rely on low-level spatiotemporal redundancies. Generative models, especially diffusion models, offer a new paradigm for video compression by leveraging high-level semantic understanding and powerful visual synthesis. This paper proposes a video compression framework that integrates generative priors to drastically reduce bit-rate while maintaining reconstruction fidelity. Specifically, our method compresses high-level semantic representations of the video, then uses a conditional diffusion model to reconstruct frames from these semantics. To further improve compression, we characterize motion information with global camera trajectories and foreground segmentation: background motion is compactly represented by camera pose parameters, while foreground dynamics are represented by sparse segmentation masks. This significantly boosts compression efficiency, enabling decent video reconstruction at extremely low bit-rates.
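To give a sense of scale for "extremely low bit-rates", the sketch below computes bits per pixel (BPP) for a clip and the payload that fits within a target BPP. The helper names and the example clip size are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def bits_per_pixel(total_bits, num_frames, height, width):
    """Average number of coded bits per pixel over a whole clip."""
    return total_bits / (num_frames * height * width)

def max_payload_bytes(target_bpp, num_frames, height, width):
    """Largest payload, in bytes, that stays within a target BPP."""
    return target_bpp * num_frames * height * width / 8

# Hypothetical clip: 30 frames at 512x512 (illustrative numbers only).
budget = max_payload_bytes(0.003, num_frames=30, height=512, width=512)
print(f"{budget:.2f} bytes")  # 2949.12 bytes
```

For this hypothetical clip, a budget of 0.003 BPP leaves roughly 2.9 kB for everything that is transmitted, which is why compact descriptions such as pose parameters and sparse masks matter at this operating point.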

Paper Structure

This paper contains 8 sections, 4 equations, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Comparison of the proposed framework with modern video codecs and learning-based compression frameworks.
  • Figure 1: Compression analysis of the different parts of the proposed CPSGD scheme.
  • Figure 2: Extreme video compression framework via hierarchical motion semantics representation and compression.
  • Figure 3: Foreground moving-object segmentation via in-context learning with a large language model.
  • Figure 4: Extreme video compression framework via hierarchical motion semantics representation and compression.
  • ...and 2 more figures