Diffusion-aided Extreme Video Compression with Lightweight Semantics Guidance
Maojun Zhang, Haotian Wu, Richeng Jin, Deniz Gunduz, Krystian Mikolajczyk
TL;DR
This work tackles extreme video compression by replacing pixel-level motion modelling with semantics-guided, diffusion-based reconstruction. It decomposes motion into background camera pose and foreground segmentation maps, encodes these as compact representations, and uses them to guide a diffusion model so that reconstructed frames stay semantically consistent. Key contributions include FlowMap-based camera pose estimation, an LLM-assisted, SAM2-driven foreground mask pipeline, and adapters for a diffusion model that fuse pose and segmentation cues via Plücker embeddings and ControlNet. Empirical results on DAVIS and RealEstate10K show the proposed method, CPSGD, operating at bitrates on the order of $0.003$ BPP (bits per pixel) with superior perceptual and distortion metrics at ultra-low bitrates, highlighting the practical potential of semantics-driven diffusion for video compression.
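As an illustration of the Plücker embedding mentioned above, the sketch below computes the 6-D Plücker coordinates of the viewing ray through a single pixel. It assumes a pinhole camera, a camera-to-world rotation `R`, a camera centre `t` in world coordinates, and the common (moment, direction) ordering; the function name `plucker_ray` and these conventions are illustrative assumptions, not details taken from the paper.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as sequences of floats."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(u, v, fx, fy, cx, cy, R, t):
    """Return (m, d) as a 6-tuple for the ray through pixel (u, v).

    Assumptions (hypothetical, for illustration only):
      fx, fy, cx, cy : pinhole intrinsics in pixels
      R              : 3x3 camera-to-world rotation as nested lists
      t              : camera centre in world coordinates
    """
    # Back-project the pixel to a direction in camera coordinates.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into world coordinates.
    d_world = tuple(sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3))
    # Normalise to a unit direction d.
    norm = math.sqrt(sum(c * c for c in d_world))
    d = tuple(c / norm for c in d_world)
    # Moment vector m = o x d, where o is the camera centre.
    m = cross(t, d)
    return m + d


# Usage: identity rotation, camera shifted one unit along x, pixel at the principal point.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ray = plucker_ray(320.0, 240.0, 500.0, 500.0, 320.0, 240.0, IDENTITY, (1.0, 0.0, 0.0))
```

Evaluating this for every pixel gives a six-channel map per frame, which is one common way to turn a handful of pose parameters into a dense conditioning signal for a diffusion model.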
Abstract
Modern video codecs and learning-based approaches struggle to preserve semantic content at extremely low bit-rates because they rely on low-level spatiotemporal redundancies. Generative models, especially diffusion models, offer a new paradigm for video compression by leveraging high-level semantic understanding and powerful visual synthesis. This paper proposes a video compression framework that integrates generative priors to drastically reduce bit-rate while maintaining reconstruction fidelity. Specifically, our method compresses high-level semantic representations of the video, then uses a conditional diffusion model to reconstruct frames from these semantics. To further improve compression, we characterize motion information with global camera trajectories and foreground segmentation: background motion is compactly represented by camera pose parameters, while foreground dynamics are represented by sparse segmentation masks. This significantly boosts compression efficiency, enabling decent video reconstruction at extremely low bit-rates.
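To give a sense of scale for "extremely low bit-rates", the following sketch converts a bits-per-pixel (BPP) figure into a per-frame byte budget and back. The 512x512 resolution and the 16-frame clip length are assumptions chosen for illustration and are not taken from the paper; only the order of magnitude of $0.003$ BPP comes from the summary above.

```python
def bytes_per_frame(bpp, width, height):
    """Average number of bytes available per frame at a given BPP."""
    return bpp * width * height / 8.0


def bits_per_pixel(total_bytes, width, height, num_frames):
    """BPP of a clip: total coded bits divided by total pixels over all frames."""
    return 8.0 * total_bytes / (width * height * num_frames)


# Hypothetical clip: 512x512 pixels, 16 frames, coded at 0.003 BPP.
WIDTH, HEIGHT, FRAMES = 512, 512, 16
budget = bytes_per_frame(0.003, WIDTH, HEIGHT)

# Round trip: a clip that spends `budget` bytes on every frame is at 0.003 BPP.
recovered_bpp = bits_per_pixel(budget * FRAMES, WIDTH, HEIGHT, FRAMES)
```

Under these assumed dimensions the budget is roughly 98 bytes per frame, which is why compact descriptors such as camera pose parameters and sparse masks are attractive carriers of motion information.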
