Low-Bitrate Video Compression through Semantic-Conditioned Diffusion

Lingdong Wang, Guan-Ming Su, Divya Kothandaraman, Tsung-Wei Huang, Mohammad Hajiesmaili, Ramesh K. Sitaraman

TL;DR

The paper introduces DiSCo, a semantic video compression framework that factorizes a video into a textual description, a spatiotemporally degraded video, and optional sketches or poses, then reconstructs high-quality content via a conditional diffusion model. By employing multimodal encoding, token interleaving, and in-context LoRA adaptation of a video diffusion transformer, DiSCo achieves substantial perceptual gains at ultra-low bitrates, outperforming both traditional codecs and prior semantic approaches. The approach demonstrates robustness across multiple benchmarks and provides detailed ablations validating the benefits of forward filling and modality-specific codecs. This work presents a practical, scalable paradigm for perceptually driven video compression that leverages generative priors to optimize for human perception rather than pixel fidelity alone.

Abstract

Traditional video codecs optimized for pixel fidelity collapse at ultra-low bitrates and produce severe artifacts. This failure arises from a fundamental misalignment between pixel accuracy and human perception. We propose a semantic video compression framework named DiSCo that transmits only the most meaningful information while relying on generative priors for detail synthesis. The source video is decomposed into three compact modalities: a textual description, a spatiotemporally degraded video, and optional sketches or poses that respectively capture semantic, appearance, and motion cues. A conditional video diffusion model then reconstructs high-quality, temporally coherent videos from these compact representations. Temporal forward filling, token interleaving, and modality-specific codecs are proposed to improve multimodal generation and modality compactness. Experiments show that our method outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.
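To make the "spatiotemporally degraded video" and "temporal forward filling" ideas concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the strided subsampling, the stride values, and the function names `degrade` and `forward_fill` are all assumptions for illustration, and the abstract does not specify the actual degradation or filling procedure.

```python
import numpy as np

def degrade(video, t_stride=4, s_stride=4):
    """Hypothetical degradation: keep every t_stride-th frame and
    every s_stride-th pixel along height and width."""
    return video[::t_stride, ::s_stride, ::s_stride]

def forward_fill(kept, t_stride, num_frames):
    """Restore the original frame count by repeating each kept frame
    until the next kept frame arrives (a simple forward fill)."""
    # For output frame i, the most recent kept frame is i // t_stride.
    idx = np.arange(num_frames) // t_stride
    return kept[idx]

rng = np.random.default_rng(0)
video = rng.random((16, 64, 64, 3))   # (frames, height, width, channels)

low = degrade(video)                  # compact stream that would be transmitted
filled = forward_fill(low, t_stride=4, num_frames=video.shape[0])

print(low.shape, filled.shape)
```

In this sketch, `filled` has the same number of frames as the source, so a downstream conditional model could align it frame by frame with the video it is asked to reconstruct. Whether DiSCo aligns its conditioning this way is an assumption here, not something the abstract states.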

Paper Structure

This paper contains 16 sections, 10 figures, and 10 tables.

Figures (10)

  • Figure 1: Overview of proposed method. Red means trainable module, blue means frozen module, yellow means non-learning operations.
  • Figure 2: Conditioning on sketch/pose modality at 0.005 BPP.
  • Figure 3: Workflow of the degraded video modality.
  • Figure 4: Illustration of token interleaving.
  • Figure 5: Modality mixture caused by frame interleaving.
  • ...and 5 more figures
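For a sense of scale, the 0.005 BPP (bits per pixel) operating point in the Figure 2 caption can be converted to a bitrate with the standard relation bitrate = BPP × width × height × frame rate. The resolution and frame rate below are assumed for illustration only and are not taken from the paper.

```python
def bitrate_kbps(bpp, width, height, fps):
    """Convert bits per pixel to kilobits per second for a given video format."""
    bits_per_second = bpp * width * height * fps
    return bits_per_second / 1000.0

# Assumed format: 1280x720 at 30 frames per second (hypothetical values).
rate = bitrate_kbps(bpp=0.005, width=1280, height=720, fps=30)
print(f"{rate:.2f} kbps")
```

Under these assumed settings, 0.005 BPP works out to roughly 138 kbps.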