M3-CVC: Controllable Video Compression with Multimodal Generative Models

Rui Wan, Qi Zheng, Yibo Fan

TL;DR

The paper tackles controllable video compression at ultra-low bitrates by introducing M3-CVC, which combines three components: a semantic-motion keyframe selection strategy, an LMM-based multi-round dialogue that produces hierarchical textual guidance ($D_i^F$, $D_i^C$), and diffusion-based, text-guided reconstruction of both keyframes and clips. It introduces a differentiable keyframe decision function $D(F_n,F_l)$ and a VAE-based keyframe encoder whose latents are quantized over a vocabulary into discrete indices $K^*$; during decoding, conditional diffusion models use the textual descriptions to guide reconstruction. Experiments on the HEVC Class B, HEVC Class C, UVG, and MCL-JCV datasets, evaluated with LPIPS and CLIP-sim, show substantial BD-rate savings and improved semantic fidelity over VVC, especially at ultra-low bitrates, which suggests practical potential for bandwidth-constrained video applications. The approach provides interpretable control through LMM prompts and leverages GPU-accelerated diffusion to enable efficient, controllable video reconstruction.
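To make the phrase "discrete latent indices $K^*$ quantized over a vocabulary" concrete, here is a minimal sketch of nearest-neighbour vector quantization, the standard way a continuous latent is mapped to an index in a codebook. It is an illustration only, not the paper's encoder: the function names `quantize_to_indices` and `dequantize`, the toy codebook, and the squared Euclidean distance are all assumptions made for this example.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_to_indices(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry.

    Hypothetical stand-in for producing discrete indices (K* in the summary);
    the real encoder and vocabulary are not specified here.
    """
    indices = []
    for z in latents:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(z, codebook[k]),
        )
        indices.append(nearest)
    return indices


def dequantize(
    indices: Sequence[int],
    codebook: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Look the indices back up in the codebook (decoder-side step)."""
    return [list(codebook[k]) for k in indices]


# Toy example: a 3-entry vocabulary of 2-D code vectors (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [3.0, 3.5]]
indices = quantize_to_indices(latents, codebook)
reconstructed = dequantize(indices, codebook)
```

Only the integer indices need to be transmitted, which is why a representation of this kind suits very low bitrates; the decoder recovers approximate latents from a codebook it shares with the encoder.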

Abstract

Traditional and neural video codecs commonly encounter limitations in controllability and generality under ultra-low-bitrate coding scenarios. To overcome these challenges, we propose M3-CVC, a controllable video compression framework incorporating multimodal generative models. The framework utilizes a semantic-motion composite strategy for keyframe selection to retain critical information. For each keyframe and its corresponding video clip, a dialogue-based large multimodal model (LMM) approach extracts hierarchical spatiotemporal details, enabling both inter-frame and intra-frame representations for improved video fidelity while enhancing encoding interpretability. M3-CVC further employs a conditional diffusion-based, text-guided keyframe compression method, achieving high fidelity in frame reconstruction. During decoding, textual descriptions derived from LMMs guide the diffusion process to restore the original video's content accurately. Experimental results demonstrate that M3-CVC significantly outperforms the state-of-the-art VVC standard in ultra-low bitrate scenarios, particularly in preserving semantic and perceptual fidelity.
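The semantic-motion composite strategy for keyframe selection can be pictured with a small sketch. Everything below is an assumption for illustration: the `Frame` fields, the cosine-based semantic distance, the mean absolute pixel difference used as a crude motion proxy, the two thresholds, and the hard OR rule. The paper's actual decision function $D(F_n, F_l)$ is not reproduced here, and this hard-threshold version is not differentiable.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    embedding: List[float]  # hypothetical semantic feature vector
    pixels: List[float]     # flattened grayscale values in [0, 1]


def semantic_distance(a: Frame, b: Frame) -> float:
    """1 - cosine similarity between the two frame embeddings."""
    dot = sum(x * y for x, y in zip(a.embedding, b.embedding))
    norm_a = math.sqrt(sum(x * x for x in a.embedding))
    norm_b = math.sqrt(sum(y * y for y in b.embedding))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def motion_magnitude(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference, a crude stand-in for a motion measure."""
    return sum(abs(x - y) for x, y in zip(a.pixels, b.pixels)) / len(a.pixels)


def is_new_keyframe(f_n: Frame, f_l: Frame,
                    tau_sem: float = 0.3, tau_motion: float = 0.2) -> bool:
    """Assumed rule: start a new clip if either cue exceeds its threshold."""
    return (semantic_distance(f_n, f_l) > tau_sem
            or motion_magnitude(f_n, f_l) > tau_motion)


def select_keyframes(frames: List[Frame]) -> List[int]:
    """Compare each frame F_n with the last keyframe F_l; return keyframe indices."""
    if not frames:
        return []
    keyframes = [0]  # the first frame always opens a clip
    for n in range(1, len(frames)):
        if is_new_keyframe(frames[n], frames[keyframes[-1]]):
            keyframes.append(n)
    return keyframes


frames = [
    Frame([1.0, 0.0], [0.0, 0.0, 0.0, 0.0]),
    Frame([1.0, 0.1], [0.05, 0.0, 0.0, 0.0]),  # nearly identical to frame 0
    Frame([0.0, 1.0], [0.0, 0.0, 0.0, 0.0]),   # semantic change
    Frame([0.0, 1.0], [1.0, 1.0, 1.0, 1.0]),   # large pixel change
    Frame([0.0, 1.0], [1.0, 1.0, 1.0, 0.9]),   # small change
]
keyframe_indices = select_keyframes(frames)
```

Each selected index opens a clip that runs until the next keyframe, which matches the abstract's pairing of every keyframe with its corresponding video clip.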

Paper Structure

This paper contains 12 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the proposed M3-CVC framework
  • Figure 2: Multi-round dialogue-based visual information extraction strategy
  • Figure 3: Overview of the generative image codec for keyframe compression
  • Figure 4: R-D performance evaluated with LPIPS and CLIP-sim on the HEVC Class B, HEVC Class C, UVG, and MCL-JCV datasets
  • Figure 5: Visual quality comparison between the ground-truth video, VVC, and M3-CVC