M3-CVC: Controllable Video Compression with Multimodal Generative Models
Rui Wan, Qi Zheng, Yibo Fan
TL;DR
The paper tackles controllable video compression at ultra-low bitrates by introducing M3-CVC, which combines three components: a semantic-motion keyframe selection strategy, an LMM-based multi-round dialogue that produces hierarchical textual guidance ($D_i^F$, $D_i^C$), and diffusion-based, text-guided reconstruction of both keyframes and clips. It introduces a differentiable keyframe decision function $D(F_n,F_l)$ and a VAE-based keyframe encoder whose latents are quantized to discrete indices $K^*$ over a vocabulary. During decoding, conditional diffusion models reconstruct the video under the guidance of the textual descriptions. Experimental results on standard datasets show substantial BD-rate savings and improved semantic fidelity over VVC, especially at ultra-low bitrates, indicating practical potential for bandwidth-constrained video applications. The approach offers interpretable control through LMM prompts and relies on GPU-accelerated diffusion for efficient, controllable video reconstruction.
Abstract
Traditional and neural video codecs commonly encounter limitations in controllability and generality under ultra-low-bitrate coding scenarios. To overcome these challenges, we propose M3-CVC, a controllable video compression framework incorporating multimodal generative models. The framework utilizes a semantic-motion composite strategy for keyframe selection to retain critical information. For each keyframe and its corresponding video clip, a dialogue-based large multimodal model (LMM) approach extracts hierarchical spatiotemporal details, enabling both inter-frame and intra-frame representations for improved video fidelity while enhancing encoding interpretability. M3-CVC further employs a conditional diffusion-based, text-guided keyframe compression method, achieving high fidelity in frame reconstruction. During decoding, textual descriptions derived from LMMs guide the diffusion process to restore the original video's content accurately. Experimental results demonstrate that M3-CVC significantly outperforms the state-of-the-art VVC standard in ultra-low bitrate scenarios, particularly in preserving semantic and perceptual fidelity.
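As an illustration of the semantic-motion composite strategy for keyframe selection, the following is a minimal sketch, not the authors' decision function. It assumes each frame already has a semantic embedding and a scalar motion magnitude relative to the previous frame, and it declares a new keyframe when a weighted sum of semantic change (with respect to the last keyframe) and accumulated motion exceeds a hard threshold. The names `select_keyframes`, `alpha`, and `threshold`, as well as the scoring rule itself, are hypothetical and chosen only for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 0.0


def select_keyframes(semantic_feats, motion_mags, alpha=0.5, threshold=0.3):
    """Pick keyframe indices from per-frame features (illustrative only).

    semantic_feats[i]: embedding of frame i (assumed to be precomputed).
    motion_mags[i]:    motion magnitude between frame i-1 and frame i;
                       motion_mags[0] is ignored.
    alpha, threshold:  hypothetical weighting and cutoff, not from the paper.
    """
    keyframes = [0]  # the first frame always starts a clip
    accumulated_motion = 0.0
    for n in range(1, len(semantic_feats)):
        accumulated_motion += motion_mags[n]
        last = keyframes[-1]
        semantic_change = cosine_distance(semantic_feats[n], semantic_feats[last])
        # Composite score: semantic drift plus motion since the last keyframe.
        score = alpha * semantic_change + (1.0 - alpha) * accumulated_motion
        if score > threshold:
            keyframes.append(n)
            accumulated_motion = 0.0  # start a new clip at frame n
    return keyframes


# A scene change at frame 2 triggers a keyframe via the semantic term.
scene_cut = select_keyframes(
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    [0.0, 0.05, 0.05, 0.05],
)

# Static content with steady motion triggers a keyframe via the motion term.
steady_motion = select_keyframes(
    [[1.0, 0.0]] * 5,
    [0.0, 0.25, 0.25, 0.25, 0.25],
)
```

Each selected index marks the start of a clip, so the keyframe and the frames up to the next keyframe would form one unit for the subsequent description and reconstruction stages. The hard threshold is a simplification made for readability and is not differentiable.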
