When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Pingping Zhang, Jinlong Li, Kecheng Chen, Meng Wang, Long Xu, Haoliang Li, Nicu Sebe, Sam Kwong, Shiqi Wang

TL;DR

Preserving semantic content while compressing video at ultra-low bitrates is a key challenge for traditional codecs. CMVC addresses this by disentangling video into spatial content and motion, mapping each to multimodal representations via Multimodal Large Language Models (MLLMs), and enabling two decoding regimes: TT2V for semantic fidelity and IT2V for perceptual quality, with LoRA-tuned diffusion aiding frame interpolation. The approach introduces a keyframe-based encoder with cosine-similarity selection, a multimodal representation pipeline, and a decoder that supports flexible reconstruction under different bitrate constraints, backed by extensive experiments on standard benchmarks. The results demonstrate competitive semantic reconstruction (TT2V) and perceptual consistency (IT2V), illustrating the potential of combining MLLMs with cross-modality representations for efficient, flexible video coding in bandwidth-constrained scenarios.
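The cosine-similarity keyframe selection mentioned above can be illustrated with a minimal sketch. The summary does not specify which features are compared or how the decision is made, so everything here is an assumption for illustration: each frame is represented by a hypothetical feature vector, the helper names `cosine_similarity` and `select_keyframes` are invented, and a frame is taken as a new keyframe when its similarity to the most recent keyframe drops below a hypothetical threshold of 0.9.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def select_keyframes(features, threshold=0.9):
    """Return the indices of frames chosen as keyframes.

    Assumption (not from the paper): a frame starts a new clip when its
    similarity to the most recent keyframe falls below `threshold`.
    """
    if not features:
        return []
    keyframes = [0]  # the first frame always opens the first clip
    for i in range(1, len(features)):
        if cosine_similarity(features[keyframes[-1]], features[i]) < threshold:
            keyframes.append(i)
    return keyframes


# Toy 2-D features: frames 0-1 look alike, frames 2-3 look alike.
frames = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.05, 0.99],
]
print(select_keyframes(frames))  # [0, 2]
```

In this toy input the scene changes at frame 2, so the video would be split into two clips, each described by its own spatial and motion representations.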

Abstract

Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video compression. Herein, we introduce a unified paradigm for Cross-Modality Video Coding (CMVC), which is a pioneering approach to explore multimodality representation and video generative models in video coding. Specifically, on the encoder side, we disentangle a video into spatial content and motion components, which are subsequently transformed into distinct modalities to achieve a very compact representation by leveraging MLLMs. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes that optimize video reconstruction quality for specific decoding requirements, including Text-Text-to-Video (TT2V) mode to ensure high-quality semantic information and Image-Text-to-Video (IT2V) mode to achieve superb perceptual consistency. In addition, we propose an efficient frame interpolation model for IT2V mode via Low-Rank Adaptation (LoRA) tuning to guarantee perceptual quality, which allows the generated motion cues to behave smoothly. Experiments on benchmarks indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency. These results highlight potential directions for future research in video coding.

Paper Structure (14 sections, 8 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The framework of the proposed CMVC scheme. The framework first segments the video into distinct clips using a keyframe selection strategy (a), allowing both spatial and temporal components to be extracted from each clip. MLLMs are then employed to generate multimodal representations of these components. For instance, spatial information can be represented through text or images, while temporal dynamics may be encoded using text or audio modalities. These multimodal representations are encoded by their respective encoders, yielding a compressed bitstream for each component. The bitstreams of the different components are then combined and transmitted to the decoder. In the decoder, we provide two exemplary modes for video generation: TT2V (c) and IT2V (d). The framework integrates various state-of-the-art (SoTA) models and mode conversions while maintaining semantic and perceptual quality at relatively high compression ratios.
  • Figure 2: The workflow of the IT2V generative model. Two LoRAs are trained to fit the two keyframe images ($I_0$ and $I_1$), respectively. To generate the $w$-th frame between $I_0$ and $I_1$, we interpolate $I'_{w}$ and the LoRA parameters according to the weights $w_i$ and $w_l$.
  • Figure 3: Left: Comparison of combinations of different V2T models (VideoLLaVA and VideoLLaMA) and TT2V models (VideoCrafter1, VideoCrafter2, ModelScope, OpenSora and AnimateDiff). Right: Visual quality comparison of the TT2V mode and VTM. At ultra-low bitrates (ULB), our proposed TT2V mode successfully preserves the semantic quality of the videos. In contrast, VTM introduces significant blocking artifacts, which impede the effective conveyance of semantic information in videos.
  • Figure 4: Visual quality comparison. The values represent the BPP (1e-2) and the DISTS value. A lower DISTS value indicates better perceptual quality.
  • Figure 5: The R-D performance comparison in the IT2V mode. The comparisons are performed on the Class B, Class C, Class D, Class E, UVG, and MCL-JCV datasets.
  • ...and 2 more figures
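
The interpolation described in the Figure 2 caption, where the LoRA parameters fitted to $I_0$ and $I_1$ are blended according to a weight $w_l$, might look roughly like the sketch below. It assumes a plain linear blend and represents each LoRA as a dictionary of flattened parameter lists; the function names, the parameter keys, and the linear schedule are illustrative assumptions, not the paper's implementation.

```python
def lerp(a, b, weight):
    """Linear blend of two equal-length lists: weight=0 gives a, weight=1 gives b."""
    return [(1.0 - weight) * x + weight * y for x, y in zip(a, b)]


def interpolate_lora(lora_0, lora_1, w_l):
    """Blend two LoRA parameter sets that share the same keys and shapes.

    Assumption: a simple linear interpolation controlled by w_l in [0, 1].
    """
    if lora_0.keys() != lora_1.keys():
        raise ValueError("LoRA parameter sets must share the same keys")
    return {name: lerp(lora_0[name], lora_1[name], w_l) for name in lora_0}


# Toy flattened parameters for the LoRAs fitted to I_0 and I_1 (hypothetical keys).
lora_0 = {"attn.down": [0.0, 2.0], "attn.up": [1.0, 1.0]}
lora_1 = {"attn.down": [4.0, 2.0], "attn.up": [3.0, -1.0]}

halfway = interpolate_lora(lora_0, lora_1, 0.5)
print(halfway)  # {'attn.down': [2.0, 2.0], 'attn.up': [2.0, 0.0]}
```

The same `lerp` helper could be applied with the weight $w_i$ to blend the two keyframe representations into $I'_{w}$. Note that blending the low-rank factors separately is not the same as blending their product, so this sketch fixes one of several possible conventions.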