SMC++: Masked Learning of Unsupervised Video Semantic Compression

Yuan Tian, Xiaoyue Ling, Cong Geng, Qiang Hu, Guo Lu, Guangtao Zhai

TL;DR

This work addresses unsupervised video semantic compression (UVSC) by leveraging Masked Video Modeling (MVM) to preserve semantics while minimizing bitrate. It introduces Non-Semantics Suppressed (NSS) learning to reduce non-semantic entropy in the MVM token space, and formulates a principled objective that links semantic preservation to information-theoretic quantities. The authors present a simple SMC baseline and an advanced SMC++ with masked motion prediction and a Blueprint-guided compression Transformer (Blue-Tr), achieving state-of-the-art semantic performance across action recognition, MOT, and VOS on multiple datasets and codecs. The approach demonstrates strong cross-task generalization, robustness, and practical decoding efficiency, with potential for integration with downstream AI models including large language-model-based systems.

Abstract

Most video compression methods focus on human visual perception and neglect semantic preservation. This leads to severe semantic loss during compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that specifically preserves video semantics by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information such as trivial textural details, wasting bits and introducing semantic noise. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC into an advanced SMC++ model in several respects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning. Second, we introduce a Transformer-based compression module to improve semantic compression efficacy. Because directly mining the complex redundancy among heterogeneous features from different coding stages is non-trivial, we introduce a compact blueprint semantic representation that aligns these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate that the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs on three video analysis tasks and seven datasets. Code and models are available at: https://github.com/tianyuan168326/VideoSemanticCompression-Pytorch.
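
To make the masked patch prediction task mentioned in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It hides a fraction of patches and scores a predictor only on the hidden ones. The function names `make_mask` and `masked_mse`, the 0.75 mask ratio, and the toy patch values are all assumptions made for illustration. Which video supplies the visible patches and which supplies the targets is determined by the paper's framework and is not modeled here.

```python
import random


def make_mask(num_patches, mask_ratio, seed=0):
    """Return one boolean per patch; True marks a patch hidden from the model."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(target, prediction, mask):
    """Mean squared error computed over the masked (hidden) patches only."""
    total, count = 0.0, 0
    for tgt, pred, hidden in zip(target, prediction, mask):
        if not hidden:
            continue  # visible patches do not contribute to the loss
        for t, p in zip(tgt, pred):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# Toy data (hypothetical): 8 patches with 4 pixel values each.
patches = [[float(i + j) for j in range(4)] for i in range(8)]
mask = make_mask(len(patches), mask_ratio=0.75)  # ratio is an assumption
prediction = [[0.0] * 4 for _ in patches]        # trivial all-zero predictor
loss = masked_mse(patches, prediction, mask)
```

Restricting the loss to hidden patches is what forces the model to infer content from context rather than copy its input. The non-semantic entropy regularizer described in the abstract would be an additional term on top of such a loss and is not shown.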

Paper Structure (16 sections, 25 equations, 16 figures, 16 tables)

Figures (16)

  • Figure 1: Framework overview. First, the semantic features of the original and the lossy videos are separately extracted by Sem-Net. Then, the original semantic feature is compressed with the aid of the lossy semantic feature by Basic-CM/Blue-Tr. Finally, the reconstructed semantic feature and the lossy video are fused by F-Net, generating a semantically sound video that supports various analysis tasks. The framework is optimized via a non-semantics suppressed Masked Video Modeling (MVM) objective. In the basic model SMC, the MVM task only predicts pixels, and the semantics are compressed by a simple residue-based compression module (Basic-CM). In the improved model SMC++, a motion-prediction MVM objective is further incorporated for better temporal semantic modeling, and a powerful Blueprint-guided compression Transformer (Blue-Tr) is introduced. $\times$ denotes the gradient-stopping operation. Q denotes the quantization operation. Our framework can further support high-fidelity video decoding by appending a detail rendering network (DR-Net).
  • Figure 2: Blueprint-guided compression Transformer (Blue-Tr) first extracts a blueprint semantic feature, which is then employed to align the diverse available features. Finally, a decomposed Transformer compresses redundancies among the current frame's semantic feature and the aligned features. The feature aligner module (b) aligns different features by using the blueprint feature as guidance. Q denotes the quantization operation. The current lossy semantic feature $\tilde{S}^t$ is also fed into the Blue-Encoder/Decoder to aid the compression of the blueprint feature; this is omitted from the figure for simplicity.
  • Figure 3: Visualization of the patch features learned by different semantic objectives, i.e., (a) vanilla MAE loss $\mathcal{L}_{MAE}$ and (b) our non-semantics suppressed MAE loss $\mathcal{L}_{Sem}$.
  • Figure 4: Semantic coding performance on Action Recognition (1st and 2nd rows), VOS (3rd row), and MOT (4th row) tasks. The plot titles are in {Dataset}-{Task Model} format. The codec setting is LDP mode with GOP size 10. The results on more advanced codec settings are provided in the supplementary material.
  • Figure 5: Comparison of our approach with two recent approaches, i.e., DeepSVC [lin2023deepsvc] and HVFVC [li2023high], on action recognition and video object segmentation (VOS) tasks.
  • ...and 11 more figures
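
The Figure 1 caption describes a residue-based compression module (Basic-CM) followed by a quantization operation Q. As a rough illustration of that general idea only, the sketch below codes the difference between an original feature vector and a lossy one with uniform scalar quantization. It is a simplification under stated assumptions: the actual Basic-CM is a learned module, and the function names, the step size of 0.25, and the toy feature values are hypothetical. Entropy coding, gradient stopping, and the F-Net fusion step are not modeled.

```python
def quantize(values, step):
    """Uniform scalar quantization: map each value to an integer symbol."""
    return [round(v / step) for v in values]


def dequantize(symbols, step):
    """Map integer symbols back to reconstruction levels."""
    return [s * step for s in symbols]


def encode_residual(original_sem, lossy_sem, step):
    """Quantize only what the lossy feature is missing (the residue)."""
    residual = [o - l for o, l in zip(original_sem, lossy_sem)]
    return quantize(residual, step)


def decode_residual(symbols, lossy_sem, step):
    """Add the dequantized residue back onto the lossy feature."""
    return [l + r for l, r in zip(lossy_sem, dequantize(symbols, step))]


# Toy feature vectors (hypothetical values, not taken from the paper).
original_sem = [0.90, -1.20, 0.35, 2.10]
lossy_sem = [1.00, -1.00, 0.00, 2.00]
step = 0.25  # quantization step size, an assumption

symbols = encode_residual(original_sem, lossy_sem, step)
reconstructed = decode_residual(symbols, lossy_sem, step)
max_error = max(abs(o - r) for o, r in zip(original_sem, reconstructed))
```

Because the decoder already holds the lossy feature, only the small integer symbols need to be transmitted, and with uniform quantization the reconstruction error per element stays within half a step.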