Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

Yuan Tian, Guo Lu, Guangtao Zhai

TL;DR

This work boosts unsupervised video semantic compression (UVSC) by absorbing off-the-shelf rich semantics from visual foundation models (VFMs). It introduces a VFM-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs.

Abstract

Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited due to factors such as a single semantic learning objective and limited training data. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from visual foundation models (VFMs). Specifically, we introduce a VFM-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually enhanced semantic space that guides the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory from the historical content and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bit cost of the system, further improving compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.
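
The alignment idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes, purely for illustration, that the shared alignment layer is a single linear projection, that each VFM-specific prompt is an additive vector, and that alignment is measured with a cosine distance. The names `align`, `alignment_loss`, `vfm_a`, and `vfm_b` are hypothetical.

```python
import math


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def align(f_hat, shared_weights, prompt):
    """Assumed form: one projection shared by all VFMs, shifted by a
    VFM-specific prompt vector."""
    projected = matvec(shared_weights, f_hat)
    return [p + q for p, q in zip(projected, prompt)]


def alignment_loss(f_hat, shared_weights, prompts, vfm_features):
    """Mean cosine distance between the aligned compressed feature and
    each VFM's feature. Returns (mean loss, per-VFM losses)."""
    per_vfm = {}
    for name, target in vfm_features.items():
        aligned = align(f_hat, shared_weights, prompts[name])
        per_vfm[name] = 1.0 - cosine(aligned, target)
    mean_loss = sum(per_vfm.values()) / len(per_vfm)
    return mean_loss, per_vfm


# Toy data: one compressed feature and two hypothetical VFM targets.
f_hat = [1.0, 0.0, 2.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
prompts = {"vfm_a": [0.0, 0.0, 0.0], "vfm_b": [0.0, 1.0, 0.0]}
vfm_features = {"vfm_a": [1.0, 0.0, 2.0], "vfm_b": [1.0, 1.0, 2.0]}

mean_loss, per_vfm = alignment_loss(f_hat, identity, prompts, vfm_features)
print(mean_loss, per_vfm)
```

In this toy setup the same projection serves both targets and only the prompt differs, which is the sense in which a shared layer with per-VFM prompts can align one compressed feature to several semantic spaces. In training, the loss would be minimized jointly over the compression model, the shared weights, and the prompts.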

Paper Structure

This paper contains 11 sections, 2 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Recent video semantic compression paradigms. (a) Deploying plain codecs like VVC directly to downstream tasks yields inferior results. (b) Adapting learnable codecs with task-specific supervision, such as action recognition, requires pre-deployment training and data labels, yet performs poorly on other tasks like multiple object tracking (MOT), and is thus impractical. (c) Prior unsupervised methods driven by self-supervised learning (SSL) yield unsatisfactory results due to the limited learned semantics. (d) Our approach, driven by pre-trained visual foundation models (VFMs), achieves strong results by absorbing their rich semantics. Evaluation is conducted on UCF101@0.02bpp and MOT17@0.01bpp, using TSM (Lin et al., 2019) and ByteTrack (Zhang et al., 2021), respectively. The network architectures of (b), (c), and (d) are consistent for a fair comparison.
  • Figure 2: Overview of the Free-VSC framework, which learns to absorb rich semantics from multiple VFMs into the compression procedure. A prompt-based semantic alignment layer (Prom-SAL) is introduced to flexibly align the compressed video feature $\hat{f}$ to the semantic space of the VFMs. We also propose a trajectory-based entropy model for efficiently compressing inter-frame semantic redundancy. We illustrate two VFMs ($V_1$ and $V_2$) and three semantic trajectories for simplicity, although our approach supports more VFMs and trajectories.
  • Figure 3: Semantic compression performance on Action Recognition, MOT and VOS tasks. The plot titles are in {Dataset}-{Model} format.
  • Figure 4: Qualitative comparison of frames compressed by different methods. The frame is from the HEVC Class C dataset. The numbers in parentheses indicate the compression ratio.
  • Figure 5: (a) Ablation on the framework. (b) Effectiveness of introducing VFM semantics to other approaches. (c) Comparison of different entropy models. 'wo' denotes "without" the named component.