Geometric Transformation-Embedded Mamba for Learned Video Compression

Hao Wei, Yanhui Zhou, Chenyang Ge

TL;DR

This work introduces a streamlined yet effective video compression framework built on a direct transform strategy (nonlinear transform, quantization, and entropy coding). It outperforms state-of-the-art video compression approaches in perceptual quality and temporal consistency under low-bitrate constraints.

Abstract

Although learned video compression methods have exhibited outstanding performance, most follow a hybrid coding paradigm that requires explicit motion estimation and compensation, which makes the overall solution complex. In contrast, we introduce a streamlined yet effective video compression framework founded on a direct transform strategy, i.e., nonlinear transform, quantization, and entropy coding. We first develop a cascaded Mamba module (CMM) with different embedded geometric transformations to effectively explore both long-range spatial and temporal dependencies. To improve local spatial representation, we introduce a locality refinement feed-forward network (LRFFN) that incorporates a hybrid convolution block based on difference convolutions. We integrate the proposed CMM and LRFFN into the encoder and decoder of our compression framework. Moreover, we present a conditional channel-wise entropy model that uses conditional temporal priors to accurately estimate the probability distributions of the current latent features. Extensive experiments demonstrate that our method outperforms state-of-the-art video compression approaches in terms of perceptual quality and temporal consistency under low-bitrate constraints. Our source code and models will be available at https://github.com/cshw2021/GTEM-LVC.
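
As a rough illustration of the direct transform strategy named in the abstract (nonlinear transform, quantization, entropy coding), the toy Python sketch below runs a short signal through the three stages. Everything in it is an assumption made for illustration: `analysis_transform`, `synthesis_transform`, and the `gain` parameter stand in for the learned encoder and decoder, and the fixed Gaussian prior with parameters `mu` and `sigma` stands in for the conditional channel-wise entropy model. It is not the authors' implementation.

```python
import math


def analysis_transform(x, gain=4.0):
    # Hypothetical stand-in for the learned nonlinear encoder:
    # a signed square root followed by a gain.
    return [gain * math.copysign(math.sqrt(abs(v)), v) for v in x]


def synthesis_transform(y, gain=4.0):
    # Exact inverse of analysis_transform when y is not quantized.
    return [math.copysign((v / gain) ** 2, v) for v in y]


def quantize(y):
    # Round each latent to the nearest integer (ties round upward).
    return [math.floor(v + 0.5) for v in y]


def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def estimate_bits(symbols, mu, sigma):
    # Rate estimate: each integer symbol costs -log2 of the Gaussian
    # probability mass on the unit-width bin centred on it.
    total = 0.0
    for s in symbols:
        p = gaussian_cdf(s + 0.5, mu, sigma) - gaussian_cdf(s - 0.5, mu, sigma)
        total += -math.log2(max(p, 1e-9))
    return total


frame = [0.10, -0.25, 0.40, 0.05]          # toy "pixels"
y = analysis_transform(frame)              # nonlinear transform
y_hat = quantize(y)                        # quantization
x_hat = synthesis_transform(y_hat)         # reconstruction

# The same symbols cost fewer bits under a prior that matches them better.
bits_matched_prior = estimate_bits(y_hat, mu=0.0, sigma=2.0)
bits_loose_prior = estimate_bits(y_hat, mu=0.0, sigma=20.0)
```

The last two lines hint at why a more accurate probability estimate matters: the quantized symbols are unchanged, yet the estimated rate drops when the assumed distribution fits them more closely. In the paper that estimate comes from conditional temporal priors rather than a fixed Gaussian.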

Paper Structure

This paper contains 25 sections, 9 equations, 12 figures, and 1 table.

Figures (12)

  • Figure 1: Architectural comparisons with existing video compression methods: (a) shows a hybrid coding framework; (b) is a transform-based coding method that is both frame-independent and latent-dependent; (c) illustrates the proposed coding approach that is frame-dependent and latent-dependent.
  • Figure 2: The overall architecture of the proposed video compression framework. To capture both long-range spatio-temporal dependencies and local dependencies, we develop the cascaded Mamba module (CMM) for global modeling and the locality refinement feed-forward network (LRFFN) for local modeling. The CMM is implemented by sequentially traversing video features in four directional orders, namely forward spatio-temporal (FST), backward spatio-temporal (BST), forward temporal–spatial (FTS), and backward temporal–spatial (BTS).
  • Figure 3: Detailed architecture of the geometric transformation Mamba block and the different selective scanning methods. We first extract spatio-temporal video patches and then perform selective scanning along the spatial and temporal dimensions, depending on the priority assigned to each dimension.
  • Figure 4: Detailed architecture of the proposed locality refinement feed-forward network.
  • Figure 5: Detailed architecture of the proposed conditional channel-wise entropy model.
  • ...and 7 more figures
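
The four traversal orders named in the Figure 2 and Figure 3 captions can be pictured with a small Python sketch. It assumes that "spatio-temporal" means spatial positions are visited first within each frame before moving to the next frame, that "temporal–spatial" means all frames are visited at one position before moving to the next position, and that the backward variants are plain reversals. The function name `scan_orders` and the `(t, h, w)` index layout are hypothetical, and the paper's actual scans (implemented through embedded geometric transformations) may differ.

```python
def scan_orders(T, H, W):
    """Return token visiting orders for a T x H x W grid of video patches.

    Each order is a list of (t, h, w) indices. The definitions are an
    illustrative assumption, not the paper's exact scanning scheme.
    """
    # Spatial priority: sweep every position of frame 0, then frame 1, ...
    fst = [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
    # Temporal priority: sweep every frame at one position, then the next position.
    fts = [(t, h, w) for h in range(H) for w in range(W) for t in range(T)]
    return {
        "FST": fst,
        "BST": fst[::-1],  # same path, traversed in reverse
        "FTS": fts,
        "BTS": fts[::-1],
    }


orders = scan_orders(T=2, H=1, W=2)
for name, order in orders.items():
    print(name, order)
```

Each order visits every patch exactly once, so the four sequences are permutations of the same token set. Feeding all four to a sequence model gives every token context from both spatial and temporal neighbours in both directions.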