
Fine-Grained Motion Compression and Selective Temporal Fusion for Neural B-Frame Video Coding

Xihua Sheng, Peilin Chen, Meng Wang, Li Zhang, Shiqi Wang, Dapeng Oliver Wu

TL;DR

This work proposes a fine-grained motion compression method and a selective temporal fusion method for neural B-frame video coding. The resulting codec achieves an average BD-rate reduction of approximately 10% compared to the state-of-the-art neural B-frame codec, DCVC-B, and delivers comparable or even superior compression performance to the H.266/VVC reference software under random-access configurations.

Abstract

With the remarkable progress in neural P-frame video coding, neural B-frame coding has recently emerged as a critical research direction. However, most existing neural B-frame codecs directly adopt P-frame coding tools without adequately addressing the unique challenges of B-frame compression, leading to suboptimal performance. To bridge this gap, we propose novel enhancements for motion compression and temporal fusion for neural B-frame coding. First, we design a fine-grained motion compression method. This method incorporates an interactive dual-branch motion auto-encoder with per-branch adaptive quantization steps, which enables fine-grained compression of bi-directional motion vectors while accommodating their asymmetric bitrate allocation and reconstruction quality requirements. Furthermore, this method involves an interactive motion entropy model that exploits correlations between bi-directional motion latent representations by interactively leveraging partitioned latent segments as directional priors. Second, we propose a selective temporal fusion method that predicts bi-directional fusion weights to achieve discriminative utilization of bi-directional multi-scale temporal contexts with varying qualities. Additionally, this method introduces a hyperprior-based implicit alignment mechanism for contextual entropy modeling. By treating the hyperprior as a surrogate for the contextual latent representation, this mechanism implicitly mitigates the misalignment in the fused bi-directional temporal priors. Extensive experiments demonstrate that our proposed codec achieves an average BD-rate reduction of approximately 10% compared to the state-of-the-art neural B-frame codec, DCVC-B, and delivers comparable or even superior compression performance to the H.266/VVC reference software under random-access configurations.
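
As a rough illustration of the selective temporal fusion idea described above, the sketch below blends a forward and a backward temporal context with per-element weights that sum to one. It is only a minimal sketch under stated assumptions: the abstract does not say how the fusion weights are normalized, so the two-way softmax here is an assumption, and the names `softmax_pair`, `fuse_contexts`, `scores_f`, and `scores_b` are hypothetical. In the proposed codec the weights are predicted by a network from multi-scale contexts; here the raw scores are simply passed in.

```python
import math


def softmax_pair(score_f, score_b):
    """Turn two raw scores into two weights that sum to 1 (assumed normalization)."""
    m = max(score_f, score_b)          # subtract the max for numerical stability
    e_f = math.exp(score_f - m)
    e_b = math.exp(score_b - m)
    total = e_f + e_b
    return e_f / total, e_b / total


def fuse_contexts(ctx_f, ctx_b, scores_f, scores_b):
    """Blend forward and backward contexts element by element.

    ctx_f, ctx_b       : flat lists standing in for the two temporal contexts
    scores_f, scores_b : hypothetical raw scores (a network would predict these)
    """
    fused = []
    for c_f, c_b, s_f, s_b in zip(ctx_f, ctx_b, scores_f, scores_b):
        w_f, w_b = softmax_pair(s_f, s_b)
        fused.append(w_f * c_f + w_b * c_b)
    return fused


# Equal scores reduce to plain averaging of the two contexts.
averaged = fuse_contexts([1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0])

# A much higher forward score makes the fused value follow the forward context.
forward_led = fuse_contexts([1.0], [3.0], [100.0], [-100.0])
```

With equal scores the fusion degenerates to a simple average, which is the non-discriminative baseline; unequal scores let the higher-quality reference dominate at each position.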

Paper Structure

This paper contains 30 sections, 12 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: The framework of our proposed neural B-frame video codec, which consists of two major contributions: a fine-grained motion compression method with an interactive dual-branch auto-encoder and an interactive motion entropy model, and a selective temporal fusion method employing a discriminative contextual encoder-decoder and a discriminative contextual entropy model.
  • Figure 2: Illustration of our proposed interactive dual-branch motion auto-encoder with per-branch adaptive quantization steps. "Q" refers to the rounding-based quantization operator. "AE" and "AD" refer to the arithmetic encoder and decoder, respectively. $q_{t\rightarrow f}^{m,e}$, $q_{t\rightarrow b}^{m,e}$, $q_{t\rightarrow f}^{m,d}$, and $q_{t\rightarrow b}^{m,d}$ are learnable quantization steps.
  • Figure 3: Illustration of our proposed interactive motion entropy model.
  • Figure 4: Illustration of our proposed contextual auto-encoder with bi-directional weighting-based context fusion. "Q" refers to the rounding-based quantization operator. "AE" and "AD" refer to the arithmetic encoder and decoder, respectively. $q_{t}^{c,e}$ and $q_{t}^{c,d}$ are learnable quantization steps.
  • Figure 5: Illustration of our proposed implicit alignment-based prior fusion.
  • ...and 6 more figures
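
The per-branch learnable quantization steps named in the Figure 2 caption can be illustrated with a small sketch. The assumptions are loud ones: the caption only states that "Q" is rounding-based and that each branch has an encoder-side and a decoder-side step, so the convention used here (divide by the encoder-side step, round, multiply by the decoder-side step) is an assumption, and the function name `quantize_branch` and all numeric values are hypothetical. In the codec the steps are learned; here they are fixed constants.

```python
def quantize_branch(latent, q_enc, q_dec):
    """Quantize one motion branch with its own step sizes (assumed convention).

    latent : flat list standing in for a motion latent representation
    q_enc  : encoder-side step, applied before rounding
    q_dec  : decoder-side step, applied after rounding
    """
    # Python's round() uses round-half-to-even; the codec's exact rounding
    # rule is not specified in the caption.
    symbols = [round(v / q_enc) for v in latent]   # integers sent to the arithmetic coder
    recon = [s * q_dec for s in symbols]           # rescaled latent at the decoder
    return symbols, recon


motion_latent = [0.9, -2.2]

# Hypothetical forward branch: a fine step keeps more detail and costs more bits.
fwd_symbols, fwd_recon = quantize_branch(motion_latent, q_enc=0.5, q_dec=0.5)

# Hypothetical backward branch: a coarse step yields smaller symbols and fewer bits.
bwd_symbols, bwd_recon = quantize_branch(motion_latent, q_enc=2.0, q_dec=2.0)
```

Using separate steps per branch is what allows the two motion directions to receive different bitrates and reconstruction qualities, which is the asymmetry the abstract refers to.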