Bi-Directional Deep Contextual Video Compression

Xihua Sheng; Li Li; Dong Liu; Shiqi Wang

TL;DR

This work addresses the limited performance of deep B-frame coding by introducing DCVC-B, a bi-directional deep contextual video compression framework designed for B-frames. It combines bi-directional motion difference context propagation, bi-directional temporal context mining, and hierarchical quality structure-based training to allocate bits efficiently and make better use of bi-directional information within a GOP of 32 frames. Key contributions include a motion difference propagation mechanism, multi-scale bi-directional temporal contexts, a context-conditioned encoder-decoder, an enhanced temporal entropy model, and a hierarchical training strategy. Together they achieve an average PSNR BD-rate of -26.6% and an MS-SSIM BD-rate of -49.9% against HM-RA-GOP16, and surpass the H.266/VVC reference software on certain test datasets under the random access configuration. The results suggest that careful handling of bi-directional information can move deep B-frame coding toward parity with traditional codecs, with practical implications for high-efficiency video compression in random-access scenarios.

Abstract

Deep video compression has made remarkable progress in recent years, with the majority of advancements concentrated on P-frame coding. Although efforts to enhance B-frame coding are ongoing, their compression performance is still far behind that of traditional bi-directional video codecs. In this paper, we introduce a bi-directional deep contextual video compression scheme tailored for B-frames, termed DCVC-B, to improve the compression performance of deep B-frame coding. Our scheme has three key innovations. First, we develop a bi-directional motion difference context propagation method for effective motion difference coding, which significantly reduces the bit cost of bi-directional motions. Second, we propose a bi-directional contextual compression model and a corresponding bi-directional temporal entropy model to make better use of the multi-scale temporal contexts. Third, we propose a hierarchical quality structure-based training strategy, leading to effective bit allocation across large groups of pictures (GOPs). Experimental results show that our DCVC-B achieves an average reduction of 26.6% in BD-Rate compared to the reference software for H.265/HEVC under random access conditions. Remarkably, it surpasses the performance of the H.266/VVC reference software on certain test datasets under the same configuration. We anticipate that our work can provide valuable insights and bring deep B-frame coding to the next level.
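
To make the hierarchical structure behind the bit allocation more concrete, the following is a minimal sketch of how temporal layers can be assigned in a dyadic GOP. The GOP size of 32 comes from this page. The function names, the convention that the two GOP-boundary frames form layer 0, and the layer-by-layer ordering are assumptions of this sketch, not details confirmed by the paper.

```python
def temporal_layers(gop_size=32):
    """Map each frame index in [0, gop_size] to a temporal layer.

    Assumption of this sketch: the two frames at the GOP boundaries form
    layer 0, and every dyadic split adds one deeper layer of B-frames.
    """
    if gop_size < 2 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two >= 2")
    layers = {0: 0, gop_size: 0}
    step, layer = gop_size, 1
    while step > 1:
        half = step // 2
        # Frames at the midpoint of every interval of length `step`.
        for mid in range(half, gop_size, step):
            layers[mid] = layer
        step, layer = half, layer + 1
    return layers


def b_frame_references(gop_size=32):
    """List (frame, forward_ref, backward_ref) triples, layer by layer.

    Each B-frame is predicted from the two already-coded frames that
    bracket it. A real encoder may use a different coding order.
    """
    refs = []
    step = gop_size
    while step > 1:
        half = step // 2
        for mid in range(half, gop_size, step):
            refs.append((mid, mid - half, mid + half))
        step = half
    return refs


layers = temporal_layers(32)
refs = b_frame_references(32)
print(len(set(layers.values())))  # number of distinct temporal layers
print(refs[:3])                   # first few (frame, forward, backward) triples
```

Under these assumptions a GOP of 32 yields 31 B-frames spread over five B-frame layers plus the boundary layer, which matches the six temporal layers mentioned in the paper's GOP figure. The per-layer quality coefficients used for training are not given on this page, so the sketch leaves them out.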

Paper Structure (34 sections, 16 equations, 17 figures, 7 tables)

Introduction
Related Work
Deep Video Compression for P-Frame
Deep Video Compression for B-Frame
Overview
GOP Structure
Bi-Directional Motion Estimation
Bi-Directional Motion Compression
Bi-Directional Temporal Context Mining
Bi-Directional Contextual Compression
Entropy Model
Methodology
Bi-directional Motion Difference Context Propagation
Bi-directional Temporal Context Mining
Bi-directional Contextual Compression
...and 19 more sections

Figures (17)

Figure 1: Overview of our proposed bi-directional deep contextual video compression scheme, DCVC-B. The motion estimation module estimates the bi-directional motion vectors ($v_{t\rightarrow f}$, $v_{t\rightarrow b}$) between the current frame $x_t$ and bi-directional reference frames ($\hat{x}_{f}$, $\hat{x}_{b}$) and also estimates the motion vector predictions ($v_{b\rightarrow f}$, $v_{f\rightarrow b}$) between ($\hat{x}_{f}$, $\hat{x}_{b}$). Then the motion vector differences (MVD) ($r_{t\rightarrow f}$, $r_{t\rightarrow b}$) between ($v_{t\rightarrow f}$, $v_{t\rightarrow b}$) and their predictions ($\frac{v_{b\rightarrow f}}{2}$, $\frac{v_{f\rightarrow b}}{2}$) are jointly compressed and decompressed by a motion encoder-decoder with our proposed bi-directional motion difference context propagation method. The reconstructed motion vectors ($\hat{v}_{t\rightarrow f}$, $\hat{v}_{t\rightarrow b}$) are used to perform bi-directional temporal context mining over the bi-directional reference features ($\hat{F}_{f}$, $\hat{F}_{b}$). The predicted bi-directional multi-scale temporal contexts ($C_f^0$, $C_f^1$, $C_f^2$), ($C_b^0$, $C_b^1$, $C_b^2$) are fed into a contextual encoder-decoder to help compress and decompress the current frame $x_t$. Before obtaining the reconstructed frame $\hat{x}_t$, we regard an intermediate feature $\hat{F}_t$ of the contextual decoder as the propagated reference feature.
Figure 2: Structure of the group of pictures (GOP) of our proposed DCVC-B scheme. Following the default random access configuration of the H.266/VVC reference software [bross2021overview], we set the intra period and GOP size to 32. There are six temporal layers within a GOP. We assign different quality coefficients to the B-frames in different temporal layers to achieve a hierarchical quality structure.
Figure 3: Architecture of the motion encoder-decoder with our proposed bi-directional motion difference context propagation method. "RB" refers to residual block. "DB" refers to depth block [li2023neural]. "Subp" refers to the subpixel layer [shi2016real]. "MFA" refers to the motion feature adaptor.
Figure 4: Different types of reference information propagation.
Figure 5: Architecture of the bi-directional temporal context mining module. The "Convnet" is implemented by a convolutional layer and a residual block. The "ConNetDown" is implemented by a convolutional layer with stride 2 and a residual block. The "ConNetUP" is implemented by a subpixel layer and a residual block.
...and 12 more figures
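
The motion vector difference described in the Figure 1 caption can be illustrated with a small worked example. This is a sketch under stated assumptions: the sign convention (current motion minus prediction), the helper names, and the single-pixel vectors are hypothetical, and the paper's motion encoder-decoder operates on dense motion fields with learned networks rather than on plain tuples.

```python
def motion_vector_difference(v_cur, v_ref_to_ref):
    """Return r = v_cur - v_ref_to_ref / 2, computed per component.

    v_cur:        motion from the current frame to one reference, e.g. v_{t->f}
    v_ref_to_ref: motion between the two references, e.g. v_{b->f}
    """
    return tuple(c - p / 2.0 for c, p in zip(v_cur, v_ref_to_ref))


def reconstruct_motion(r_hat, v_ref_to_ref):
    """Invert the difference: v_hat = r_hat + v_ref_to_ref / 2."""
    return tuple(r + p / 2.0 for r, p in zip(r_hat, v_ref_to_ref))


# Hypothetical single-pixel motion vectors (dx, dy), in pixels.
v_b_to_f = (8.0, -4.0)   # motion between the two reference frames
v_t_to_f = (4.5, -2.0)   # motion from the current frame to the forward reference

r_t_to_f = motion_vector_difference(v_t_to_f, v_b_to_f)
v_hat_t_to_f = reconstruct_motion(r_t_to_f, v_b_to_f)

print(r_t_to_f)       # residual that would be coded instead of the full vector
print(v_hat_t_to_f)   # recovered motion when the residual is decoded losslessly
```

The intuition is that when the current frame lies roughly midway between its two references and motion is close to linear, half of the reference-to-reference motion is already a good prediction, so the residual tends to be small and cheaper to code. Because the prediction is estimated between the two reconstructed reference frames, it can presumably be recomputed at the decoder without extra bits.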
