Bi-Directional Deep Contextual Video Compression
Xihua Sheng, Li Li, Dong Liu, Shiqi Wang
TL;DR
This work tackles the limited performance of deep B-frame coding by introducing DCVC-B, a bi-directional deep contextual video compression framework designed for B-frames. It combines bi-directional motion difference context propagation, bi-directional temporal context mining, and hierarchical quality structure-based training to enable efficient bit allocation and better use of bi-directional information within a GOP of 32 frames. Key contributions include a novel motion difference propagation mechanism, multi-scale bi-directional temporal contexts, a context-conditioned encoder-decoder, an enhanced temporal entropy model, and a hierarchical training strategy. Together these yield substantial BD-rate reductions (average PSNR BD-rate of -26.6% and MS-SSIM BD-rate of -49.9% against HM-RA-GOP16), and on certain test datasets DCVC-B surpasses the H.266/VVC reference software under the random-access configuration. The results suggest that careful handling of bi-directional information can push deep B-frame coding toward parity with traditional codecs, with practical implications for high-efficiency video compression in random-access scenarios.
Abstract
Deep video compression has made remarkable progress in recent years, with the majority of advancements concentrated on P-frame coding. Although efforts to enhance B-frame coding are ongoing, the compression performance of deep B-frame codecs is still far behind that of traditional bi-directional video codecs. In this paper, we introduce a bi-directional deep contextual video compression scheme tailored for B-frames, termed DCVC-B, to improve the compression performance of deep B-frame coding. Our scheme has three key innovations. First, we develop a bi-directional motion difference context propagation method for effective motion difference coding, which significantly reduces the bit cost of bi-directional motions. Second, we propose a bi-directional contextual compression model and a corresponding bi-directional temporal entropy model to make better use of the multi-scale temporal contexts. Third, we propose a hierarchical quality structure-based training strategy, leading to effective bit allocation across large groups of pictures (GOPs). Experimental results show that DCVC-B achieves an average BD-rate reduction of 26.6% compared to the reference software for H.265/HEVC under random access conditions. Remarkably, it surpasses the performance of the H.266/VVC reference software on certain test datasets under the same configuration. We anticipate that our work can provide valuable insights and bring deep B-frame coding to the next level.
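To make the idea of a hierarchical structure over a large GOP more concrete, the following minimal sketch enumerates the coding order of B-frames in a conventional dyadic hierarchical-B layout, where each B-frame is predicted from one past and one future reference and is assigned a temporal layer. This is an illustration of the general random-access structure, not the exact configuration used by DCVC-B; the function name `hierarchical_b_order` and the layer numbering are assumptions made for this example, and the GOP size of 32 is taken from the summary above.

```python
def hierarchical_b_order(gop_size):
    """List B-frames between boundary frames 0 and gop_size in coding order.

    Each entry is (frame_index, past_ref, future_ref, layer). Layer 1 is the
    middle frame of the GOP; deeper layers sit between already-coded frames.
    Hypothetical helper: assumes a dyadic split, as in conventional
    hierarchical-B random-access configurations.
    """
    order = []

    def split(left, right, layer):
        # No frame lies strictly between two adjacent frames.
        if right - left < 2:
            return
        mid = (left + right) // 2
        # The middle frame is coded using both neighbours as references.
        order.append((mid, left, right, layer))
        split(left, mid, layer + 1)
        split(mid, right, layer + 1)

    split(0, gop_size, 1)
    return order


if __name__ == "__main__":
    for frame, past, future, layer in hierarchical_b_order(32)[:5]:
        print(f"frame {frame:2d}: refs ({past}, {future}), layer {layer}")
```

In such a layout, frames in shallow layers serve as references for many later frames, which is why a hierarchical quality structure typically spends more bits on them than on frames in the deepest layer.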
