UCVC: A Unified Contextual Video Compression Framework with Joint P-frame and B-frame Coding

Jiayu Yang, Wei Jiang, Yongqi Zhai, Chunhui Yang, Ronggang Wang

TL;DR

The paper tackles the challenge of flexible video compression by enabling unified P-frame and B-frame coding within a single learned framework. It introduces UCVC, which uses two neighboring decoded frames as references and jointly trains for both P- and B-frame scenarios, achieving comparable efficiency to frame-type-specific methods. The model employs a conditional coding pipeline with motion estimation, temporal context mining, and a mean-scale hyperprior, optimized via a rate–distortion objective $Loss = R + \lambda D$ and a strategic frame-type allocation across GoPs. Empirical results on CLIC validation and benchmark datasets show that frame-type selection per sequence yields BD-rate savings and competitive performance against traditional codecs and state-of-the-art learned methods, highlighting the practical value of adaptive frame-type strategies in learned video compression.
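As a rough illustration of the rate-distortion objective $Loss = R + \lambda D$ mentioned above, the sketch below computes the loss for one frame. The function names, the use of mean squared error as the distortion $D$, and the numeric values are assumptions made for illustration; the summary does not state how the paper measures $R$ and $D$ or which $\lambda$ values it uses.

```python
from typing import Sequence


def mse(original: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Mean squared error between two equally sized pixel sequences (assumed distortion measure)."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def rd_loss(rate: float, distortion: float, lam: float) -> float:
    """Rate-distortion objective: Loss = R + lambda * D."""
    return rate + lam * distortion


# Hypothetical values: 0.1 bits per pixel and a tiny 4-pixel "frame".
d = mse([0.0, 0.5, 1.0, 0.5], [0.0, 0.5, 0.9, 0.6])
loss = rd_loss(rate=0.1, distortion=d, lam=100.0)
```

A larger $\lambda$ weights distortion more heavily, which pushes a model trained with this loss toward higher quality at a higher bit rate.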

Abstract

This paper presents a learned video compression method in response to the video compression track of the 6th Challenge on Learned Image Compression (CLIC), at DCC 2024. Specifically, we propose a unified contextual video compression framework (UCVC) for joint P-frame and B-frame coding. Each non-intra frame refers to two neighboring decoded frames, which can be either both from the past for P-frame compression, or one from the past and one from the future for B-frame compression. In the training stage, the model parameters are jointly optimized with both P-frames and B-frames. Benefiting from these designs, the framework supports both P-frame and B-frame coding and achieves compression efficiency comparable to that of frameworks designed specifically for P-frame or B-frame coding. For the challenge submission, we report the optimal compression efficiency by selecting the appropriate frame type for each test sequence. Our team name is PKUSZ-LVC.
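The two-reference scheme described in the abstract can be sketched as a small helper that returns the indices of the reference frames for a given frame type. This is a minimal sketch, not the authors' code: the function name is hypothetical, and the offsets (the two nearest past frames for a P-frame, the nearest past and nearest future frame for a B-frame) are an assumption about what "neighboring" means in the simplest case.

```python
from typing import Tuple


def reference_indices(t: int, frame_type: str) -> Tuple[int, int]:
    """Return display-order indices of the two reference frames for frame t.

    Assumed offsets (hypothetical, simplest case):
      P-frame: both references from the past, t-1 and t-2.
      B-frame: one past and one future reference, t-1 and t+1.
    """
    if frame_type == "P":
        refs = (t - 1, t - 2)
    elif frame_type == "B":
        refs = (t - 1, t + 1)
    else:
        raise ValueError(f"unknown frame type: {frame_type!r}")
    if min(refs) < 0:
        # How the first frames after the I-frame are handled is not stated in the source.
        raise ValueError("not enough decoded past frames for this frame type")
    return refs


p_refs = reference_indices(5, "P")  # both references precede frame 5
b_refs = reference_indices(5, "B")  # one reference precedes, one follows frame 5
```

Because both frame types feed the same two-reference interface, a single set of model parameters can serve either case; only the choice of reference indices changes.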

Paper Structure (10 sections, 1 equation, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Comparison of the reference structures of P-frame (left) and B-frame (right) coding. The dashed line represents a GoP of size 8 and the numbers indicate the encoding order.
  • Figure 2: Overview of our learned video compression framework. The first frame in each GoP is compressed as I-frame (left), while the others are compressed as P-frames or B-frames (right). Given current frame $x_{cur}$ at time step $t$, the two reference frames $\hat{x}_{ref1}$ and $\hat{x}_{ref2}$ are either both from the past, e.g., $\hat{x}_{t-1}$ and $\hat{x}_{t-2}$, for P-frame compression, or one from the past and one from the future, e.g., $\hat{x}_{t-1}$ and $\hat{x}_{t+1}$, for B-frame compression.
  • Figure 3: GoP structure in the training stage, where each training sample contains both P-frames and B-frames.
  • Figure 4: RD curves in terms of PSNR on the CLIC validation set at 0.05 Mbps and 0.5 Mbps. Results for the different frame types are reported, where "optimal" denotes selecting, for each sequence, the frame type that achieves the better compression efficiency.
  • Figure 5: BD-rate comparison in terms of PSNR on the CLIC validation dataset for the 0.05 Mbps track. P-frame coding is the anchor, and results are reported for B-frame coding and for the optimal frame type. The horizontal axis indicates the sequence index. Video content is shown for several sequences: sequences 9, 18, and 19, with slow and simple motion, achieve bit rate savings with B-frame compression, while sequences 8, 14, and 26, with large motion, occlusion, and camera motion, suffer bit rate increases. By combining the best results of P-frame and B-frame coding, the framework achieves better compression efficiency.
  • ...and 2 more figures
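
The per-sequence "optimal" selection described in the captions of Figures 4 and 5 can be sketched as follows. This is only an illustration: the function name is hypothetical, and the BD-rate numbers are made-up placeholders, not results from the paper. The only thing taken from the source is the direction of the effect, with P-frame coding as the anchor and a negative BD-rate meaning that B-frame coding saves bits.

```python
from typing import Dict


def select_frame_types(bd_rate_b_vs_p: Dict[int, float]) -> Dict[int, str]:
    """Pick a frame type per sequence from the BD-rate of B-frame coding.

    The BD-rate is measured against a P-frame anchor, in percent.
    Negative means B-frame coding needs fewer bits at equal quality.
    """
    return {
        seq: ("B" if bd_rate < 0.0 else "P")
        for seq, bd_rate in bd_rate_b_vs_p.items()
    }


# Placeholder BD-rates (percent), invented for illustration only.
example_bd_rates = {9: -6.0, 18: -3.5, 8: 12.0, 14: 7.5}
choices = select_frame_types(example_bd_rates)
```

In this sketch a sequence falls back to P-frame coding whenever B-frame coding gives no saving, so the selected configuration is never worse than the P-frame anchor on any sequence.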