
Accelerating Learned Video Compression via Low-Resolution Representation Learning

Zidian Qiu, Zongyao He, Zhi Jin

TL;DR

This work targets practical learned video compression by addressing its latency bottlenecks through low-resolution representation learning. It introduces an efficiency-optimized framework that moves costly high-resolution operations into low-resolution space, reuses features of decoded frames (including I-frames), and leverages multi-frame priors and online encoder updates. The approach achieves rate–distortion performance competitive with traditional codecs and substantially faster encoding and decoding than prior neural codecs, decoding each 1080p frame in under 100 ms on an RTX 2080Ti. These results demonstrate the potential for real-time neural video codecs and provide design guidelines for balancing compression quality and inference speed.

Abstract

In recent years, learned video compression has advanced rapidly, exemplified by the recent neural video codec DCVC-DC, which has outperformed the upcoming next-generation codec ECM in terms of compression ratio. Despite this, learned video compression frameworks often exhibit low encoding and decoding speeds, primarily due to their increased computational complexity and unnecessary high-resolution spatial operations, which severely hinder their practical application. In this work, we introduce an efficiency-optimized framework for learned video compression that focuses on low-resolution representation learning, aiming to significantly increase encoding and decoding speeds. Firstly, we reduce the computational load by lowering the resolution of the inter-frame propagated features, which are obtained by reusing features of decoded frames, including I-frames. We also implement a joint training strategy for the I-frame and P-frame models, further improving the compression ratio. Secondly, our approach efficiently leverages multi-frame priors for parameter prediction, minimizing computation at the decoding end. Thirdly, we revisit the Online Encoder Update (OEU) strategy for high-resolution sequences, achieving notable improvements in compression ratio without compromising decoding efficiency. Our efficiency-optimized framework significantly improves the balance between compression ratio and speed for learned video compression. Compared with traditional codecs, our method achieves performance on par with the low-delay P configuration of the H.266 reference software VTM. Furthermore, compared with DCVC-HEM, our approach delivers a comparable compression ratio while boosting encoding and decoding speeds by factors of 3 and 7, respectively. On an RTX 2080Ti, our method can decode each 1080p frame in under 100 ms.
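
To make the motivation for low-resolution features concrete, the following is a minimal sketch (not the authors' code) of how the cost of a single convolution layer scales with the spatial resolution it runs at. The layer shape (3×3 kernel, 64 input and output channels) and the helper names `conv_macs` and `macs_at_scale` are illustrative assumptions and are not taken from the paper.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulates for one stride-1, same-padded 2D convolution."""
    return height * width * c_in * c_out * kernel * kernel


def macs_at_scale(factor, height=1080, width=1920, channels=64):
    """MACs of the same layer applied to a feature map downsampled by `factor`.

    The channel count is a hypothetical value chosen for illustration only.
    """
    if height % factor or width % factor:
        raise ValueError("factor must divide both height and width")
    return conv_macs(height // factor, width // factor, channels, channels)


if __name__ == "__main__":
    full = macs_at_scale(1)
    for factor in (1, 2, 4, 8):
        macs = macs_at_scale(factor)
        # Spatial area shrinks by factor**2, and so does the layer's cost.
        print(f"{factor}x downsampled: {macs / 1e9:.2f} GMACs, "
              f"{full // macs}x cheaper than full resolution")
```

Under these assumptions the cost of the layer falls with the square of the downsampling factor, which is why moving an operation from full resolution to a 4× downsampled feature map cuts its arithmetic by roughly a factor of 16. Real latency also depends on memory traffic and kernel efficiency, so measured speedups will generally differ from this count.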

Paper Structure (15 sections, 7 figures, 7 tables)

This paper contains 15 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Rate-speed comparison on the UVG dataset. Methods closer to the top left provide better performance and efficiency. OEU denotes the online encoder update strategy. Tested on an RTX 2080Ti with 1080p input.
  • Figure 2: Overview of our framework, where red lines represent data flows not included in the decoder side, while blue lines indicate data flows on the decoder side. $x_t$ and $\hat{x}_t$ represent the input frame and reconstructed frame, respectively. $MV_t$, $mv_t$ and $\hat{mv}_t$ denote the optical flow, motion vector and reconstructed motion vector, respectively. $y_t$ and $\hat{y}_t$ represent the latent representation and reconstructed intermediate feature in the contextual decoder, and $ctx_t$ is the learned temporal context, where the superscript $n\times$ denotes that the downsampling factor is $n$ relative to the input resolution.
  • Figure 3: Inference latency composition of different approaches. Tested on a V100 with 1080p input.
  • Figure 4: Proposed motion encoder utilizing multi-frame priors.
  • Figure 5: Proposed multi-scale context fusion module.
  • ...and 2 more figures