Fast-OMRA: Fast Online Motion Resolution Adaptation for Neural B-Frame Coding

Sang NguyenQuang, Zong-Lin Gao, Kuan-Wei Ho, Xiem HoangVan, Wen-Hsiao Peng

TL;DR

This work introduces lightweight classifiers that determine the downsampling factor used for motion estimation in learned B-frame coding. Two variants, one with a binary classifier and one with a multi-class classifier, achieve coding performance comparable to brute-force search methods while greatly reducing computational complexity.

Abstract

Most learned B-frame codecs with hierarchical temporal prediction suffer from a domain shift caused by the discrepancy between the Group-of-Pictures (GOP) sizes used for training and testing. As a result, the motion estimation network may fail to predict large motion properly. One effective strategy to mitigate this domain shift is to downsample video frames for motion estimation. However, finding the optimal downsampling factor involves a time-consuming rate-distortion optimization process. This work introduces lightweight classifiers to determine the downsampling factor. To strike a good rate-distortion-complexity trade-off, our classifiers observe simple state signals, namely the coding and reference frames only, to predict the best downsampling factor. We present two variants that adopt binary and multi-class classifiers, respectively. The binary classifier is trained with the Focal Loss and chooses between motion estimation at high and low resolutions. Our multi-class classifier is trained with novel soft labels that incorporate knowledge of the rate-distortion costs of the different downsampling factors. Both variants operate as add-on modules without the need to re-train the B-frame codec. Experimental results confirm that they achieve coding performance comparable to the brute-force search methods while greatly reducing computational complexity.
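
As a rough illustration of the two training signals named in the abstract, the sketch below computes a binary Focal Loss for a single sample and derives soft labels from per-factor rate-distortion costs. The abstract does not state the exact formulation, so the softmax over negative costs, the `temperature` parameter, the `alpha` and `gamma` defaults, and both function names are assumptions made for illustration only.

```python
import math


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Standard binary Focal Loss for one sample.

    p: predicted probability of class 1; y: ground-truth label (0 or 1).
    alpha and gamma are commonly used defaults, not values from the paper.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)           # avoid log(0)
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def soft_labels_from_rd_costs(rd_costs, temperature=1.0):
    """Map per-factor RD costs to a probability vector (assumed form).

    Lower cost gives higher probability via a softmax over negative costs.
    This mapping is a hypothetical stand-in for the paper's soft labels.
    """
    best = min(rd_costs)
    # Subtracting the minimum keeps the exponentials numerically stable.
    exps = [math.exp(-(c - best) / temperature) for c in rd_costs]
    total = sum(exps)
    return [e / total for e in exps]


# Example: hypothetical RD costs for downsampling factors 1, 2, 4, 8.
labels = soft_labels_from_rd_costs([1.50, 1.20, 1.35, 1.80], temperature=0.1)
```

With soft labels of this kind, a factor whose cost is close to the minimum still receives noticeable probability mass, so the classifier is penalized less for picking a near-optimal factor than for picking a clearly worse one.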

Paper Structure

This paper contains 15 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Illustration of our compression system. (a) The coding framework of MaskCRT B-frame, where $x_t$ is the current coding B-frame and $\hat{x}_{t-k}, \hat{x}_{t+k}$ are the two previously reconstructed reference frames. $S$ denotes the downsampling factor, which takes a value of 1, 2, 4, or 8. (b) The proposed network used by Bi-Class and Mu-Class to decide the downsampling factor. It takes $x_t$, $\hat{x}_{t-k}$, and $\hat{x}_{t+k}$ as inputs and predicts the downsampling factor $S$.
  • Figure 2: The rate-distortion performance comparison. The anchor is MaskCRT B-frame.
  • Figure 3: The rate-distortion-complexity trade-offs.
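
To make the complexity gap behind Figures 1 and 3 concrete, the sketch below contrasts a brute-force search over $S \in \{1, 2, 4, 8\}$, which encodes the frame once per candidate factor, with a classifier-based choice that needs a single prediction. The cost form `rate + lam * distortion`, the `encode_fn` callback, the function names, and the numbers in the example are all hypothetical; the actual codec and cost definition are not given on this page.

```python
FACTORS = (1, 2, 4, 8)  # candidate downsampling factors S from Figure 1


def brute_force_factor(encode_fn, factors=FACTORS, lam=0.01):
    """Try every factor and keep the one with the lowest RD cost.

    encode_fn(s) is a hypothetical callback returning (rate, distortion)
    after coding the frame with motion estimated at 1/s resolution.
    """
    best_factor, best_cost = None, float("inf")
    for s in factors:
        rate, distortion = encode_fn(s)       # one full encode per candidate
        cost = rate + lam * distortion        # assumed RD cost form
        if cost < best_cost:
            best_factor, best_cost = s, cost
    return best_factor, best_cost


def classifier_factor(probabilities, factors=FACTORS):
    """Pick the factor with the highest predicted probability."""
    best_index = max(range(len(factors)), key=lambda i: probabilities[i])
    return factors[best_index]


# Made-up (rate, distortion) pairs standing in for real encodes.
fake_results = {1: (1.0, 50.0), 2: (0.8, 40.0), 4: (0.9, 45.0), 8: (1.2, 60.0)}
calls = []


def fake_encode(s):
    calls.append(s)
    return fake_results[s]


searched, searched_cost = brute_force_factor(fake_encode)
predicted = classifier_factor([0.1, 0.2, 0.6, 0.1])
```

The brute-force path calls the encoder once per factor, whereas the classifier path selects a factor before any encoding and therefore requires only one encode afterwards.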