Neural B-Frame Coding: Tackling Domain Shift Issues with Lightweight Online Motion Resolution Adaptation

Sang NguyenQuang, Xiem HoangVan, Wen-Hsiao Peng

TL;DR

The paper tackles the domain shift in learned B-frame codecs that arises when they are trained on short GOPs but deployed on longer ones. It introduces Fast-OMRA, a family of lightweight online motion resolution adaptation methods that predict the downsampling factor $S\in\{1,2,4,8\}$ without retraining the codec. Three variants are offered: Bi-Class (trained with Focal Loss), Mu-Class (trained with soft RD-based labels), and Co-Class (a hybrid that adds a selective search). The methods achieve coding performance close to exhaustive search while significantly reducing encoding complexity, with Co-Class offering the best trade-off by combining predictive accuracy and selective search. Validation on MaskCRT-B across multiple datasets demonstrates robust RD-complexity improvements and practical applicability to learned B-frame codecs.
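To make the complexity argument concrete, the sketch below counts the trial encodes needed to pick a factor from $\{1,2,4,8\}$ by rate-distortion search, compared with a search restricted to fewer candidates. It is an illustration only: the helper names (`rd_cost`, `search_factor`, `toy_encode`), the Lagrangian form $J = D + \lambda R$, and the toy distortion/rate numbers are assumptions made here, not values or code from the paper.

```python
from typing import Callable, Iterable, Tuple

FACTORS = (1, 2, 4, 8)  # candidate downsampling factors S named in the text


def rd_cost(distortion: float, rate: float, lam: float) -> float:
    """Lagrangian rate-distortion cost J = D + lambda * R (assumed form)."""
    return distortion + lam * rate


def search_factor(
    encode: Callable[[int], Tuple[float, float]],
    lam: float,
    candidates: Iterable[int] = FACTORS,
) -> Tuple[int, int]:
    """Trial-encode each candidate factor and keep the cheapest one.

    `encode` is a hypothetical stand-in for one pass of the codec at
    factor s; it returns (distortion, rate). The function returns the
    best factor and the number of trial encodes that were spent.
    """
    best_s, best_j, trials = -1, float("inf"), 0
    for s in candidates:
        distortion, rate = encode(s)
        trials += 1
        j = rd_cost(distortion, rate, lam)
        if j < best_j:
            best_s, best_j = s, j
    return best_s, trials


def toy_encode(s: int) -> Tuple[float, float]:
    """Made-up (distortion, rate) pairs, for illustration only."""
    table = {1: (10.0, 100.0), 2: (11.0, 60.0), 4: (14.0, 50.0), 8: (20.0, 45.0)}
    return table[s]


# Exhaustive search: one trial encode per candidate factor.
full = search_factor(toy_encode, lam=0.1)
# Restricted search: only the candidates a classifier kept (here, 1 and 2).
restricted = search_factor(toy_encode, lam=0.1, candidates=(1, 2))
print(full, restricted)
```

In this toy case both searches select the same factor, but the restricted one spends half as many trial encodes. A classifier that outputs a single factor would need no trial encodes for the decision at all, at the risk of an occasional wrong choice.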

Abstract

Learned B-frame codecs with hierarchical temporal prediction often encounter the domain-shift issue due to mismatches between the Group-of-Pictures (GOP) sizes for training and testing, leading to inaccurate motion estimates, particularly for large motion. A common solution is to turn large motion into small motion by downsampling video frames during motion estimation. However, determining the optimal downsampling factor typically requires costly rate-distortion optimization. This work introduces lightweight classifiers to predict downsampling factors. These classifiers leverage simple state signals from current and reference frames to balance rate-distortion performance with computational cost. Three variants are proposed: (1) a binary classifier (Bi-Class) trained with Focal Loss to choose between high and low resolutions, (2) a multi-class classifier (Mu-Class) trained with novel soft labels based on rate-distortion costs, and (3) a co-class approach (Co-Class) that combines the predictive capability of the multi-class classifier with the selective search of the binary classifier. All classifier methods can work seamlessly with existing B-frame codecs without requiring codec retraining. Experimental results show that they achieve coding performance comparable to exhaustive search methods while significantly reducing computational complexity. The code is available at: https://github.com/NYCU-MAPL/Fast-OMRA.git.
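The two training ingredients named in the abstract can be sketched as follows. The first function is the standard binary Focal Loss, $-\alpha_t (1-p_t)^{\gamma}\log p_t$; the defaults `gamma=2.0` and `alpha=0.25` are common choices from the Focal Loss literature and are not necessarily the settings used for Bi-Class. The second function turns per-factor rate-distortion costs into a soft label with a softmax over negative costs. That mapping and its `temperature` parameter are assumptions made for illustration, since this summary does not give the paper's soft-label formula.

```python
import math
from typing import List, Sequence


def binary_focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Standard binary focal loss for one sample.

    p is the predicted probability of class 1 and y is the 0/1 target.
    The (1 - p_t) ** gamma factor shrinks the loss of well-classified
    samples so that hard samples dominate training.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def soft_labels_from_rd(costs: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Map RD costs to a probability vector (assumed mapping, not the paper's).

    Lower cost gives higher probability; the minimum cost is subtracted
    first for numerical stability.
    """
    lowest = min(costs)
    weights = [math.exp(-(c - lowest) / temperature) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical RD costs for S = 1, 2, 4, 8.
labels = soft_labels_from_rd([20.0, 17.0, 19.0, 24.5])
easy = binary_focal_loss(0.9, 1)  # confident and correct
hard = binary_focal_loss(0.1, 1)  # confident and wrong
print(labels, easy, hard)
```

Unlike a one-hot target, a soft label of this kind gives partial credit to factors whose cost is close to the best one, so the classifier is penalized less for near-optimal predictions.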

Paper Structure

This paper contains 7 sections, 5 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: The GOP structure in the training process (a) consists of only five frames, whereas the inference process (b) can extend up to 32 frames.
  • Figure 2: (a) Motion estimation and temporal warping with and without downsampling, (b) Online Motion Resolution Adaptation for neural B-frame coding, and (c) classification network architecture.
  • Figure 3: The conditional probability distributions of ground-truth downsampling factors given the predicted outcomes of Mu-Class on the BVI-DVC and TVD datasets.
  • Figure 4: The rate-distortion performance comparison for BasketballDrive, Jockey and videoSRC01 sequences.
  • Figure 5: (a) The rate-distortion-complexity trade-offs when Bi-Class (B), Mu-Class (M), and Co-Class (C) Fast-OMRA are applied to various temporal layers. The B (n), M (n), and C (n) configurations apply Bi-Class, Mu-Class, and Co-Class Fast-OMRA, respectively, to all video frames in temporal layers up to n. (b) The temporal complexity of video sequences versus the BD-rate savings achieved by our Co-Class Fast-OMRA. The red and green points indicate the best and worst sequences in terms of BD-rates, respectively.
  • ...and 1 more figure