EV-NVC: Efficient Variable bitrate Neural Video Compression

Yongcun Hu, Yingzhen Zhai, Jixiang Luo, Wenrui Dai, Dell Zhang, Hongkai Xiong, Xuelong Li

TL;DR

This paper tackles the challenge of training variable-rate neural video codecs by introducing EV-NVC, a framework that combines a Piecewise Linear Sampler (PLS) for effective rate control with a Long-Short-Term Feature Fusion Module (LSTFFM) to integrate long- and short-term context. A multi-stage, mixed-precision training strategy is used to optimize learning and evaluate component contributions, while motion estimation relies on a pre-trained SpyNet. The key contributions include the PLS with four idx segments and specific hyperparameters, the LSTFFM architecture that fuses long-term references like $\hat{x}_{t-4}$ with short-term features, and an 18-stage training regimen that progressively shapes motion, reconstruction, and multi-frame losses. Experimental results show BD-rate reductions up to 30.56% versus HM-16.25 and competitive performance with VTM-17.0 across HEVC classes, with ablation confirming substantial gains from both PLS and LSTFFM. Overall, EV-NVC provides a scalable, open-source approach to variable-rate neural video compression that can operate efficiently across diverse devices and applications.

Abstract

Training a variable-rate neural video codec (NVC) is highly challenging because of the complex training strategies and model structure involved. In this paper, we train an efficient variable bitrate neural video codec (EV-NVC) with a piecewise linear sampler (PLS) to improve rate-distortion performance in the high-bitrate range, and a long-short-term feature fusion module (LSTFFM) to enhance context modeling. In addition, we introduce mixed-precision training and discuss the training strategy for each stage in detail to fully evaluate the effectiveness of the approach. Experimental results show that our approach reduces the BD-rate by 30.56% compared to HM-16.25 in low-delay mode.

Paper Structure

This paper contains 11 sections, 10 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the proposed method. $x_{t}$ and $\hat{x}_{t}$ are the current frame and the reconstructed frame. $v_{t}$ and $\hat{v}_{t}$ are the motion vector and the decoded motion vector. $c_{t}$ is the conditional information. $q_{t}$ is the rate control parameter.
  • Figure 2: The structure of the long-short-term feature fusion module (LSTFFM). The long-term extraction conv (LTFC) adopts a conv-LeakyReLU-conv structure for further feature extraction, and the upsampling module is implemented using conv and pixelshuffle. $\mathrm{C}$ is the concatenation operation and $\downarrow 2$ is a convolutional layer with stride 2.
  • Figure 3: Rate-distortion curves for HEVC Class B, C, D. The comparison is in the YUV420 colorspace. Distortion of the decoded pictures is measured with the composite PSNR $(6 \cdot \mathrm{PSNR}_Y + \mathrm{PSNR}_U + \mathrm{PSNR}_V)/8$. The rightmost plot is the rate-distortion curve for HEVC Class D from stage 8 to stage 18.
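
The composite PSNR used in the Figure 3 caption weights luma six times as heavily as each chroma plane. The following is a minimal sketch of that arithmetic only; the function names, the 8-bit peak value of 255, and the per-plane MSE inputs are illustrative assumptions and are not taken from the paper's code.

```python
import math


def psnr_from_mse(mse: float, peak: float = 255.0) -> float:
    """PSNR in dB for a single plane, given its mean squared error.

    `peak` is the maximum sample value; 255 assumes 8-bit video (an assumption).
    """
    if mse == 0:
        return math.inf  # identical planes: PSNR is unbounded
    return 10.0 * math.log10(peak * peak / mse)


def composite_psnr_yuv(psnr_y: float, psnr_u: float, psnr_v: float) -> float:
    """Weighted PSNR (6*PSNR_Y + PSNR_U + PSNR_V) / 8, as in the Figure 3 caption."""
    return (6.0 * psnr_y + psnr_u + psnr_v) / 8.0


if __name__ == "__main__":
    # Hypothetical per-plane MSE values for one decoded frame.
    y = psnr_from_mse(6.5)
    u = psnr_from_mse(4.0)
    v = psnr_from_mse(3.2)
    print(f"composite PSNR: {composite_psnr_yuv(y, u, v):.2f} dB")
```

Because the weights sum to 8, the composite value equals the per-plane PSNR whenever all three planes have the same PSNR, and it is otherwise dominated by the luma term.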