NN-VVC: Versatile Video Coding boosted by self-supervisedly learned image coding for machines

Jukka I. Ahonen, Nam Le, Honglei Zhang, Antti Hallapuro, Francesco Cricri, Hamed Rezazadegan Tavakoli, Miska M. Hannuksela, Esa Rahtu

TL;DR

NN-VVC targets efficient video coding for machine analysis by pairing a self-supervised learned image codec (LIC) for intra-frames with a conventional, high-performance codec (VVC) for inter-frames. The system introduces two adapters, IHA and IMA, which refine intra-frames for use as references and tailor inter-frames to machine tasks, respectively, plus a fallback path and spatial resampling to handle low-bitrate regimes. Empirically, NN-VVC achieves BD-rate reductions over VVC of up to 43.20% on image data and 26.8% on video data across multiple vision tasks and datasets, while maintaining bitstream interoperability. This hybrid approach offers a practical path to machine-oriented video coding, with potential extensions to higher resolutions and joint optimization for human and machine viewing.

Abstract

Recent progress in artificial intelligence has led to ever-increasing use of images and videos by machine analysis algorithms, mainly neural networks. Nonetheless, compression, storage and transmission of media have traditionally been designed with human beings as the viewers of the content. Recent research on image and video coding for machine analysis has progressed mainly in two almost orthogonal directions. The first is represented by end-to-end (E2E) learned codecs which, while offering high performance on image coding, are not yet on par with state-of-the-art conventional video codecs and lack interoperability. The second direction uses the Versatile Video Coding (VVC) standard or any other conventional video codec (CVC) together with pre- and post-processing operations targeting machine analysis. While the CVC-based methods benefit from interoperability and broad hardware and software support, their machine task performance is often lower than desired, particularly at low bitrates. This paper proposes a hybrid codec for machines called NN-VVC, which combines the advantages of an E2E-learned image codec and a CVC to achieve high performance in both image and video coding for machines. Our experiments show that the proposed system achieves Bjøntegaard Delta rate (BD-rate) reductions of up to 43.20% and 26.8% over VVC for image and video data, respectively, when evaluated on multiple datasets and machine vision tasks. To the best of our knowledge, this is the first research paper showing a hybrid video codec that outperforms VVC on multiple datasets and multiple machine vision tasks.
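To make the headline metric concrete, the following is a minimal sketch of how a Bjøntegaard Delta rate can be computed from two rate-quality curves. It is not the authors' evaluation code. The function names (`bd_rate`, `_interp`, `_mean_log_rate`), the sample numbers, and the use of a task metric such as mAP as the quality axis are assumptions for illustration. It also swaps the interpolation commonly used for BD-rate (a cubic polynomial or piecewise-cubic fit) for plain piecewise-linear interpolation, to stay dependency-free.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation of y at x; xs must be strictly increasing."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the curve's quality range")


def _mean_log_rate(points, lo, hi, steps=1000):
    """Average of log(rate) over the quality interval [lo, hi] (trapezoid rule)."""
    pts = sorted(points, key=lambda p: p[1])      # sort by quality
    qualities = [q for _, q in pts]
    log_rates = [math.log(r) for r, _ in pts]
    h = (hi - lo) / steps
    # Clamp the last sample so rounding cannot push it past hi.
    samples = [_interp(min(lo + i * h, hi), qualities, log_rates)
               for i in range(steps + 1)]
    area = h * (sum(samples) - 0.5 * (samples[0] + samples[-1]))
    return area / (hi - lo)


def bd_rate(anchor, test):
    """BD-rate in percent of `test` relative to `anchor`.

    Each curve is a list of (bitrate, quality) pairs with distinct quality
    values. A negative result means `test` needs less bitrate on average
    for the same quality.
    """
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    if hi <= lo:
        raise ValueError("the two curves share no quality range")
    diff = _mean_log_rate(test, lo, hi) - _mean_log_rate(anchor, lo, hi)
    return (math.exp(diff) - 1.0) * 100.0


# Made-up (bitrate in kbps, mAP) points, NOT results from the paper.
anchor_curve = [(100.0, 0.30), (200.0, 0.40), (400.0, 0.48), (800.0, 0.52)]
# Hypothetical codec that reaches the same quality at 80% of the bitrate.
test_curve = [(0.8 * r, q) for r, q in anchor_curve]

print(f"BD-rate: {bd_rate(anchor_curve, test_curve):.2f}%")  # about -20%
```

The sketch assumes quality increases monotonically with bitrate. Task metrics such as mAP do not always behave that way, so a real evaluation may need to filter or reorder points before interpolating.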

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The NN-VVC coding system. Light blue indicates a neural network component.
  • Figure 2: The learned image codec (LIC) for intra-frames.
  • Figure 3: The output from VVC compared to that of our proposed codec. The input sequence PartyScene_832x480_50_val was coded with IntraPeriod = 32 and QP = 52. At the same bitrate, our codec better preserves the details of foreground objects, and the strong edges in the background are also more visible.
  • Figure 4: Evaluation results across multiple vision tasks (object detection, instance segmentation, and multiple object tracking) on multiple datasets (Open Images, TVD image, TVD video, and SFU video), with VVC as the baseline.
  • Figure 5: Comparison of the predicted bounding boxes on VVC-reconstructed and NN-VVC-reconstructed inter-frames of the TVD-03 video. Blue, red, and green bounding boxes represent correct predictions common to both, predictions missing or incorrect only on VVC-reconstructed frames, and predictions correct only on NN-VVC-reconstructed frames, respectively. Note that all the red bounding boxes are missing predictions, except the largest one in the top-left image.