HiNeRV: Video Compression with Hierarchical Encoding-based Neural Representation

Ho Man Kwan, Ge Gao, Fan Zhang, Andrew Gower, David Bull

TL;DR

HiNeRV advances implicit neural representations for video by introducing hierarchical encoding and a bilinear upsampling-based upscaling strategy, enabling a deep, wide network with greater capacity. It unifies frame-wise and patch-wise representations through overlapped patches and padding, increasing flexibility for hardware and memory constraints. A refined compression pipeline with adaptive pruning and quantization-aware training preserves reconstruction quality under model compression, yielding significant improvements over prior INR baselines and competitive results against conventional and learning-based codecs. The approach demonstrates substantial practical potential for INR-based video codecs: it is the first INR-based method to surpass HEVC in MS-SSIM on the evaluated datasets, though encoding speed remains a challenge for future work.
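
The overlapped-patch idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pad_frame`, `split_overlapped`, `stitch`), the use of zero padding, and the requirement that the frame size be divisible by the patch size are all assumptions made for illustration. Each window carries `pad` extra pixels of context on every side; cropping that border and tiling the cores recovers the full frame, which is why the same representation can be evaluated per frame or per patch.

```python
def pad_frame(frame, pad):
    """Zero-pad a 2D frame (list of rows) by `pad` pixels on every side."""
    width = len(frame[0])
    blank = [0] * (width + 2 * pad)
    padded = [blank[:] for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in frame]
    padded += [blank[:] for _ in range(pad)]
    return padded


def split_overlapped(frame, patch, pad):
    """Cut a frame into windows of side patch + 2*pad whose cores tile the frame.

    Assumes (for illustration) that height and width are divisible by `patch`.
    """
    padded = pad_frame(frame, pad)
    size = patch + 2 * pad
    patches = {}
    for i in range(0, len(frame), patch):
        for j in range(0, len(frame[0]), patch):
            window = [row[j:j + size] for row in padded[i:i + size]]
            patches[(i // patch, j // patch)] = window
    return patches


def stitch(patches, height, width, patch, pad):
    """Crop the padding from every window and tile the cores into one frame."""
    out = [[0] * width for _ in range(height)]
    for (pi, pj), window in patches.items():
        for y in range(patch):
            for x in range(patch):
                out[pi * patch + y][pj * patch + x] = window[pad + y][pad + x]
    return out


# Toy 4 x 6 "frame" split into 2 x 2 cores with 1 pixel of overlap per side.
frame = [[r * 6 + c for c in range(6)] for r in range(4)]
patches = split_overlapped(frame, patch=2, pad=1)
restored = stitch(patches, height=4, width=6, patch=2, pad=1)
```

Smaller patches lower the peak memory needed per forward pass, while the overlap gives layers with spatial support (such as convolutions) the neighbouring context they would see in a whole-frame pass.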

Abstract

Learning-based video compression is currently a popular research topic, offering the potential to compete with conventional standard video codecs. In this context, Implicit Neural Representations (INRs) have previously been used to represent and compress image and video content, demonstrating relatively high decoding speed compared to other methods. However, existing INR-based methods have failed to deliver rate-quality performance comparable with the state of the art in video compression. This is mainly due to the simplicity of the employed network architectures, which limits their representation capability. In this paper, we propose HiNeRV, an INR that combines lightweight layers with novel hierarchical positional encodings. We employ depth-wise convolutional, MLP and interpolation layers to build a deep and wide network architecture with high capacity. HiNeRV is also a unified representation that encodes videos in both frames and patches at the same time, which offers higher performance and flexibility than existing methods. We further build a video codec based on HiNeRV and a refined pipeline for training, pruning and quantization that can better preserve HiNeRV's performance during lossy model compression. The proposed method has been evaluated on both the UVG and MCL-JCV datasets for video compression, demonstrating significant improvement over all existing INR baselines and competitive performance when compared to learning-based codecs (72.3% overall bit rate saving over HNeRV and 43.4% over DCVC on the UVG dataset, measured in PSNR).

Paper Structure

This paper contains 29 sections, 3 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: (Left) Visual comparison between HNeRV chen2023hnerv and HiNeRV (ours) for compressed content (cropped). HiNeRV offers improved visual quality with approximately half the bit rate compared to HNeRV (PSNR and bitrate values are for the whole sequence). (Right) Video regression with different epochs for a representation task. HiNeRV (ours) with only 37 epochs achieves similar reconstruction quality to HNeRV chen2023hnerv with 300 epochs.
  • Figure 2: Top: The HiNeRV architecture. Bottom left: The HiNeRV block. The HiNeRV block takes feature maps $X_{n-1}$ and patch index $(i, j, t)$ as input, upsamples the feature maps, enhances them with the hierarchical encoding, then computes the transformed maps $X_{n}$. Bottom right: The local grid. In HiNeRV, the hierarchical encoding is computed by interpolating from the local grid, using the modulo of the coordinates.
  • Figure 3: Video compression results on the UVG mercat2020uvg and MCL-JCV wang2016mcl datasets.
  • Figure 4: Illustration of the proposed HiNeRV models employing frame-based or patch-based representation.
  • Figure 5: Video compression results on the UVG dataset mercat2020uvg.
  • ...and 6 more figures
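
The Figure 2 caption describes the hierarchical encoding as an interpolation from a local grid indexed by the modulo of the coordinates. The sketch below illustrates that idea only and is not the authors' code: the function names, the single-channel 2D grid, the fixed grid values, and the linear mapping from local coordinates to grid positions are all assumptions. In the paper the grid is part of the model and the block also takes a temporal index, both of which are omitted here.

```python
def bilinear_sample(grid, u, v):
    """Bilinearly interpolate a 2D grid (list of rows) at position (u, v).

    u indexes rows and v indexes columns; both must lie inside the grid.
    """
    rows, cols = len(grid), len(grid[0])
    u0, v0 = int(u), int(v)
    u1, v1 = min(u0 + 1, rows - 1), min(v0 + 1, cols - 1)
    du, dv = u - u0, v - v0
    top = grid[u0][v0] * (1 - dv) + grid[u0][v1] * dv
    bottom = grid[u1][v0] * (1 - dv) + grid[u1][v1] * dv
    return top * (1 - du) + bottom * du


def local_grid_encoding(height, width, scale, grid):
    """Build a height x width encoding map from a small local grid.

    Pixel coordinates are reduced modulo `scale`, so the encoding repeats
    for every scale x scale block of upsampled pixels.
    """
    rows, cols = len(grid), len(grid[0])
    encoding = []
    for y in range(height):
        row = []
        for x in range(width):
            local_y, local_x = y % scale, x % scale
            # Map the local coordinate in [0, scale - 1] onto the grid extent.
            u = local_y * (rows - 1) / (scale - 1) if scale > 1 else 0.0
            v = local_x * (cols - 1) / (scale - 1) if scale > 1 else 0.0
            row.append(bilinear_sample(grid, u, v))
        encoding.append(row)
    return encoding


# Toy 2 x 2 grid (hypothetical values) applied to a 6 x 6 map upsampled by 3.
grid = [[0.0, 1.0], [2.0, 3.0]]
encoding = local_grid_encoding(height=6, width=6, scale=3, grid=grid)
```

Because only local coordinates are used, the grid stays small regardless of the output resolution; the resulting map would then be combined with the upsampled feature maps inside the block.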