High-Frequency Enhanced Hybrid Neural Representation for Video Compression

Li Yu, Zhihui Li, Jimin Xiao, Moncef Gabbouj

TL;DR

This work tackles the lack of high-frequency texture in videos reconstructed by implicit neural representations (INRs) by introducing a High-Frequency Enhanced Hybrid Neural Representation Network. The network combines four components: a wavelet high-frequency encoder built from Wavelet Frequency Decomposer (WFD) blocks, a High-Frequency Feature Modulation (HFM) block that fuses high-frequency and content features in the decoder, a Harmonic decoder block for upsampling, and a Dynamic Weighted Frequency Loss that helps preserve fine details during reconstruction. Experiments on Bunny and UVG show consistent improvements in texture fidelity and rate-distortion performance over NeRV, E-NeRV, HNeRV, end-to-end neural codecs, and traditional codecs across model sizes. The approach offers a practical path toward higher-quality INR-based video compression, though it currently emphasizes spatial high-frequency information and could benefit from incorporating temporal cues such as optical flow.

Abstract

Neural Representations for Videos (NeRV) have simplified the video codec process and achieved swift decoding speeds by encoding video content into a neural network, presenting a promising solution for video compression. However, existing work overlooks the crucial issue that videos reconstructed by these methods lack high-frequency details. To address this problem, this paper introduces a High-Frequency Enhanced Hybrid Neural Representation Network. Our method focuses on leveraging high-frequency information to improve the synthesis of fine details by the network. Specifically, we design a wavelet high-frequency encoder that incorporates Wavelet Frequency Decomposer (WFD) blocks to generate high-frequency feature embeddings. Next, we design the High-Frequency Feature Modulation (HFM) block, which leverages the extracted high-frequency embeddings to enhance the fitting process of the decoder. Finally, with the refined Harmonic decoder block and a Dynamic Weighted Frequency Loss, we further reduce the potential loss of high-frequency information. Experiments on the Bunny and UVG datasets demonstrate that our method outperforms other methods, showing notable improvements in detail preservation and compression performance.

Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Frames and their corresponding frequency maps. Row 1: Original frame alongside frames reconstructed by HNeRV and by our proposed method. Row 2: Frequency maps of the corresponding frames, where frequency increases outward from the center. Our method shows better spatial and spectral alignment with the original frame.
  • Figure 2: Overview of our proposed High-Frequency Enhanced Hybrid Neural Representation Network. We encode the content embedding $e_t^c$ and the high-frequency embedding $e_t^h$ via the content encoder and the wavelet high-frequency encoder, respectively. In the decoder, we use the proposed Harmonic block to upsample the embeddings and the HFM block to fuse $e_t^c$ and $e_t^h$. Finally, a header layer converts the features into the reconstructed image $\hat{x}_t$. The decoder and embeddings serve as the neural representation of a given video.
  • Figure 3: (Left) Illustration of frequency components obtained from Haar wavelet transformation. The Haar wavelet transform decomposes the input feature $F$ into four sub-bands: $F_{LL}$, $F_{LH}$, $F_{HL}$, and $F_{HH}$. Each transformation reduces the spatial resolution by half. (Right) Illustration of Wavelet Frequency Decomposer block.
  • Figure 4: Illustration of the High-Frequency Feature Modulation (HFM) block. The HFM block learns modulation vectors from high-frequency features to modulate the content features. The modulated features are then passed through a feed-forward network to further enhance the feature representation.
  • Figure 5: Video reconstruction visualization on the UVG dataset. The first column is the ground truth, the second and third columns are the reconstruction results of E-NeRV and HNeRV, and the fourth column is the reconstruction result of our method. Our approach better preserves texture structure.
  • ...and 2 more figures
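
The Haar decomposition described in the Figure 3 caption can be illustrated with a minimal single-level sketch in plain Python. This is not the authors' WFD block: the function name `haar_dwt2`, the orthonormal scaling by 1/2, and the assignment of the names LH and HL to the two directional detail bands are assumptions made for illustration, since sub-band naming conventions vary between references. The sketch only shows how a feature map $F$ splits into $F_{LL}$, $F_{LH}$, $F_{HL}$, and $F_{HH}$, each at half the spatial resolution.

```python
def haar_dwt2(feature):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns four sub-bands (LL, LH, HL, HH), each half the input resolution.
    The LH/HL naming here is an assumed convention, not taken from the paper.
    """
    height, width = len(feature), len(feature[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")

    ll, lh, hl, hh = [], [], [], []
    for i in range(0, height, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for j in range(0, width, 2):
            # The 2x2 block [[a, b], [c, d]] maps to one coefficient per band.
            a, b = feature[i][j], feature[i][j + 1]
            c, d = feature[i + 1][j], feature[i + 1][j + 1]
            row_ll.append((a + b + c + d) / 2)  # local average (low frequency)
            row_lh.append((a + b - c - d) / 2)  # top row minus bottom row
            row_hl.append((a - b + c - d) / 2)  # left column minus right column
            row_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


# A 4x4 toy "feature map" (hypothetical values) becomes four 2x2 sub-bands.
frame = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
f_ll, f_lh, f_hl, f_hh = haar_dwt2(frame)
```

With this scaling the transform is orthonormal, so the total energy (sum of squares) of the four sub-bands equals that of the input. For the smooth ramp above, almost all of the energy lands in `f_ll`, while the three detail bands stay small; fine textures would instead produce large detail coefficients, which is the information a high-frequency encoder is meant to capture.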