SNeRV: Spectra-preserving Neural Representation for Video

Jina Kim, Jihoo Lee, Je-Won Kang

TL;DR

SNeRV tackles spectral bias in implicit video representations by decomposing video signals with a 2D discrete wavelet transform, embedding only the low-frequency content, and letting a specialized decoder reconstruct the high-frequency textures. A multi-resolution fusion unit and a high-frequency restorer enable compact, detail-rich reconstructions, while a temporal extension adds a 1D wavelet decomposition along the time axis and uses temporally extended up-sampling blocks to capture cross-frame correlations. On the Bunny, UVG, and DAVIS datasets, SNeRV outperforms prior NeRV models in video regression and performs well on interpolation and compression. The work offers a frequency-domain approach to implicit video representation, with practical implications for video storage, transmission, and reconstruction quality.

Abstract

Neural representation for video (NeRV), which employs a neural network to parameterize video signals, introduces a novel methodology for video representation. However, existing NeRV-based methods have difficulty capturing fine spatial details and motion patterns due to spectral bias, in which a neural network learns high-frequency (HF) components at a slower rate than low-frequency (LF) components. In this paper, we propose spectra-preserving NeRV (SNeRV) as a novel approach to enhance implicit video representations by efficiently handling various frequency components. SNeRV uses a 2D discrete wavelet transform (DWT) to decompose video into LF and HF features, preserving spatial structures and directly addressing the spectral bias issue. To keep the representation compact, we encode only the LF components, while the HF components, which include fine textures, are generated by a decoder. Specialized modules, including a multi-resolution fusion unit (MFU) and a high-frequency restorer (HFR), are integrated into a backbone to facilitate the representation. Furthermore, we extend SNeRV to effectively capture temporal correlations between adjacent video frames by casting the extension as an additional frequency decomposition along the temporal domain. This approach allows us to embed spatio-temporal LF features into the network, using temporally extended up-sampling blocks (TUBs). Experimental results demonstrate that SNeRV outperforms existing NeRV models in capturing fine details and achieves enhanced reconstruction, making it a promising approach in the field of implicit video representations. The code is available at https://github.com/qwertja/SNeRV.
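
To make the LF/HF split in the abstract concrete, the following is a minimal sketch of a single-level 2D DWT on one grayscale frame. It assumes an orthonormal Haar wavelet, which the abstract does not specify; the function names `haar_dwt2` and `haar_idwt2` and the `lh`/`hl`/`hh` sub-band labels are illustrative and not taken from the SNeRV code.

```python
import numpy as np


def haar_dwt2(frame):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W.

    Returns the LF sub-band `ll` and a tuple of three HF sub-bands.
    The Haar filter is an assumption made for this sketch.
    """
    x = np.asarray(frame, dtype=np.float64)
    # The four pixels of every non-overlapping 2x2 block.
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # local average: low-frequency content
    lh = (a + b - c - d) / 2.0  # difference between rows
    hl = (a - b + c - d) / 2.0  # difference between columns
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, (lh, hl, hh)


def haar_idwt2(ll, hf):
    """Inverse of `haar_dwt2`: rebuild the (2H, 2W) frame from sub-bands."""
    lh, hl, hh = hf
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out
```

In this sketch `ll` has a quarter of the samples of the input frame, which is why embedding only the LF sub-band saves capacity. Reconstruction is exact when all four sub-bands are kept; in SNeRV the HF sub-bands are not stored but generated by the decoder, a step this sketch does not model.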

Paper Structure (28 sections, 3 equations, 16 figures, 14 tables)

This paper contains 28 sections, 3 equations, 16 figures, 14 tables.

Figures (16)

  • Figure 1: Visual comparisons of the reconstructed HF coefficients of previous NeRV methods (HNeRV, DNeRV) and the proposed method on the "Jockey" sequence. Our model is designed to efficiently encode fine details by implicitly circumventing the spectral bias problem.
  • Figure 1: Encoding complexity comparisons in UVG datasets.
  • Figure 2: SNeRV backbone encoder and decoder architectures. The encoder applies 2D DWT to extract LF and HF features and embeds only the LF feature to save parameters. The decoder uses MFU and HFR to efficiently process the LF and HF features. CT and RB refer to transposed convolution and residual blocks, respectively.
  • Figure 2: Qualitative results of video regression task on Breakdance dataset at t=28 (top) and t=49 (bottom).
  • Figure 3: Temporal extension of the backbone: the encoder applies an additional 1D DWT to generate spatio-temporal embeddings. The decoder uses TUBs to process these features.
  • ...and 11 more figures