MetaNeRV: Meta Neural Representations for Videos with Spatial-Temporal Guidance

Jialong Guo, Ke Liu, Jiangchao Yao, Zhihua Wang, Jiajun Bu, Haishuai Wang

TL;DR

MetaNeRV introduces a meta-learning framework to learn a high-quality initialization for image-wise implicit neural video representations, enabling rapid adaptation to unseen videos. By adding spatial guidance through multi-resolution supervision and temporal guidance via progressive inner-loop tasks, it improves both convergence speed and reconstruction quality, demonstrated across diverse real-world and medical datasets. The approach also extends to practical tasks like video denoising and compression, achieving competitive or superior results to traditional codecs and prior NeRV-based methods, including favorable rate-distortion performance and robustness to out-of-distribution data. Overall, MetaNeRV offers a scalable path to efficient, high-quality video representation and processing using learned priors over neural video representations.

Abstract

Neural Representations for Videos (NeRV) has emerged as a promising implicit neural representation (INR) approach for video analysis, which represents videos as neural networks with frame indexes as inputs. However, NeRV-based methods are time-consuming when adapting to a large number of diverse videos, as each video requires a separate NeRV model to be trained from scratch. In addition, NeRV-based methods must spatially generate a high-dimensional signal (i.e., an entire image) from a low-dimensional timestamp input, while temporally a video typically consists of tens of frames with only minor changes between adjacent frames. To improve the efficiency of video representation, we propose Meta Neural Representations for Videos, named MetaNeRV, a novel framework for fast NeRV representation of unseen videos. MetaNeRV leverages a meta-learning framework to learn an optimal parameter initialization, which serves as a good starting point for adapting to new videos. To address the unique spatial and temporal characteristics of the video modality, we further introduce spatial-temporal guidance to improve the representation capabilities of MetaNeRV. Specifically, the spatial guidance uses a multi-resolution loss to capture information from different resolution stages, and the temporal guidance uses a progressive learning strategy that gradually increases the number of fitted frames during the meta-learning process. Extensive experiments conducted on multiple datasets demonstrate the superiority of MetaNeRV for video representation and video compression.
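
The idea of a meta-learned initialization can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a linear stand-in for the NeRV network, synthetic "videos" that share common structure, and a Reptile-style first-order outer update chosen here for simplicity (the abstract does not state the exact outer-loop rule). All names (`sample_video`, `adapt`, `theta`) and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, P, D = 8, 16, 3  # frames per video, pixels per frame, embedding size

# Embed each normalized frame index t as [1, sin(2*pi*t), cos(2*pi*t)].
t = np.arange(T) / T
E = np.stack([np.ones(T), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])  # (D, T)

W_shared = rng.normal(size=(P, D))  # structure common to all toy videos (assumption)

def sample_video():
    """A toy 'video': frames of shape (P, T) from a video-specific weight matrix."""
    W_v = W_shared + 0.1 * rng.normal(size=(P, D))
    return W_v @ E

def loss(W, frames):
    """Mean squared reconstruction error over all frames and pixels."""
    return np.mean((W @ E - frames) ** 2)

def adapt(W, frames, steps=3, lr=1.0):
    """Inner loop: a few gradient-descent steps on one video, starting from a copy of W."""
    W = W.copy()
    for _ in range(steps):
        grad = 2.0 * (W @ E - frames) @ E.T / frames.size
        W -= lr * grad
    return W

# Outer loop: move the initialization toward each video's adapted weights.
theta = np.zeros((P, D))
for _ in range(200):
    frames = sample_video()
    phi = adapt(theta, frames)
    theta += 0.5 * (phi - theta)  # Reptile-style update (stand-in, see lead-in)

# Compare 3-step adaptation on an unseen video from two starting points.
new_video = sample_video()
loss_zero = loss(adapt(np.zeros((P, D)), new_video), new_video)
loss_meta = loss(adapt(theta, new_video), new_video)
print(f"3-step loss from zeros: {loss_zero:.4f}, from meta-init: {loss_meta:.4f}")
```

In this toy setting the meta-learned starting point already encodes what the videos have in common, so the same three inner steps leave a much smaller residual than starting from zeros.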

Paper Structure (20 sections, 11 equations, 18 figures, 4 tables, 1 algorithm)

Figures (18)

  • Figure 1: (a) The NeRV network takes a frame index as input and outputs the image at that index. Querying a sequence of frame indexes yields a sequence of images, which can represent a video. (b) A randomly initialized network needs many optimization steps for each new video, whereas a meta-learned initialization enables swift adaptation to new videos.
  • Figure 2: (left) Visualization of three-step inference results on video representation tasks for E-NeRV and our method under different guidance settings, where TG, SG, NoG, and STG denote Temporal Guidance, Spatial Guidance, No Guidance, and Spatio-Temporal Guidance, respectively. (middle) The average PSNR and training time curves on the UCF dataset under different guidance, where the model trains faster and performs better under spatio-temporal guidance. (right) The average PSNR and training time curves of E-NeRV and our method on four datasets, given a target PSNR value of 30.
  • Figure 3: Framework for MetaNeRV. A meta-learner samples video-fitting tasks and learns initialization weights that can be quickly fine-tuned to a new video. The initialization weights are cloned and then optimized for m steps on each of n subtasks drawn from the corresponding video.
  • Figure 4: (a) The NeRV network takes a one-dimensional frame index as input, which is expanded through NeRV blocks to the image size to output the corresponding frame. We propose adding a header block for spatial guidance at each NeRV block layer. (b) We propose a progressive training strategy for temporal guidance, gradually increasing the number of video frames in subtasks during meta-learning.
  • Figure 5: The visualization of NeRV, E-NeRV, FFNeRV, HNeRV, and MetaNeRV fitting the MCL_JCV, HMDB-51, UCF101, EchoCP, and EchoNet-LVH examples. Notably, our method produces remarkable results in merely 3 iteration steps. "step 0" represents inference results directly from the initialization weight without further training.
  • ...and 13 more figures
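
The two forms of guidance described in the Figure 4 caption can be sketched roughly as follows. This is only an illustration under stated assumptions: the average-pooling downsampling, the equal weighting of resolution stages, and the linear frame-count schedule are all hypothetical choices, and the function names are invented for this example rather than taken from the paper.

```python
import numpy as np

def avg_pool(frame, factor):
    """Downsample an (H, W) frame by averaging non-overlapping factor x factor blocks."""
    h, w = frame.shape
    return frame.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def multi_resolution_loss(preds, target):
    """Spatial guidance sketch: sum of MSEs between each intermediate prediction
    and the target downsampled to that prediction's resolution.
    Equal weights per stage are an assumption."""
    total = 0.0
    for pred in preds:
        factor = target.shape[0] // pred.shape[0]
        total += np.mean((pred - avg_pool(target, factor)) ** 2)
    return total

def frames_for_iteration(it, total_iters, min_frames, max_frames):
    """Temporal guidance sketch: hypothetical linear schedule that grows the
    number of frames per subtask from min_frames to max_frames."""
    frac = it / max(total_iters - 1, 1)
    return min_frames + round(frac * (max_frames - min_frames))

# Usage: three stages at 1x1, 2x2, and 4x4 supervise one 4x4 target frame.
target = np.arange(16, dtype=float).reshape(4, 4)
preds = [np.zeros((1, 1)), np.zeros((2, 2)), np.zeros((4, 4))]
print(multi_resolution_loss(preds, target))
print([frames_for_iteration(i, 10, 2, 8) for i in range(10)])
```

Supervising every stage gives the early, low-resolution blocks a direct training signal, while the schedule lets early meta-iterations fit short clips before longer ones.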