PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos

Qi Zhao; M. Salman Asif; Zhan Ma

TL;DR

PNeRV addresses spatial inconsistency in neural video representations by introducing a pyramidal decoder that fuses multi-scale information through a lightweight Kronecker Fully-connected (KFc) layer and a gated Benign Selective Memory (BSM). The approach is grounded in a Universal Approximation Theory analysis of NeRV, arguing that hierarchical shortcuts enable efficient, global-context learning with fewer parameters. Empirical results on UVG and DAVIS show consistent improvements in PSNR, SSIM, LPIPS, and FVD over state-of-the-art NeRV models, validating both the theoretical and architectural advances. The work highlights the practical potential of structured multi-scale neural representations for high-quality, coherent video reconstruction, while acknowledging increased computational complexity as an area for future optimization.

Abstract

The primary focus of Neural Representation for Videos (NeRV) is to effectively model the spatiotemporal consistency of a video. However, current NeRV systems often suffer from spatial inconsistency, which lowers perceptual quality. To address this issue, we introduce the Pyramidal Neural Representation for Videos (PNeRV), which is built on a multi-scale information connection and comprises a lightweight rescaling operator, the Kronecker Fully-connected layer (KFc), and a Benign Selective Memory (BSM) mechanism. The KFc, inspired by the tensor decomposition of the vanilla Fully-connected layer, facilitates low-cost rescaling and global correlation modeling. BSM adaptively merges high-level features with granular ones. Furthermore, we provide an analysis based on the Universal Approximation Theory of the NeRV system and validate the effectiveness of the proposed PNeRV. We conducted comprehensive experiments to demonstrate that PNeRV surpasses the performance of contemporary NeRV models, achieving the best results in video regression on UVG and DAVIS under various metrics (PSNR, SSIM, LPIPS, and FVD). Compared to vanilla NeRV, PNeRV achieves a +4.49 dB gain in PSNR and a 231% increase in FVD on UVG, along with a +3.28 dB PSNR and 634% FVD increase on DAVIS.
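To make the reported PSNR gains concrete, the following is a minimal sketch of how PSNR is commonly computed and what a gain in dB implies for mean squared error. The toy pixel values, the `peak=1.0` convention, and the helper names are assumptions for illustration; the paper's evaluation code is not shown here.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and equally sized")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(reference, reconstruction, peak=1.0):
    """PSNR in dB: 10 * log10(peak^2 / MSE); infinite for a perfect match."""
    err = mse(reference, reconstruction)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak."""
    return 10.0 ** (gain_db / 10.0)

# Toy "frame" with pixel values in [0, 1]; the numbers are illustrative only.
frame = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5]
noisy = [0.1, 0.25, 0.4, 0.75, 0.9, 0.5]

print(round(psnr(frame, noisy), 2))
print(round(mse_ratio_for_gain(4.49), 2))  # gain reported on UVG
print(round(mse_ratio_for_gain(3.28), 2))  # gain reported on DAVIS
```

Under this definition, a +4.49 dB gain corresponds to roughly 2.8 times lower MSE and a +3.28 dB gain to roughly 2.1 times lower MSE, assuming the same peak value in both measurements.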

Paper Structure

This paper contains 18 sections, 1 theorem, 12 equations, 4 figures, 7 tables.

Introduction
Related Work
Pyramidal Neural Representation for Videos
Kronecker Fully-connected Layer
Benign Selective Memory
Overall Structure
Universal Approximation Theory on NeRV
Basic Definitions and Notations
Implicit Neural Video Coding
UAT Analysis of Cascaded NeRV Model
UAT Analysis of PNeRV
Experiment
Video Regression on UVG
Video Regression on DAVIS
Ablation Studies
...and 3 more sections

Key Result

Theorem 1

For a cascaded NeRV system to $\epsilon$-approximate a video $V$ which is implicitly characterized by a certain unknown $L$-Lipschitz continuous function $\mathcal{F}: K \to \mathbb{R}^{d_{out}}$ where $K \subseteq \mathbb{R}^{d_{in}}$ is a compact set, then the upper bound of the minimal parameter qu
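
For reference, the two standard notions the statement relies on can be written as follows. The choice of norm and the symbol $\mathcal{G}_\theta$ for the NeRV model with parameters $\theta$ are assumptions made here for illustration; the paper's own Definitions 1 to 3 are not reproduced in this summary.

$$\|\mathcal{F}(x) - \mathcal{F}(y)\| \le L \, \|x - y\| \quad \text{for all } x, y \in K$$

$$\sup_{x \in K} \|\mathcal{F}(x) - \mathcal{G}_\theta(x)\| \le \epsilon$$

The first line is $L$-Lipschitz continuity of $\mathcal{F}$ on $K$; the second is one common way to state that $\mathcal{G}_\theta$ $\epsilon$-approximates $\mathcal{F}$ uniformly on $K$.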

Figures (4)

Figure 1: High-quality video ($1920 \times 960$) reconstruction comparisons between the proposed Pyramidal NeRV and other models, with PSNR shown in yellow. PNeRV outperforms other models on perceptual quality, with less noise and fewer artifacts, while maintaining spatial consistency.
Figure 2: Visualized comparison between PixelShuffle and KFc, where $\times$ denotes matrix multiplication and the black box is the subpixel area. PixelShuffle fills the subpixels using a local receptive field and thus lacks long-range relationship modeling ability, while KFc calculates the correlation between every position.
Figure 3: The overall architecture of PNeRV, which consists of KFc and BSM. The right part compares the parameters and FLOPs of PixelShuffle (PS) and KFc, where the input feature maps have size $c \times h \times w$, the upscaling rate is $r$, and the kernel size in PS is $k \times k$.
Figure 4: Visual comparison on various videos. "Bmx" has larger motion, "Elephant" has massive droplets blurring, "Parkour" involves both camera rotation and extreme dynamics, "Dance" contains large motion under high-frequency reed leaves. "Jockey", "ReadyS", and "ShakeN" are videos with complex spatiotemporal correlation in UVG. Zoom in for a detailed comparison.
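
The captions of Figures 2 and 3 contrast PixelShuffle with KFc in terms of global correlation and parameter cost. The sketch below illustrates only the generic Kronecker identity behind such a factorization: rescaling an $h \times w$ map with two small matrices gives the same result as one dense layer whose weight is their Kronecker product, at a fraction of the parameters. It is not the paper's exact KFc (channel mixing and bias are omitted), and the names `factored_rescale`, `dense_rescale`, `w_h`, and `w_w` are hypothetical.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*m)]

def kron(a, b):
    """Kronecker product of two nested-list matrices."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def factored_rescale(x, w_h, w_w):
    """Rescale an (h x w) map to (H x W) as w_h @ x @ w_w^T."""
    return matmul(matmul(w_h, x), transpose(w_w))

def dense_rescale(x, w_h, w_w):
    """Same map via one dense layer applied to the row-major flattened input."""
    full = kron(w_h, w_w)                   # (H*W) x (h*w) weight matrix
    flat = [[v] for row in x for v in row]  # column vector of length h*w
    flat_out = [r[0] for r in matmul(full, flat)]
    width = len(w_w)                        # output width W
    return [flat_out[i:i + width] for i in range(0, len(flat_out), width)]

random.seed(0)
h, w, r = 3, 4, 2  # toy spatial size and upscaling rate
x = [[random.uniform(-1, 1) for _ in range(w)] for _ in range(h)]
w_h = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(r * h)]
w_w = [[random.uniform(-1, 1) for _ in range(w)] for _ in range(r * w)]

y_fact = factored_rescale(x, w_h, w_w)
y_dense = dense_rescale(x, w_h, w_w)
max_diff = max(abs(a - b) for ra, rb in zip(y_fact, y_dense) for a, b in zip(ra, rb))

params_factored = r * h * h + r * w * w    # two small factor matrices
params_dense = (r * h * r * w) * (h * w)   # one full fully-connected weight
print(max_diff < 1e-9, params_factored, params_dense)
```

Every output position depends on every input position in both variants, which is the global-correlation property the Figure 2 caption attributes to KFc; the factored form simply stores far fewer weights than the dense one in this toy setting.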

Theorems & Definitions (7)

Definition 1
Definition 2
Definition 3
Remark 1
Theorem 1
Remark 2
Remark 3
