Tree-NeRV: A Tree-Structured Neural Representation for Efficient Non-Uniform Video Encoding

Jiancheng Zhao, Yifan Zhan, Qingtian Zhu, Mingze Ma, Muyao Niu, Zunian Wan, Xiang Ji, Yinqiang Zheng

TL;DR

Tree-NeRV tackles the inefficiency of uniform temporal sampling in implicit neural video representations by organizing its feature grid as a Binary Search Tree (BST), which enables non-uniform, adaptive sampling along the video timeline. An optimization-driven training strategy grows the tree to allocate more samples to high-variation regions, while AVL balancing keeps queries efficient. The method couples a BST-based time embedding with cascaded NeRV blocks, achieving state-of-the-art reconstruction quality and competitive RD performance across standard datasets, along with faster encoding and decoding than several baselines. Empirically, Tree-NeRV delivers notable PSNR gains and aligns its sampling density with the temporal dynamics of the video; pruning strategies are left as a possible improvement for future work.

Abstract

Implicit Neural Representations for Videos (NeRV) have emerged as a powerful paradigm for video representation, enabling direct mappings from frame indices to video frames. However, existing NeRV-based methods do not fully exploit temporal redundancy, as they rely on uniform sampling along the temporal axis, leading to suboptimal rate-distortion (RD) performance. To address this limitation, we propose Tree-NeRV, a novel tree-structured feature representation for efficient and adaptive video encoding. Unlike conventional approaches, Tree-NeRV organizes feature representations within a Binary Search Tree (BST), enabling non-uniform sampling along the temporal axis. Additionally, we introduce an optimization-driven sampling strategy, dynamically allocating higher sampling density to regions with greater temporal variation. Extensive experiments demonstrate that Tree-NeRV achieves superior compression efficiency and reconstruction quality, outperforming prior uniform sampling-based methods. Code will be released.

Paper Structure

This paper contains 29 sections, 15 equations, 14 figures, 13 tables, 1 algorithm.

Figures (14)

  • Figure 1: For a video sequence, uniform sampling tends to under-represent the high-dynamic regions and waste bitrate in the low-dynamic ones, whereas our Tree Structure Sampling is better suited to this uneven distribution of temporal redundancy.
  • Figure 2: Overview of Tree-NeRV. (a) Each node in the tree structure $T$ consists of a temporal key $k$, a feature value $v$, and its left and right subtrees, $T_L$ and $T_R$, respectively. Given a query temporal index $t_i$, Tree-NeRV searches for the lower and upper bounds $(k_i^l,v_i^l)$ and $(k_i^u,v_i^u)$, then performs linear interpolation between them to obtain the corresponding time embedding $v_i$ (\ref{sec:tree}); (b) The time embedding $v_i$ is then processed through cascaded NeRV blocks to upsample and generate the final prediction $\hat{x}_i$ (\ref{sec:NeRV}); (c) During training, an optimization-driven tree-growing and resampling strategy is employed to adaptively learn the temporal redundancy distribution of the video, allocating higher sampling density to regions with greater temporal variation (\ref{sec:tree growing}).
  • Figure 3: (a) and (b) illustrate two types of rotation operations used for rebalancing. For simplicity, nodes are represented by their keys. Red nodes indicate unbalanced nodes, while blue nodes represent balanced nodes. Dashed lines depict connections that have been modified during the rotation process.
  • Figure 4: Video representation results. Other methods failed to capture certain details, such as the digits on the scoreboard in 'Jockey' (Top) and the intricate wing structure of the honeybee in 'Honeybee' (Bottom). In contrast, our method effectively captured and reconstructed these fine details.
  • Figure 5: Sampling results of our Tree-NeRV. In the figure, the red line represents the temporal variation of the video sequence, quantified by the mean squared error (MSE) between adjacent frames. The blue line depicts the probability density function (PDF) of Tree-NeRV's actual sampling points. It is evident that Tree-NeRV's sampling density aligns closely with the temporal variation trends of the video sequence.
  • ...and 9 more figures
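
The lookup described in the Figure 2(a) caption (find the lower and upper bounding keys for a query time, then interpolate linearly between their feature values) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `insert`, `find_bounds`, and `time_embedding` are hypothetical, plain lists of floats stand in for learned feature tensors, and the AVL rebalancing and tree-growing steps are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    key: float                      # temporal key k
    value: List[float]              # feature value v (a list stands in for a tensor)
    left: Optional["Node"] = None   # subtree T_L, holding keys smaller than key
    right: Optional["Node"] = None  # subtree T_R, holding keys larger than key


def insert(root: Optional[Node], key: float, value: List[float]) -> Node:
    """Plain BST insertion (assumption: AVL rebalancing is omitted here)."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        root.left = insert(root.left, key, value)
    elif key > root.key:
        root.right = insert(root.right, key, value)
    else:
        root.value = value  # same key: overwrite the stored feature
    return root


def find_bounds(root: Optional[Node], t: float) -> Tuple[Optional[Node], Optional[Node]]:
    """Return the nodes with the largest key <= t and the smallest key >= t."""
    lower = upper = None
    node = root
    while node is not None:
        if t == node.key:
            return node, node
        if t < node.key:
            upper = node       # node.key is a candidate upper bound
            node = node.left
        else:
            lower = node       # node.key is a candidate lower bound
            node = node.right
    return lower, upper


def time_embedding(root: Optional[Node], t: float) -> List[float]:
    """Linearly interpolate between the two bounding feature values."""
    lower, upper = find_bounds(root, t)
    if lower is None or upper is None:
        raise ValueError("t lies outside the range of stored keys")
    if lower is upper:
        return list(lower.value)
    w = (t - lower.key) / (upper.key - lower.key)
    return [(1 - w) * a + w * b for a, b in zip(lower.value, upper.value)]


# Non-uniform keys: denser sampling in the first half of the timeline.
root = None
for k, v in [(0.5, [2.0, 4.0]), (0.0, [0.0, 0.0]), (1.0, [4.0, 0.0]), (0.25, [1.0, 2.0])]:
    root = insert(root, k, v)

print(time_embedding(root, 0.375))  # halfway between the keys 0.25 and 0.5
```

Because the keys need not be evenly spaced, the interpolation weight is computed from the actual distance to each bounding key rather than from a fixed grid step; this is what lets a tree of this kind place more samples where the video changes quickly.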