TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos

Namitha Padmanabhan, Matthew Gwilliam, Abhinav Shrivastava

TL;DR

This work decomposes the weight prediction task spatially and temporally by breaking short video segments into patch tubelets, which reduces the pretraining memory overhead. It is the first hypernetwork approach to demonstrate results at 480p, 720p and 1080p on UVG, HEVC and MCL-JCV.

Abstract

Implicit Neural Representations (INRs) have recently demonstrated impressive performance for video compression. However, since a separate INR must be overfit for each video, scaling to high-resolution videos while maintaining encoding efficiency remains a significant challenge. Hypernetwork-based approaches predict INR weights (hyponetworks) for unseen videos at high speeds, but with low quality, large compressed size, and prohibitive memory needs at higher resolutions. We address these fundamental limitations through three key contributions: (1) an approach that decomposes the weight prediction task spatially and temporally, by breaking short video segments into patch tubelets, to reduce the pretraining memory overhead by 20$\times$; (2) a residual-based storage scheme that captures only differences between consecutive segment representations, significantly reducing bitstream size; and (3) a temporal coherence regularization framework that encourages changes in the weight space to be correlated with video content. Our proposed method, TeCoNeRV, achieves substantial improvements of 2.47dB and 5.35dB PSNR over the baseline at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3$\times$ faster encoding speeds. With our low memory usage, we are the first hypernetwork approach to demonstrate results at 480p, 720p and 1080p on UVG, HEVC and MCL-JCV. Our project page is available at https://namithap10.github.io/teconerv/ .
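
As a rough illustration of contribution (1), the sketch below splits a short video segment into non-overlapping patch tubelets, so that a weight predictor would only ever see small spatio-temporal blocks rather than full frames. It is a minimal NumPy sketch, not the authors' implementation: the function names, the `(T, H, W, C)` layout, and the toy clip and patch sizes are all assumptions made for illustration, and the overlapping-patch variant mentioned in the figure captions is not covered.

```python
import numpy as np


def to_patch_tubelets(clip, patch_h, patch_w):
    """Split a clip of shape (T, H, W, C) into non-overlapping patch tubelets.

    Returns an array of shape (N, T, patch_h, patch_w, C) with
    N = (H // patch_h) * (W // patch_w), ordered row-major over the patch grid.
    Hypothetical helper; layout and naming are assumptions.
    """
    t, h, w, c = clip.shape
    if h % patch_h or w % patch_w:
        raise ValueError("frame size must be divisible by the patch size")
    rows, cols = h // patch_h, w // patch_w
    x = clip.reshape(t, rows, patch_h, cols, patch_w, c)
    x = x.transpose(1, 3, 0, 2, 4, 5)  # (rows, cols, T, patch_h, patch_w, C)
    return x.reshape(rows * cols, t, patch_h, patch_w, c)


def from_patch_tubelets(tubelets, height, width):
    """Inverse of to_patch_tubelets: stitch tubelets back into (T, H, W, C)."""
    n, t, patch_h, patch_w, c = tubelets.shape
    rows, cols = height // patch_h, width // patch_w
    if rows * cols != n:
        raise ValueError("tubelet count does not match the target frame size")
    x = tubelets.reshape(rows, cols, t, patch_h, patch_w, c)
    x = x.transpose(2, 0, 3, 1, 4, 5)  # (T, rows, patch_h, cols, patch_w, C)
    return x.reshape(t, height, width, c)


# Toy example: a 4-frame, 48x64 RGB clip split into a 2x2 grid of tubelets.
clip = np.random.default_rng(0).random((4, 48, 64, 3))
tubelets = to_patch_tubelets(clip, patch_h=24, patch_w=32)
restored = from_patch_tubelets(tubelets, height=48, width=64)
```

Because every tubelet has the same fixed shape regardless of the frame resolution, the per-sample input size stays constant as resolution grows; only the number of tubelets changes.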

Paper Structure (38 sections, 4 equations, 10 figures, 13 tables)

Figures (10)

  • Figure 1: TeCoNeRV achieves smoother weight transitions across video clips, enabling superior compression. Bottom: Selected frames from consecutive clips of three videos from the UVG dataset, with the change in pixel space visualized in the middle column. Top: L2 norm of clip-to-clip weight residuals over time for the baseline (NeRV-Enc [chen2024fast]) versus our approach (TeCoNeRV). Our temporal coherence regularization produces smaller weight residuals as video content evolves, resulting in efficient compression while preserving visual quality.
  • Figure 2: Overview of TeCoNeRV. Above: Hypernetworks from prior work predict weights for entire video frames at once, resulting in large base parameters and bitstream size. Below: Our approach with (a) patch-tubelets that decouple spatial resolution from memory requirements, (b) residual encoding that stores only weight differences across time steps, and (c) temporal coherence finetuning that regularizes weight differences. Together, these components achieve better compression efficiency and superior reconstruction quality.
  • Figure 3: Quality vs. bitrate on UVG (left) and Kinetics-400 (right) at 480p. Rate–distortion curves showing PSNR vs. bpp for NeRV-Enc*, Patch-Tubelet and TeCoNeRV models. For each method, we show the Pareto frontier across two model size configurations. TeCoNeRV outperforms the baseline, achieving up to 2.47dB higher PSNR at comparable bitrates.
  • Figure 4: Qualitative comparison at 720p. We show 3 frames from UVG, HEVC Class E, and MCL-JCV datasets. Top to bottom: Ground truth, NeRV-Enc*, ours ("no overlap"), ours ("overlap with cropping"). Our method's reconstructions preserve structural details, while the baseline exhibits noticeable quality degradation. Our "overlap with cropping" patch fusion strategy eliminates the boundary artifacts in ours ("no overlap"). Best viewed digitally and zoomed in.
  • Figure 5: Rate-distortion and encoding speed vs. number of patches. Varying patch size and overlap produces different operating points at 720p (top) and 1080p (bottom) inference. Marker size corresponds to number of patches.
  • ...and 5 more figures
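
The residual storage idea and the quantity plotted in Figure 1 (top) can be sketched in a few lines. This is a hedged NumPy illustration rather than the paper's pipeline: it assumes each clip's predicted weights are flattened into one row of a `(num_clips, num_params)` array and that the first clip is stored in full, and it omits quantization and entropy coding. All function names are hypothetical.

```python
import numpy as np


def encode_residuals(weights):
    """Keep the first clip's weights plus the clip-to-clip differences.

    weights: array of shape (num_clips, num_params).
    Storing the first clip in full is an assumption of this sketch.
    """
    base = weights[0].copy()
    residuals = np.diff(weights, axis=0)  # row i is weights[i + 1] - weights[i]
    return base, residuals


def decode_residuals(base, residuals):
    """Rebuild every clip's weights by accumulating residuals onto the base."""
    later = base + np.cumsum(residuals, axis=0)
    return np.concatenate([base[None], later], axis=0)


def residual_l2_norms(residuals):
    """L2 norm of each clip-to-clip weight residual (one value per transition)."""
    return np.linalg.norm(residuals, axis=1)


# Toy weights that drift slowly from clip to clip, standing in for
# temporally coherent hypernetwork predictions.
rng = np.random.default_rng(0)
first = rng.standard_normal(1000)
drift = 0.01 * rng.standard_normal((4, 1000))
weights = np.concatenate([first[None], first + np.cumsum(drift, axis=0)], axis=0)

base, residuals = encode_residuals(weights)
norms = residual_l2_norms(residuals)
recovered = decode_residuals(base, residuals)
```

In this toy setup the residual norms are far smaller than the norm of the weights themselves, which is the property that makes the residuals cheaper to store than full per-clip weights.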