
Fast Encoding and Decoding for Implicit Video Representation

Hao Chen, Saining Xie, Ser-Nam Lim, Abhinav Shrivastava

TL;DR

Two key components are introduced: NeRV-Enc, a transformer-based hyper-network for fast encoding; and NeRV-Dec, a parallel decoder for efficient video loading.

Abstract

Despite the abundant availability and rich content of video data, its high dimensionality poses challenges for video research. Recent work has explored implicit representations of videos using neural networks, demonstrating strong performance in applications such as video compression and enhancement. However, prolonged encoding time remains a persistent challenge for video Implicit Neural Representations (INRs). In this paper, we focus on improving the speed of video encoding and decoding within implicit representations. We introduce two key components: NeRV-Enc, a transformer-based hyper-network for fast encoding; and NeRV-Dec, a parallel decoder for efficient video loading. NeRV-Enc achieves a speed-up of $\mathbf{10^4\times}$ by eliminating gradient-based optimization. Meanwhile, NeRV-Dec simplifies video decoding, loading videos $\mathbf{11\times}$ faster than conventional codecs and surpassing RAM loading of pre-decoded videos ($\mathbf{2.5\times}$ faster while being $\mathbf{65\times}$ smaller in size).

Paper Structure

This paper contains 19 sections, 3 equations, 9 figures, 10 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Left: video encoding for implicit video representations. NeRV-Enc is $10^4\times$ faster than the NeRV baseline [chen2021nerv], which uses gradient-based optimization. * denotes a larger encoder and more training videos. Right: video decoding. NeRV-Dec decodes videos $8.9\times$ faster than NeRV and $11\times$ faster than H.264. It is even $2.5\times$ faster than loading pre-decoded videos from RAM while being $65\times$ smaller in video size.
  • Figure 2: Top: video encoding. NeRV-Enc processes the input video $x$ to get video-specific weights $\hat{\theta}'$ using the hyper-network. Bottom: video decoding. NeRV-Dec generates the final NeRV weights $\theta'$ and reconstructs the video $\hat{x}$.
  • Figure 3: Weight token distributions across layers. Left: Uniform (TransINR [chen2022transinr]). Middle: Layer-specific (GINR [kim2022scalable]). Right: Layer-adaptive (ours).
  • Figure 4: Left: Generate video-specific weights $\hat{\theta}'$ via the hyper-network. Right: Generate NeRV weights $\theta'$ by element-wise multiplication of $\hat{\theta}'$ and video-agnostic weights $\theta_1$.
  • Figure 6: Visualizations for INR encoding methods: TransINR [chen2022transinr] (Top), GINR [kim2022generalizable] (Middle), and NeRV-Enc (Bottom, ours). Our method reconstructs videos with higher fidelity and finer details. Best viewed digitally and zoomed in.
  • ...and 4 more figures
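
The Figure 4 caption describes the final NeRV weights $\theta'$ as an element-wise product of the video-specific weights $\hat{\theta}'$ and the video-agnostic weights $\theta_1$. The step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function name `modulate_weights`, the 2x2 shapes, and the numeric values are hypothetical, and the sketch assumes both weight matrices already have the same shape, so any token-to-weight expansion or broadcasting the paper may use is not shown.

```python
def modulate_weights(video_specific, video_agnostic):
    """Return the element-wise product of two equally shaped 2-D weight matrices.

    video_specific stands in for the hyper-network output (theta_hat'),
    video_agnostic for the shared weights (theta_1). Both are lists of rows.
    """
    if len(video_specific) != len(video_agnostic):
        raise ValueError("row counts differ")
    product = []
    for row_specific, row_agnostic in zip(video_specific, video_agnostic):
        if len(row_specific) != len(row_agnostic):
            raise ValueError("column counts differ")
        # Multiply matching entries; no matrix multiplication is involved.
        product.append([s * a for s, a in zip(row_specific, row_agnostic)])
    return product


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
theta_hat = [[1.0, 0.5], [2.0, -1.0]]   # video-specific weights
theta_1 = [[0.2, 0.4], [0.1, 0.3]]      # video-agnostic weights
theta = modulate_weights(theta_hat, theta_1)
print(theta)
```

Under this reading, only $\hat{\theta}'$ depends on the input video, while $\theta_1$ is shared across all videos.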