HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks

Maria Pilligua, Danna Xue, Javier Vazquez-Corral

TL;DR

HyperNVD introduces a meta-learning framework in which a hypernetwork generates the parameters of a compact INR-based neural video decomposition model, enabling fast adaptation to unseen videos. A frozen VideoMAE encoder provides video embeddings that condition the hypernetwork to output per-video weights and multiresolution hash encoding (MRHE) parameters, allowing multi-video training and rapid fine-tuning while preserving reconstruction quality. On the DAVIS dataset, the method performs competitively with prior methods and adapts to new scenes substantially faster (e.g., about $0.8$ dB improvement on unseen videos and roughly $30$ minutes faster), while retaining robust editing capabilities. The approach reduces overfitting to single videos and improves editing efficiency, offering practical benefits for professional video editing workflows.

Abstract

Decomposing a video into a layer-based representation is crucial for video editing in the creative industries, as it enables independent editing of specific layers. Existing video-layer decomposition models rely on implicit neural representations (INRs) trained independently for each video, making the process time-consuming when applied to new videos. To address this limitation, we propose a meta-learning strategy that learns a generic video decomposition model to speed up training on new videos. Our model is based on a hypernetwork architecture which, given a video-encoder embedding, generates the parameters for a compact INR-based neural video decomposition model. Our strategy mitigates the problem of single-video overfitting and, importantly, shortens the convergence time of video decomposition on new, unseen videos. Our code is available at: https://hypernvd.github.io/
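
The core idea in the abstract, a hypernetwork that maps a video embedding to the parameters of a small coordinate network, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear heads, the layer sizes, the embedding dimension, and the names `init_hypernet`, `generate_params`, and `target_inr` are all assumptions chosen for illustration, and the random embeddings stand in for real VideoMAE features.

```python
import numpy as np

def init_hypernet(embed_dim, target_shapes, seed=0):
    """One linear head per target tensor (hypothetical layout, not the paper's)."""
    rng = np.random.default_rng(seed)
    heads = {}
    for name, shape in target_shapes.items():
        n_out = int(np.prod(shape))
        heads[name] = {
            "W": rng.normal(0.0, 0.01, size=(embed_dim, n_out)),
            "b": np.zeros(n_out),
        }
    return heads

def generate_params(heads, embedding, target_shapes):
    """Map one video embedding to a full set of target-network parameters."""
    return {
        name: (embedding @ head["W"] + head["b"]).reshape(target_shapes[name])
        for name, head in heads.items()
    }

def target_inr(params, coords):
    """Tiny coordinate MLP: (x, y, t) -> RGB in (0, 1)."""
    hidden = np.maximum(coords @ params["w1"] + params["b1"], 0.0)  # ReLU
    logits = hidden @ params["w2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

# Assumed sizes, far smaller than any practical model.
TARGET_SHAPES = {"w1": (3, 16), "b1": (16,), "w2": (16, 3), "b2": (3,)}
EMBED_DIM = 8

heads = init_hypernet(EMBED_DIM, TARGET_SHAPES)
rng = np.random.default_rng(1)
embedding_a = rng.normal(size=EMBED_DIM)  # stand-in for one video's embedding
embedding_b = rng.normal(size=EMBED_DIM)  # stand-in for another video's embedding

params_a = generate_params(heads, embedding_a, TARGET_SHAPES)
params_b = generate_params(heads, embedding_b, TARGET_SHAPES)

coords = rng.uniform(size=(5, 3))  # five (x, y, t) query points
rgb_a = target_inr(params_a, coords)
```

Only the hypernetwork heads would be trained across videos in such a setup; each video's target parameters are an output of the heads, which is what lets a new video start from generated parameters rather than from a random initialization.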

Paper Structure

This paper contains 21 sections, 17 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Fine-tuning the video decomposition model from our metamodel (HyperNVD) versus training from scratch on unseen videos shows clear advantages. With initialization from our HyperNVD trained on 15 videos, the model converges faster to the same PSNR and ultimately achieves better performance.
  • Figure 2: The architecture of our HyperNVD. HyperNVD consists of i) the MAE encoder, which generates the video embedding, ii) the hypernet, which generates the model parameters, and iii) the target neural video decomposition (NVD) model. Given an input video, the MAE encoder encodes it into a compact embedding $e$. The hypernet $\mathcal{H}$ then generates the parameters of the NVD model, including the MultiResolution Hash Encoding (MRHE) and the model weights. The NVD model includes two layer modules (foreground and background) for reconstructing different components of the video and an alpha module for predicting the soft mask used to blend the layers. The reconstructed frame is generated by adding the foreground and background texture layers with an opacity map. Each layer module consists of a mapping module, a texture module, and a residual module.
  • Figure 3: Training of the autoencoder to get a compressed embedding from VideoMAE. First, we extract features $o$ from each 3D patch using the pre-trained VideoMAE. We then train an autoencoder to obtain a compressed embedding $e$ by minimizing the difference between the output $\hat{o}$ and the input $o$.
  • Figure 4: Comparison of the video "hike" reconstructed by models trained with different numbers of videos. The top row shows the full frames, while the bottom row zooms in on a specific patch. Models trained on multiple videos capture the primary structure of the frame but tend to lose some high-frequency details.
  • Figure 5: Comparison of the reconstruction quality for the "hike" video trained with different numbers of videos. Our HyperNVD model trained on multiple videos shows a 3 dB drop in PSNR compared to single-video training. The PSNR remains stable across multiple video training, suggesting a performance plateau.
  • ...and 6 more figures
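
The Figure 2 caption describes blending foreground and background layers with an opacity map, and Figures 1 and 5 report quality in PSNR. The sketch below illustrates both, assuming the standard convex blend $\alpha \cdot F + (1 - \alpha) \cdot B$ and the usual PSNR definition for values in $[0, 1]$. The exact compositing rule used by the paper is not stated in this excerpt, and the function names and toy layers are hypothetical.

```python
import numpy as np

def blend_layers(foreground, background, alpha):
    """Convex blend of two RGB layers with a per-pixel soft mask in [0, 1]."""
    alpha = alpha[..., None]  # (H, W) -> (H, W, 1) so it broadcasts over RGB
    return alpha * foreground + (1.0 - alpha) * background

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy layers: a bright foreground, a dark background, and a mask with a soft edge.
height, width = 4, 6
foreground = np.full((height, width, 3), 0.9)
background = np.full((height, width, 3), 0.1)
alpha = np.zeros((height, width))
alpha[:, : width // 2] = 1.0  # left half is fully foreground
alpha[:, width // 2] = 0.5    # one column of half-transparent pixels

frame = blend_layers(foreground, background, alpha)
```

Because PSNR is logarithmic in the mean squared error, a 3 dB drop such as the one reported in Figure 5 corresponds to roughly doubling the MSE.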