HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks
Maria Pilligua, Danna Xue, Javier Vazquez-Corral
TL;DR
HyperNVD introduces a meta-learning framework in which a hypernetwork generates the parameters of a compact neural video decomposition model based on implicit neural representations (INRs), enabling fast adaptation to unseen videos. A frozen VideoMAE encoder provides video embeddings that condition the hypernetwork to output per-video weights and multi-resolution hash encoding (MRHE) parameters, allowing multi-video training and rapid fine-tuning while preserving reconstruction quality. Empirical results on the DAVIS dataset show quantitative performance competitive with prior methods and substantial speedups when adapting to new scenes (e.g., about $0.8$ dB improvement on unseen videos and roughly $30$ minutes faster), along with robust editing capabilities. The approach reduces overfitting to single videos and improves editing efficiency, offering practical benefits for professional video editing workflows.
Abstract
Decomposing a video into a layer-based representation is crucial for video editing in the creative industries, as it enables each layer to be edited independently. Existing video-layer decomposition models rely on implicit neural representations (INRs) trained independently for each video, which makes the process time-consuming when applied to new videos. To address this limitation, we propose a meta-learning strategy that learns a generic video decomposition model in order to speed up training on new videos. Our model is based on a hypernetwork architecture which, given a video-encoder embedding, generates the parameters of a compact INR-based neural video decomposition model. Our strategy mitigates the problem of single-video overfitting and, importantly, shortens the convergence time of video decomposition on new, unseen videos. Our code is available at: https://hypernvd.github.io/
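
To make the hypernetwork idea concrete, the following is a minimal, illustrative sketch in plain Python and not the authors' implementation. It assumes a single linear hypernetwork head, a two-layer coordinate MLP with a sine activation and a sigmoid output, and hand-written four-dimensional vectors standing in for the frozen encoder's embeddings. All names and sizes (`HyperNetwork`, `inr_forward`, `embed_dim=4`, `inr_hidden=8`) are hypothetical, and the hash encoding and the layer decomposition outputs of the actual model are omitted.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class HyperNetwork:
    """Maps a video embedding to the parameters of a small INR.

    Assumption: one linear head emits all INR weights and biases as a
    flat vector. The sizes are illustrative, not taken from the paper.
    """

    def __init__(self, embed_dim, inr_in=3, inr_hidden=8, inr_out=3, seed=0):
        rng = random.Random(seed)
        # (n_out, n_in) for each layer of the target INR.
        self.shapes = [(inr_hidden, inr_in), (inr_out, inr_hidden)]
        self.n_params = sum(o * i + o for o, i in self.shapes)
        # Only these head parameters would be trained across videos.
        self.head_w = [[rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
                       for _ in range(self.n_params)]
        self.head_b = [0.0] * self.n_params

    def generate(self, embedding):
        """Return per-video INR layers as a list of (weights, bias) pairs."""
        flat = linear(embedding, self.head_w, self.head_b)
        layers, pos = [], 0
        for n_out, n_in in self.shapes:
            w = [flat[pos + r * n_in: pos + (r + 1) * n_in]
                 for r in range(n_out)]
            pos += n_out * n_in
            b = flat[pos: pos + n_out]
            pos += n_out
            layers.append((w, b))
        return layers


def inr_forward(layers, coord):
    """Evaluate the generated INR at one (x, y, t) coordinate."""
    (w1, b1), (w2, b2) = layers
    hidden = [math.sin(h) for h in linear(coord, w1, b1)]  # assumed activation
    # Sigmoid keeps the three outputs in (0, 1), like normalized RGB.
    return [1.0 / (1.0 + math.exp(-o)) for o in linear(hidden, w2, b2)]


hyper = HyperNetwork(embed_dim=4)
video_a = [0.5, -1.0, 0.25, 2.0]   # stand-in for a frozen encoder embedding
video_b = [-0.3, 0.8, 1.5, -0.7]   # a second, different video
coord = [0.1, 0.2, 0.0]            # (x, y, t) query point
rgb_a = inr_forward(hyper.generate(video_a), coord)
rgb_b = inr_forward(hyper.generate(video_b), coord)
```

In this sketch the same hypernetwork produces a different INR for each embedding, so adapting to a new video means updating shared hypernetwork parameters rather than training a separate INR from scratch.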
