DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes

Hao Yan, Zhihui Ke, Xiaobo Zhou, Tie Qiu, Xidong Shi, Dadong Jiang

TL;DR

DS-NeRV decomposes videos into sparse learnable static codes and dynamic codes without the need for explicit optical flow or residual supervision, and it outperforms existing NeRV methods in many downstream tasks.

Abstract

Implicit neural representations for video (NeRV) have recently emerged as a new approach to high-quality video representation. However, existing works employ a single network to represent the entire video, which implicitly confuses static and dynamic information. As a result, they cannot effectively compress the redundant static information, and they lack explicit modeling of globally temporally coherent dynamic details. To solve these problems, we propose DS-NeRV, which decomposes videos into sparse learnable static codes and dynamic codes without the need for explicit optical flow or residual supervision. By setting different sampling rates for the two codes and applying weighted-sum and interpolation sampling methods, DS-NeRV efficiently utilizes redundant static information while maintaining high-frequency details. Additionally, we design a cross-channel attention-based (CCA) fusion module to efficiently fuse these two codes for frame decoding. Our approach achieves high-quality reconstruction of 31.2 PSNR with only 0.35M parameters, thanks to the separate static and dynamic code representation, and outperforms existing NeRV methods in many downstream tasks. Our project website is at https://haoyan14.github.io/DS-NeRV.
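For readers unfamiliar with the reported metric, the following is a minimal sketch of how a PSNR value is commonly computed from a reference signal and its reconstruction, using the standard definition PSNR = 10 log10(MAX^2 / MSE). It is not taken from the paper: the function name `psnr`, the flat-list pixel representation, and the default pixel range of [0, 1] are assumptions made for illustration, and the abstract does not say how the 31.2 figure is averaged over frames or videos.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float],
         reconstruction: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals.

    Assumption: pixels are given as a flat sequence of floats in
    [0, max_value]; the paper's exact evaluation protocol is not stated here.
    """
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        # Identical signals: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, a uniform error of 0.1 on pixels in [0, 1] gives an MSE of about 0.01 and therefore a PSNR of about 20 dB; higher values indicate a closer reconstruction.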

Paper Structure (30 sections, 8 equations, 12 figures, 11 tables)


Figures (12)

  • Figure 1: (Left) Our proposed DS-NeRV decomposes the video into learnable static and dynamic codes, which represent the static and dynamic elements in the video. (Right) Video reconstruction results for various implicit neural representations with 0.35M parameters.
  • Figure 2: DS-NeRV framework overview. DS-NeRV decomposes the video into learnable static and dynamic codes. Static Codes. The two static codes shown in orange are the two nearest to index $t$; after a weighted sum, they are forwarded to the fusion decoder. Dynamic Codes. We interpolate the dynamic codes to match the length of the video. The dynamic code corresponding to $t$, shown in blue, is then selected.
  • Figure 3: (a) Pipeline of the fusion decoder. The decoder takes the static code and dynamic code corresponding to index $t$ as input and fuses their information to output the frame. (b) Architecture of the CCA fusion module. The CCA module fuses the static code $\widetilde{c}^s_t$ and the dynamic code $\widetilde{c}^d_t$ by cross-channel attention.
  • Figure 4: Video reconstruction results on UVG and DAVIS. (Top) Jockey. (Bottom) Blackswan.
  • Figure 5: Video inpainting results on DAVIS. (Top) Car-Shadow with 5 masks of width 50. (Bottom) Boat with a central mask with width and height both 1/4 of the video.
  • ...and 7 more figures
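
The code sampling described in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation: it assumes that codes are plain float vectors, that both code sets are evenly spaced over the video, and that both the weighted sum and the interpolation are linear in temporal distance. The names `interp_at`, `sample_static_code`, and `interpolate_dynamic_codes` are hypothetical. Under these assumptions the two operations share the same arithmetic and differ only in how many codes are stored, which is where the different sampling rates come in.

```python
from typing import List, Sequence

Code = List[float]


def interp_at(codes: Sequence[Code], t: int, num_frames: int) -> Code:
    """Linearly blend the two codes nearest to frame index t.

    Assumption: the codes are evenly spaced over frames 0..num_frames-1.
    """
    k = len(codes)
    if k == 1 or num_frames == 1:
        return list(codes[0])
    # Fractional position of frame t on the code axis.
    pos = t * (k - 1) / (num_frames - 1)
    lo = min(int(pos), k - 2)   # index of the left neighbour
    w = pos - lo                # weight of the right neighbour
    return [(1.0 - w) * a + w * b for a, b in zip(codes[lo], codes[lo + 1])]


def sample_static_code(static_codes: Sequence[Code], t: int, num_frames: int) -> Code:
    """Weighted sum of the two static codes nearest to frame t (assumed linear weights)."""
    return interp_at(static_codes, t, num_frames)


def interpolate_dynamic_codes(dynamic_codes: Sequence[Code], num_frames: int) -> List[Code]:
    """Interpolate the dynamic codes to one code per frame (assumed linear)."""
    return [interp_at(dynamic_codes, t, num_frames) for t in range(num_frames)]


# Usage: a 5-frame video with 2 static codes and 3 dynamic codes.
static_codes = [[0.0, 0.0], [10.0, 20.0]]
dynamic_codes = [[0.0], [4.0], [8.0]]
static_t1 = sample_static_code(static_codes, t=1, num_frames=5)
dynamic_all = interpolate_dynamic_codes(dynamic_codes, num_frames=5)
dynamic_t1 = dynamic_all[1]  # the dynamic code selected for frame t = 1
```

In this toy setup the static code for frame 1 is a 75/25 blend of the two stored static codes, while the dynamic sequence is expanded to five per-frame codes before the one at index $t$ is picked. Both sampled codes would then be passed to the fusion decoder, which is not sketched here.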