Neural Video Representation for Redundancy Reduction and Consistency Preservation

Taiga Hayami, Takahiro Shindo, Shunsuke Akamatsu, Hiroshi Watanabe

TL;DR

This paper proposes a video representation method that generates the high-frequency and low-frequency components of each frame separately, using features extracted from the high-frequency components and temporal information, respectively. This improves the reconstruction quality of the high-frequency components and enhances the temporal consistency of the frames.

Abstract

Implicit neural representations (INRs) embed various signals into neural networks. They have gained attention in recent years because of their versatility in handling diverse signal types. In the context of video, INRs achieve compression by embedding video signals directly into networks and then compressing those networks. Conventional methods use as network input either an index that expresses the time of the frame or features extracted from individual frames. The latter approach provides greater expressive capability because the input is specific to each video. However, the features extracted from frames often contain redundancy, which contradicts the purpose of video compression. Such redundancy also makes it challenging to accurately reconstruct high-frequency components in the frames. To address these problems, we focus on separating the high-frequency and low-frequency components of the reconstructed frame. We propose a video representation method that generates both the high-frequency and low-frequency components of the frame, using features extracted from the high-frequency components and temporal information, respectively. Experimental results demonstrate that our method outperforms the existing HNeRV method, achieving superior results in 96 percent of the videos.
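To make the notion of separating a frame into low-frequency and high-frequency components concrete, the following is a minimal sketch and not the authors' method. It assumes a simple box blur as the low-pass filter and treats the residual as the high-frequency component; the function names `box_blur`, `split_frequencies`, and `merge_frequencies` are hypothetical. In the paper the two components are generated by network streams rather than obtained by filtering.

```python
def box_blur(frame, radius=1):
    """Low-pass filter (assumed for illustration): mean over a
    (2*radius+1) x (2*radius+1) window, clipped at the frame borders."""
    height, width = len(frame), len(frame[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += frame[yy][xx]
                    count += 1
            blurred[y][x] = total / count
    return blurred


def split_frequencies(frame, radius=1):
    """Return (low, high), where low is the blurred frame and
    high is the residual frame - low (edges and fine texture)."""
    low = box_blur(frame, radius)
    high = [
        [pixel - low_pixel for pixel, low_pixel in zip(row, low_row)]
        for row, low_row in zip(frame, low)
    ]
    return low, high


def merge_frequencies(low, high):
    """Reconstruct a frame by summing its two components."""
    return [
        [low_pixel + high_pixel for low_pixel, high_pixel in zip(low_row, high_row)]
        for low_row, high_row in zip(low, high)
    ]


# Toy grayscale frame with a sharp vertical edge.
frame = [[0.0, 0.0, 10.0, 10.0] for _ in range(4)]
low, high = split_frequencies(frame)
reconstructed = merge_frequencies(low, high)
```

In this toy frame the high-frequency residual is nonzero only around the edge, while a constant frame would yield a residual of zero everywhere. This is the sense in which the low-frequency part carries smooth content and the high-frequency part carries detail.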

Paper Structure

This paper contains 10 sections, 1 equation, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Visualization of reconstruction results in "soapbox" and "stroller" videos. The red number is the PSNR for the frame.
  • Figure 2: The architecture of the proposed method. It consists of the HF-stream (blue arrow) and the LF-stream (green line). The part circled in purple is treated as video.
  • Figure 3: Visualization results of features extracted by the encoder.
  • Figure 4: Comparison of PSNR between the proposed method and HNeRV for reconstructed videos. The horizontal axis represents the video sequences from the DAVIS dataset, and the vertical axis represents the PSNR difference between the proposed method and HNeRV. Positive values on the vertical axis indicate that the proposed method outperforms HNeRV. Video sequences where HNeRV did not perform well are highlighted in red, while those using the proposed method are highlighted in orange. The proposed method ensures a minimum level of quality even when training does not fully converge.
  • Figure 5: Visualization of consecutive frames from "hockey" video. The red numbers indicate the PSNR for the entire frame.
  • ...and 1 more figures
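
The figure captions above report quality as PSNR (peak signal-to-noise ratio). As a reminder of how that metric is computed, here is a minimal sketch using the standard definition, PSNR = 10 log10(MAX^2 / MSE). It assumes 8-bit pixel values (a peak of 255) and frames given as nested lists; the function name `psnr` and the parameter `max_value` are illustrative, and the paper's exact evaluation settings are not stated in this section.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Returns math.inf when the frames are identical (zero error).
    """
    squared_error = 0.0
    pixel_count = 0
    for ref_row, rec_row in zip(reference, reconstructed):
        for ref_pixel, rec_pixel in zip(ref_row, rec_row):
            squared_error += (ref_pixel - rec_pixel) ** 2
            pixel_count += 1
    mse = squared_error / pixel_count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy frames: every pixel of the reconstruction is off by one.
reference = [[100, 120], [140, 160]]
slightly_off = [[101, 121], [141, 161]]
score = psnr(reference, slightly_off)
```

A higher PSNR means the reconstruction is closer to the original, so a positive PSNR difference between two methods on the same video, as plotted in Figure 4, favors the first method.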