EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training

Yiying Wei, Hadi Amirpour, Jong Hwan Ko, Christian Timmerer

TL;DR

This work proposes an efficient patch sampling method named EPS for video SR network overfitting, which identifies the most valuable training patches from video frames, and achieves an 83% decrease in overall run time.

Abstract

Leveraging the overfitting property of deep neural networks (DNNs) is an emerging trend in video delivery systems for enhancing quality within bandwidth limits. Existing approaches transmit overfitted super-resolution (SR) model streams alongside low-resolution (LR) bitstreams, and the decoder uses both to reconstruct high-resolution (HR) videos. Although these approaches show promising results, the high computational cost of training on a large number of video frames limits their practical application. To overcome this challenge, we propose an efficient patch sampling method named EPS for video SR network overfitting, which identifies the most valuable training patches from video frames. To this end, we first present two low-complexity Discrete Cosine Transform (DCT)-based spatial-temporal features that directly measure the complexity score of each patch. By analyzing the histogram distribution of these features, we then categorize all possible patches into clusters and select training patches from the cluster with the highest spatial-temporal information. The number of sampled patches adapts to the video content, addressing the trade-off between training complexity and efficiency. Our method reduces the number of training patches to between 4% and 25% of all possible patches, depending on the resolution and the number of clusters, while maintaining high video quality and significantly improving training efficiency. Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.
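
To make the idea of DCT-based patch scoring concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact feature definitions, so everything here is an assumption chosen for illustration: the spatial score is taken as the mean magnitude of the non-DC DCT coefficients of a patch, the temporal score as the same quantity computed on the difference between co-located patches of consecutive frames, and the function names (`spatial_score`, `temporal_score`, `patch_scores`) and the 8x8 patch size are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row has its own normalization
    return m


def dct2(block: np.ndarray) -> np.ndarray:
    """2-D DCT-II of a square block, applied separably to rows and columns."""
    d = dct_matrix(block.shape[0])
    return d @ block @ d.T


def ac_energy(block: np.ndarray) -> float:
    """Mean magnitude of the DCT coefficients with the DC term excluded."""
    coeffs = np.abs(dct2(block.astype(np.float64)))
    coeffs[0, 0] = 0.0  # drop DC so flat regions score zero
    return float(coeffs.sum() / block.size)


def spatial_score(patch: np.ndarray) -> float:
    """Assumed spatial feature: texture energy of a single patch."""
    return ac_energy(patch)


def temporal_score(patch: np.ndarray, prev_patch: np.ndarray) -> float:
    """Assumed temporal feature: texture energy of the inter-frame difference."""
    diff = patch.astype(np.float64) - prev_patch.astype(np.float64)
    return ac_energy(diff)


def patch_scores(frame: np.ndarray, prev_frame: np.ndarray, size: int = 8):
    """Slice a grayscale frame into non-overlapping patches and score each one."""
    h, w = frame.shape
    sf, tf = [], []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            cur = frame[y:y + size, x:x + size]
            prev = prev_frame[y:y + size, x:x + size]
            sf.append(spatial_score(cur))
            tf.append(temporal_score(cur, prev))
    return np.array(sf), np.array(tf)
```

Calling `patch_scores(frame, prev_frame)` on two consecutive grayscale frames returns one spatial and one temporal score per patch in raster order. Under these assumed definitions, flat or static patches score near zero and textured or changing patches score higher, which is the kind of ranking a sampler needs.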

Paper Structure

This paper contains 10 sections, 2 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the EPS method. Each video frame is sliced into patches. The informative complexity of each patch is determined by spatial features (SF) and temporal features (TF). For each frame, we group all patches into $N$ clusters based on the histogram distribution of feature scores and select the cluster with the highest spatial-temporal information for training a content-aware SR model (using a pre-trained model as a basis). In the figure, the number of clusters is set to two for readability. The blue and orange patches represent clusters with high SF and TF scores, respectively.
  • Figure 2: Example heatmaps of SF and TF scores of video frames from the Inter4K dataset.
  • Figure 3: Example of two different patch sampling algorithms. Selecting the top $r\%$ of patches may either miss important information or lead to redundant training on similar patches, while our sampling algorithm mitigates these issues.
  • Figure 4: Example of the proposed patch sampling algorithm for $N=\{2,3\}$.
  • Figure 5: Super-resolution quality comparison using our method (seventh column) with baseline methods.
  • ...and 1 more figures
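
The contrast drawn in the Figure 3 caption, between picking a fixed top $r\%$ of patches and picking a whole cluster, can be illustrated with a small sketch. This is an assumption-laden stand-in rather than the paper's algorithm: the source does not specify how the histogram is partitioned, so the code below simply splits the score range into $N$ equal-width bins with `np.histogram` and keeps every patch in the highest bin. The function names `select_top_cluster` and `select_top_percent` are hypothetical.

```python
import math

import numpy as np


def select_top_cluster(scores, n_clusters: int = 2) -> np.ndarray:
    """Keep the indices of patches in the highest of n equal-width score bins.

    Equal-width binning is an illustrative assumption, not the paper's rule.
    """
    scores = np.asarray(scores, dtype=np.float64)
    _, edges = np.histogram(scores, bins=n_clusters)
    threshold = edges[-2]  # lower edge of the highest bin
    return np.flatnonzero(scores >= threshold)


def select_top_percent(scores, r: float) -> np.ndarray:
    """Baseline: keep the indices of the top r percent of patches by score."""
    scores = np.asarray(scores, dtype=np.float64)
    k = max(1, math.ceil(len(scores) * r / 100.0))
    return np.argsort(scores)[::-1][:k]


# Three clearly complex patches (indices 3, 4, 5) among eight.
scores = [0.10, 0.20, 0.15, 0.90, 0.95, 0.85, 0.12, 0.18]
cluster_pick = select_top_cluster(scores, n_clusters=2)
fixed_pick = select_top_percent(scores, r=25)
```

In this toy input, the cluster-based rule returns all three high-scoring patches, while a fixed 25% budget keeps only two of them. The number of patches the cluster-based rule selects follows the score distribution, which mirrors the adaptive behavior described in the abstract.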