Highly Efficient No-reference 4K Video Quality Assessment with Full-Pixel Covering Sampling and Training Strategy

Xiaoheng Tan, Jiabin Zhang, Yuhui Quan, Jing Li, Yajing Wu, Zilin Bian

TL;DR

This work tackles no-reference 4K video quality assessment under practical compute constraints. It introduces Full-Pixel Covering (FuPiC) sampling to feed full 4K frame content into a Swin Transformer-based VQA model, with per-frame supervision computed as $O_f^k = \frac{1}{N}\sum_{i=1}^N o_i^k$, enabling efficient training and inference. A region-aware scoring scheme learns patch-level weights to reflect differential region importance, and a multi-frequency feature fusion based on the Haar Transform captures high-frequency details while reducing encoder input to one-quarter; the overall score $Q$ aggregates frame scores across time using $Q = \frac{\sum_{j=1}^{\lfloor T/t \rfloor} O_f^j}{\lfloor T/t \rfloor}$ with $t=10$. A new 4K NR VQA dataset is built, and extensive experiments show state-of-the-art performance on the 4K data and strong results on other NR datasets, with practical inference times (≈0.041 s per 4K frame on a V100). The approach demonstrates high practical impact for real-world 4K video QC, including streaming platforms and content restoration workflows.
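The two averaging formulas above can be illustrated with a minimal sketch. It assumes that $o_i^k$ is the model output for the $i$-th sampled sub-region of frame $k$, that $T$ is the total number of frames, and that one frame is scored per block of $t$ frames. These readings are assumptions, since the summary does not define the symbols, and the function names `frame_score` and `video_score` are hypothetical. The learned region weights of the scoring scheme are not modelled here.

```python
import numpy as np


def frame_score(patch_scores):
    """O_f^k = (1/N) * sum_i o_i^k: plain mean of the N scores of frame k."""
    patch_scores = np.asarray(patch_scores, dtype=float)
    if patch_scores.size == 0:
        raise ValueError("need at least one patch score")
    return float(patch_scores.mean())


def video_score(frame_scores, total_frames, t=10):
    """Q = (sum_{j=1}^{floor(T/t)} O_f^j) / floor(T/t)."""
    n = total_frames // t  # floor(T / t) scored frames
    if n == 0:
        raise ValueError("video is shorter than one sampling interval")
    frame_scores = np.asarray(frame_scores, dtype=float)
    if frame_scores.size < n:
        raise ValueError("fewer frame scores than floor(T / t)")
    # Average only the first floor(T / t) frame scores, as in the formula.
    return float(frame_scores[:n].sum() / n)


if __name__ == "__main__":
    # Hypothetical scores: 3 scored frames, 4 sub-region outputs each.
    per_frame = [frame_score(s) for s in ([3.0, 4.0, 5.0, 4.0],
                                          [2.0, 2.0, 3.0, 3.0],
                                          [4.0, 4.0, 4.0, 4.0])]
    print(video_score(per_frame, total_frames=35, t=10))
```

With $T = 35$ and $t = 10$, $\lfloor T/t \rfloor = 3$, so exactly three frame scores enter the average.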

Abstract

Deep video quality assessment (VQA) methods have shown impressive performance. No-reference (NR) VQA methods in particular play a vital role when reference videos are restricted or unavailable. However, as more streaming video is produced in ultra-high definition (e.g., 4K) to enrich viewers' experiences, current deep VQA methods face unacceptable computational costs. Furthermore, the resizing, cropping, and local sampling techniques these methods employ can compromise the details and content of the original 4K videos, degrading quality assessment. In this paper, we propose a highly efficient NR 4K VQA method. First, a novel data sampling and training strategy tackles the problem of excessive resolution: it allows the Swin Transformer-based VQA model to train and run inference on the full data of 4K videos on standard consumer-grade GPUs without compromising content or details. Second, a weighting and scoring scheme mimics human subjective perception by accounting for the distinct impact of each sub-region within a 4K frame on the overall perception. Third, we incorporate frequency-domain information from video frames to better capture the details that affect video quality, which further improves the model's generalizability. To our knowledge, this is the first method for the NR 4K VQA task. Thorough empirical studies demonstrate that it not only significantly outperforms existing methods on a specialized 4K VQA dataset but also achieves state-of-the-art performance across multiple open-source NR video quality datasets.
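The TL;DR states that the frequency-domain features are based on the Haar Transform and that the encoder input shrinks to one-quarter. The sketch below shows a generic single-level 2D Haar decomposition in NumPy, in which each sub-band has half the height and half the width of the frame, hence one-quarter of its pixels. It is an illustration of the standard transform only: the function name `haar_dwt2` and the sub-band names are hypothetical, and how the paper fuses the sub-bands before the encoder is not described in this summary.

```python
import numpy as np


def haar_dwt2(frame):
    """Single-level orthonormal 2D Haar transform of an (H, W) array.

    Returns four sub-bands of shape (H/2, W/2): the low-frequency average
    and three high-frequency detail bands.
    """
    x = np.asarray(frame, dtype=float)
    if x.ndim != 2:
        raise ValueError("expected a 2D (single-channel) frame")
    h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0          # smooth content
    detail_cols = (a - b + c - d) / 2.0  # changes between adjacent columns
    detail_rows = (a + b - c - d) / 2.0  # changes between adjacent rows
    detail_diag = (a - b - c + d) / 2.0  # diagonal changes
    return low, detail_cols, detail_rows, detail_diag


if __name__ == "__main__":
    # A 4K-sized luminance frame (2160 x 3840) with random content.
    frame = np.random.default_rng(0).random((2160, 3840))
    bands = haar_dwt2(frame)
    print([band.shape for band in bands])
```

Because the transform is orthonormal, the summed energy of the four sub-bands equals the energy of the input, so no pixel information is discarded by the decomposition itself.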

Paper Structure (23 sections, 5 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Comparison of data sampling strategies in VQA methods on 4K videos. (a) The commonly used sampling strategies and our proposed strategy. (b) The content-covering percentage of sampling strategies on 4K videos. Both traditional and grid-based strategies cover only a minimal amount of content, where the grid-based method also accesses some global information. In contrast, our strategy is capable of covering the entire content.
  • Figure 2: Overview of our proposed method. We utilize a Swin Transformer as the Encoder to extract features.
  • Figure 3: Comparison of the multi-frequency components of two similar frames from different videos.
  • Figure 4: Distribution percentage of video indicators. The normalized feature space of each indicator is uniformly divided into 5 bins.
  • Figure 5: Distribution percentage of MOS in our dataset.
  • ...and 2 more figures