
Video Quality Assessment for Online Processing: From Spatial to Temporal Sampling

Jiebin Yan, Lei Wu, Yuming Fang, Xuelin Liu, Xue Xia, Weide Liu

TL;DR

The paper tackles the problem of efficient video quality assessment by examining the effectiveness of joint spatio-temporal sampling to reduce input while maintaining accuracy. It formalizes a two-step pipeline where temporal sampling produces keyframes $F$ via $F = T(\mathcal{V}; n, \eta)$ and spatial sampling yields $\hat{F}=S(F; g, s)$, followed by a lightweight spatio-temporal model $f_{st}=\mathcal{N}(\hat{F}; \theta_n)$ and a regression head to predict $q$. The study conducts extensive experiments on six BVQA datasets, comparing various temporal (TSN/TSM/ECO) and spatial sampling schemes and introducing an online VQA variant MGQA that emphasizes efficiency with MobileNet and graph-based regression; results show substantial data reduction (e.g., MGQA processing only a fraction of the original data) with minimal performance loss in many settings. Overall, the work provides practical guidance for deploying online VQA systems and highlights how aggressive sampling can preserve perceptual quality prediction while enabling low-latency, scalable video QoE assessment.
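The two sampling steps $T$ and $S$ can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that $n$ is the number of keyframes, $\eta$ is the frame stride, $g$ is the number of grid cells per side, and $s$ is the side length of the patch kept from each cell; the summary does not define these parameters, so those meanings are guesses. The function names `temporal_sample` and `spatial_sample` and the choice of the top-left patch in each cell are likewise hypothetical.

```python
import numpy as np


def temporal_sample(video, n, eta):
    """T(V; n, eta): keep n frames spaced eta frames apart (assumed meaning)."""
    num_frames = video.shape[0]
    idx = np.arange(n) * eta
    if idx[-1] >= num_frames:
        raise ValueError("n * eta exceeds the video length")
    return video[idx]


def spatial_sample(frames, g, s):
    """S(F; g, s): split each frame into a g x g grid, keep one s x s patch
    per cell (top-left corner, an arbitrary choice) and stitch the patches."""
    t, h, w, c = frames.shape
    cell_h, cell_w = h // g, w // g
    if s > cell_h or s > cell_w:
        raise ValueError("patch size s does not fit inside a grid cell")
    out = np.empty((t, g * s, g * s, c), dtype=frames.dtype)
    for i in range(g):
        for j in range(g):
            y, x = i * cell_h, j * cell_w
            out[:, i * s:(i + 1) * s, j * s:(j + 1) * s] = frames[:, y:y + s, x:x + s]
    return out


# Toy video: 64 frames of 270 x 480 RGB (frames, height, width, channels).
video = np.zeros((64, 270, 480, 3), dtype=np.uint8)
keyframes = temporal_sample(video, n=8, eta=8)   # F
squeezed = spatial_sample(keyframes, g=7, s=32)  # F-hat
kept = squeezed.size / video.size                # fraction of raw values retained
print(keyframes.shape, squeezed.shape, round(kept, 4))
```

With these toy settings the squeezed input keeps under 5% of the original pixel values, which is the kind of reduction the pipeline aims for before the features reach $\mathcal{N}$ and the regression head.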

Abstract

With the rapid development of multimedia processing and deep learning technologies, especially in the field of video understanding, video quality assessment (VQA) has achieved significant progress. Although researchers have moved from designing efficient video quality mapping models to various research directions, in-depth exploration of the effectiveness-efficiency trade-offs of spatio-temporal modeling in VQA models remains insufficient. Considering that videos contain highly redundant information, this paper investigates the problem from the perspective of joint spatial and temporal sampling, asking how little information must be kept when feeding videos into VQA models while incurring only an acceptable performance sacrifice. To this end, we drastically sample the video's information along both the spatial and temporal dimensions, and the heavily squeezed video is then fed into a stable VQA model. Comprehensive experiments on joint spatial and temporal sampling are conducted on six public video quality databases, and the results demonstrate acceptable performance of the VQA model even when most of the video information is thrown away. Furthermore, with the proposed joint spatial and temporal sampling strategy, we make an initial attempt to design an online VQA model, which is instantiated with a spatial feature extractor, a temporal feature fusion module, and a global quality regression module, each kept as simple as possible. Through quantitative and qualitative experiments, we verify the feasibility of an online VQA model obtained by simplifying the model itself and reducing its input.

Paper Structure (19 sections, 13 equations, 5 figures, 10 tables)

This paper contains 19 sections, 13 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: An illustration of the spatio-temporal sampling paradigm, which extracts representatives for quality prediction from stacked spatio-temporal blocks. Note that a video can be broken down into many spatio-temporal blocks by spatio-temporal grid division.
  • Figure 2: The whole framework of this study (a). Given a video sequence, we first squeeze the input video by joint spatial and temporal sampling (b and c) to obtain stacked spatio-temporal blocks. The squeezed video is then fed into the spatio-temporal modeling module, which is instantiated by a spatial feature extractor and a temporal feature fusion module under the philosophy of minimalism. Finally, the extracted features are used to predict video quality.
  • Figure 3: An intuitive comparison of different temporal sampling methods. Both TSN and TSM divide a video into $M$ segments; their difference is that TSM extracts four frames (the default setting in this paper) at a fixed length for each segment, while TSN samples only one frame from each segment. The difference between TSM and ECO is that TSM has a fixed number of segments that is determined by the number of video frames of each video, while ECO has a varying number of segments for different videos. ($M$ represents the number of video blocks, $N$ denotes the total number of video frames, and $m$ indicates the number of frames contained in each block.)
  • Figure 4: Comparison of the proposed MGQA with different spatial feature extractors in terms of parameters and FLOPs.
  • Figure 5: The visual examples of spatio-temporal local quality maps, where blue areas refer to relatively low quality scores and red areas refer to high scores. (a)(b)(c) are from the KoNViD-1k dataset with a resolution of 540p; (d)(e)(f) are from the LIVE-VQC dataset with a resolution of 1080p.
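
The segment-based temporal sampling contrasted in Figure 3 can be sketched as index selection over $N$ frames split into $M$ segments. This is a hedged illustration rather than the paper's code. It assumes, as in the original TSN design, that the TSN-style sampler takes one frame per segment, and that the TSM-style sampler takes $m = 4$ consecutive frames from the start of each segment. The helper names, the use of the centre frame, and dropping the remainder frames are choices made here for determinism. ECO is omitted because the caption does not say how its segment count is chosen.

```python
import numpy as np


def segment_starts(num_frames, num_segments):
    """Start index and length of M equal segments (remainder frames dropped)."""
    seg_len = num_frames // num_segments
    if seg_len == 0:
        raise ValueError("more segments than frames")
    return np.arange(num_segments) * seg_len, seg_len


def tsn_style_indices(num_frames, num_segments):
    """One frame per segment; the centre frame is an assumed, deterministic pick."""
    starts, seg_len = segment_starts(num_frames, num_segments)
    return starts + seg_len // 2


def tsm_style_indices(num_frames, num_segments, frames_per_segment=4):
    """m consecutive frames from the start of every segment."""
    starts, seg_len = segment_starts(num_frames, num_segments)
    if frames_per_segment > seg_len:
        raise ValueError("segment shorter than frames_per_segment")
    offsets = np.arange(frames_per_segment)
    # Broadcasting builds an (M, m) index grid, flattened in segment order.
    return (starts[:, None] + offsets[None, :]).ravel()


N, M = 240, 8
tsn = tsn_style_indices(N, M)
tsm = tsm_style_indices(N, M)
print(tsn.tolist())      # one index per segment
print(tsm[:8].tolist())  # first two segments, four frames each
```

For a 240-frame video and 8 segments, the TSN-style sampler keeps 8 frames and the TSM-style sampler keeps 32, so the temporal budget differs by a factor of $m$.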