Video Quality Assessment with Texture Information Fusion for Streaming Applications

Vignesh V Menon, Prajit T Rajendran, Reza Farahani, Klaus Schoeffmann, Christian Timmerer

TL;DR

The paper addresses the need for fast, perceptually aligned video quality assessment in streaming by proposing VQ-TIF, a reduced-reference VQA model that fuses DCT-energy-based texture features with SSIM through an LSTM to estimate VMAF. It extracts the texture features $E_Y$, $h$, and $L_Y$ from the luma channel and combines them with SSIM per frame to produce per-chunk estimates, which are averaged to obtain the segment quality. On UHD content, VQ-TIF achieves $PCC=0.96$ and $MAE=2.71$ relative to ground-truth VMAF, while delivering a $9.14\times$ speed-up and an $89.44\%$ reduction in energy consumption. Trained and evaluated on the Inter4K UHD dataset with SDR content, the method shows potential for real-time VQA in streaming and could be extended to HDR and higher resolutions in future work.
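To make the idea of a DCT-energy-based texture feature more concrete, the following is a minimal sketch, not the paper's definition. It assumes that the texture energy of a block is the unweighted sum of absolute AC coefficients of its 2-D DCT, averaged over non-overlapping blocks of the luma plane. The function names (`dct2`, `block_texture_energy`, `frame_texture_energy`), the block size `w=4`, and the absence of any frequency weighting are all illustrative assumptions; the paper's exact formulas for $E_Y$, $h$, and $L_Y$ are not reproduced here.

```python
import math


def dct2(block):
    """Naive orthonormal 2-D DCT-II of a square block given as a list of lists."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def block_texture_energy(block):
    """Sum of absolute AC coefficients (assumption: no frequency weighting)."""
    coeffs = dct2(block)
    n = len(coeffs)
    return sum(abs(coeffs[u][v])
               for u in range(n) for v in range(n)
               if (u, v) != (0, 0))  # skip the DC term, which carries brightness


def frame_texture_energy(luma, w=4):
    """Average block energy over non-overlapping w x w blocks of a luma plane."""
    rows, cols = len(luma), len(luma[0])
    energies = []
    for r in range(0, rows - w + 1, w):
        for c in range(0, cols - w + 1, w):
            block = [row[c:c + w] for row in luma[r:r + w]]
            energies.append(block_texture_energy(block))
    if not energies:
        raise ValueError("frame is smaller than one block")
    return sum(energies) / len(energies)


# Synthetic 8x8 luma planes, made up for illustration only.
flat = [[128] * 8 for _ in range(8)]
checker = [[255 if (x + y) % 2 else 0 for y in range(8)] for x in range(8)]

flat_energy = frame_texture_energy(flat)
checker_energy = frame_texture_energy(checker)
```

A flat frame has essentially zero AC energy while a checkerboard has a large one, which is the property that makes such a feature a cheap proxy for spatial complexity.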

Abstract

The rise in video streaming applications has increased the demand for video quality assessment (VQA). In 2016, Netflix introduced Video Multi-Method Assessment Fusion (VMAF), a full-reference VQA metric that strongly correlates with perceptual quality, but its computation is time-intensive. We propose a Discrete Cosine Transform (DCT)-energy-based VQA with texture information fusion (VQ-TIF) model for video streaming applications that determines the visual quality of the reconstructed video compared to the original video. VQ-TIF extracts Structural Similarity (SSIM) and spatiotemporal features of the frames from the original and reconstructed videos and fuses them using a long short-term memory (LSTM)-based model to estimate the visual quality. Experimental results show that VQ-TIF estimates the visual quality with a Pearson Correlation Coefficient (PCC) of 0.96 and a Mean Absolute Error (MAE) of 2.71, on average, compared to the ground truth VMAF scores. Additionally, VQ-TIF estimates the visual quality 9.14 times faster than the state-of-the-art VMAF implementation, along with an 89.44% reduction in energy consumption, assuming an Ultra HD (2160p) display resolution.
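The two accuracy figures quoted in the abstract, PCC and MAE, can be computed from paired predicted and ground-truth VMAF scores as sketched below. This is a plain-Python illustration of the standard definitions; the function names `mae` and `pcc` and the sample scores are hypothetical and are not taken from the paper's data.

```python
import math


def mae(pred, truth):
    """Mean absolute error between predicted and reference scores."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def pcc(pred, truth):
    """Pearson correlation coefficient between predicted and reference scores."""
    if len(pred) != len(truth) or len(truth) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(truth)
    mean_p = sum(pred) / n
    mean_t = sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(var_p * var_t)


# Made-up scores for illustration only (not results from the paper).
predicted_vmaf = [10.0, 20.0, 30.0]
reference_vmaf = [12.0, 18.0, 33.0]

error = mae(predicted_vmaf, reference_vmaf)
correlation = pcc(predicted_vmaf, reference_vmaf)
```

MAE is expressed in VMAF points, so it measures how far off an estimate is on average, whereas PCC only measures how linearly the estimates track the reference; reporting both, as the abstract does, covers the two failure modes.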

Paper Structure (23 sections, 3 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: The structure of state-of-the-art RR-VQA methods, as used in particular within streaming video coding systems.
  • Figure 2: Rate-distortion (RD) curves of selected segments of different spatiotemporal complexities: Beauty ($E_{\text{Y}}$=59.90, $h$=17.49, $L_{\text{Y}}$=89.25), Basketball ($E_{\text{Y}}$=15.30, $h$=12.59, $L_{\text{Y}}$=108.18), Characters ($E_{\text{Y}}$=45.42, $h$=36.88, $L_{\text{Y}}$=134.56), and Runners ($E_{\text{Y}}$=105.85, $h$=22.48, $L_{\text{Y}}$=126.60). The segments are downsampled to 30 fps and encoded with the x264 AVC encoder using the ultrafast preset and CRF rate control.
  • Figure 3: VQA for a video segment using the VQ-TIF model envisioned in this paper.
  • Figure 4: Prediction results.