AIS 2024 Challenge on Video Quality Assessment of User-Generated Content: Methods and Results

Marcos V. Conde, Saman Zadtootaghaj, Nabajeet Barman, Radu Timofte, Chenlong He, Qi Zheng, Ruoxi Zhu, Zhengzhong Tu, Haiqiang Wang, Xiangguang Chen, Wenhui Meng, Xiang Pan, Huiying Shi, Han Zhu, Xiaozhong Xu, Lei Sun, Zhenzhong Chen, Shan Liu, Zicheng Zhang, Haoning Wu, Yingjie Zhou, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Wei Sun, Yuqin Cao, Yanwei Jiang, Jun Jia, Zhichao Zhang, Zijian Chen, Weixia Zhang, Xiongkuo Min, Steve Göring, Zihao Qi, Chen Feng

TL;DR

The AIS 2024 UGC Video Quality Assessment Challenge targets blind, no-reference VQA for user-generated content by requiring methods to predict perceptual quality of 30-frame clips within a 1-second budget on modern GPUs, using the YouTube-UGC dataset with MOS annotations. The paper surveys diverse deep learning approaches, including multi-branch models (COVER), hybrid LMM-enabled strategies (TVQE, Q-Align), and memory-efficient spatial-temporal architectures (SimpleVQA+, BVQA variants), along with ensemble and ranking-based training techniques. Key findings show top methods achieving near 0.9 correlation with MOS while maintaining real-time or near-real-time inference, and demonstrate the value of multi-faceted feature representations (semantic, aesthetic, technical) and cross-modal guidance for UGC quality prediction. The work highlights practical implications for streaming platforms seeking scalable, accurate, and explainable VQA metrics for diverse UGC content, and positions LMM-based VQA and multi-branch fusion as promising directions for future real-world deployments.
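As a rough illustration of what "correlation with MOS" means in practice, the sketch below computes two coefficients commonly used in VQA evaluation: the Pearson linear correlation (PLCC) and the Spearman rank-order correlation (SROCC). The summary above does not name the exact metrics, so this choice is an assumption, and the `mos` and `pred` values are made-up placeholders, not challenge data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative placeholder values only (not from the challenge).
mos = [4.2, 3.1, 2.5, 4.8, 1.9]
pred = [4.0, 3.3, 2.0, 4.5, 2.2]
print(f"PLCC={pearson(mos, pred):.3f}  SROCC={spearman(mos, pred):.3f}")
```

PLCC rewards predictions that are linearly related to MOS, while SROCC only checks that videos are ranked in the same order, so the two can disagree when a model is monotonic but poorly calibrated.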

Abstract

This paper reviews the AIS 2024 Video Quality Assessment (VQA) Challenge, focused on User-Generated Content (UGC). The aim of this challenge is to gather deep learning-based methods capable of estimating the perceptual quality of UGC videos. The user-generated videos from the YouTube UGC Dataset span diverse content (sports, games, lyrics, anime, etc.), quality levels, and resolutions. The proposed methods must process 30 FHD frames in under 1 second. A total of 102 participants registered for the challenge, and 15 submitted code and models. The performance of the top-5 submissions is reviewed and provided here as a survey of diverse deep models for efficient video quality assessment of user-generated content.
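The runtime constraint (30 frames in under 1 second) could be checked with a small timing harness along the following lines. This is a minimal sketch, not the organizers' evaluation script: `meets_budget`, `dummy_predict`, and the placeholder frames are hypothetical names introduced here, and a real GPU model would also need device synchronization before each clock read.

```python
import time


def meets_budget(predict, frames, budget_s=1.0, warmup=1, runs=3):
    """Time predict(frames) and report whether the best run fits the budget.

    Returns (best_elapsed_seconds, within_budget).
    """
    for _ in range(warmup):
        predict(frames)  # warm-up runs are excluded from timing
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        predict(frames)
        best = min(best, time.perf_counter() - start)
    return best, best < budget_s


def dummy_predict(frames):
    """Hypothetical stand-in for a VQA model: returns a fake quality score."""
    return sum(len(frame) for frame in frames) / len(frames)


# 30 tiny placeholder "frames"; real FHD frames would be 1920x1080 images.
frames = [bytes(16) for _ in range(30)]
elapsed, ok = meets_budget(dummy_predict, frames)
print(f"best run: {elapsed:.6f} s, within 1 s budget: {ok}")
```

Taking the best of several runs after a warm-up reduces the influence of one-off costs such as lazy initialization; reporting the mean or median instead is an equally reasonable choice.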

Paper Structure (30 sections, 4 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Samples from the videos in the YT-UGC Dataset [wang2019youtube].
  • Figure 2: The architecture of our proposed COmprehensive Video quality EvaluatoR (COVER). COVER processes a video clip in three parallel branches: 1) a semantic branch that extracts high-level, object-related semantic information using a pre-trained CLIP image encoder; 2) an aesthetic branch that runs a ConvNet on subsampled image thumbnails to analyze their appearance; 3) a technical branch that runs a Swin Transformer on fragments. We also devise a simplified cross-gating block (SCGB) to fuse the multi-branch features, yielding the final quality score.
  • Figure 3: The architecture of the TVQE method.
  • Figure 4: The framework of Q-Align [wu2023q], where we feed quality question-answer pairs to train LMMs and obtain the 5-level quality probabilities during the inference stage.
  • Figure 5: The framework of SimpleVQA+ [sun2022deep, sun2023analysis] proposed by Team SJTU MMLab.
  • ...and 2 more figures
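
The Figure 4 caption mentions 5-level quality probabilities obtained at inference time. One plausible way to turn such probabilities into a single score is a probability-weighted average over the levels, sketched below. The level names, the 1 to 5 weights, and the example logits are assumptions made for illustration; the caption does not specify them.

```python
from math import exp

# Assumed level names, ordered from worst to best (not given in the caption).
LEVELS = ("bad", "poor", "fair", "good", "excellent")


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, weights=(1, 2, 3, 4, 5)):
    """Probability-weighted average of the level weights (assumed 1..5)."""
    return sum(p * w for p, w in zip(probs, weights))


# Hypothetical logits for the five level tokens of one video.
logits = [-1.0, 0.0, 1.5, 2.0, 0.5]
probs = softmax(logits)
for level, p in zip(LEVELS, probs):
    print(f"{level:>9}: {p:.3f}")
print(f"score: {expected_score(probs):.3f}")
```

A weighted average keeps the output continuous, so two videos whose most likely level is the same can still receive different scores.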