FineVQ: Fine-Grained User Generated Content Video Quality Assessment

Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, Guangtao Zhai

TL;DR

This work addresses the need for fine-grained assessment of user-generated content videos by introducing FineVD, a large-scale dataset with multi-dimensional quality labels, and FineVQ, a one-for-all VQA framework built on large multimodal models. FineVQ leverages an image encoder, a motion encoder, and a large language model, augmented with instruction tuning and LoRA adaptation to produce quality rating, scoring, and attribution across multiple dimensions. Extensive experiments show state-of-the-art performance on FineVD and several UGC-VQA benchmarks, along with strong cross-dataset generalization and meaningful ablation insights into motion features and parameter-efficient fine-tuning. The work offers a practical platform for improved video processing and recommendation in UGC ecosystems by providing rich, actionable quality annotations and a capable, adaptable VQA system.

Abstract

The rapid growth of user-generated content (UGC) videos has produced an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally give only an overall rating for a UGC video and lack the fine-grained labels needed by video processing and recommendation applications. To address these challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. Extensive experimental results demonstrate that our proposed FineVQ can produce fine-grained video-quality results and achieve state-of-the-art performance on FineVD and other commonly used UGC-VQA datasets.
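
To make "fine-grained quality scores across multiple dimensions" concrete, the following is a minimal sketch of how per-annotator ratings for one video might be averaged into one mean opinion score (MOS) per dimension. The dimension names are taken from the Figure 3 caption; the function name `mean_opinion_scores`, the rating values, and the use of a plain arithmetic mean are assumptions for illustration, and the paper's actual score processing may differ.

```python
from statistics import mean

# Dimension names follow the Figure 3 caption; everything else is illustrative.
DIMENSIONS = ("color", "noise", "artifact", "blur", "temporal", "overall")


def mean_opinion_scores(ratings):
    """Average per-annotator ratings into one score per dimension.

    `ratings` is a list of dicts, one per annotator, each mapping every
    dimension name to a numeric rating. A plain mean is an assumption here.
    """
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}


# Made-up ratings from two hypothetical annotators for a single video.
ratings = [
    {"color": 4, "noise": 3, "artifact": 2, "blur": 3, "temporal": 4, "overall": 3},
    {"color": 5, "noise": 3, "artifact": 3, "blur": 2, "temporal": 4, "overall": 3},
]

mos = mean_opinion_scores(ratings)
for dim in DIMENSIONS:
    print(f"{dim}: {mos[dim]}")
```

A record like `mos` carries one score per degradation type, which is what lets downstream processing or recommendation act on, say, blur separately from color.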

Paper Structure (51 sections, 8 equations, 10 figures, 10 tables)


Figures (10)

  • Figure 1: We present the fine-grained video quality assessment database and model, termed FineVD and FineVQ, respectively. UGC videos have diverse content but suffer from various degradation issues, as shown in (a) and (b), so it is important to provide fine-grained quality labels for subsequent video processing and recommendation tasks in addition to an overall quality score. To tackle these challenges, we construct FineVD, which includes fine-grained quality annotations for the UGC videos, as shown in (c), and propose FineVQ, which has the capabilities of quality rating, quality scoring, and quality attribution, as demonstrated in (d).
  • Figure 2: An overview of the content and construction process of FineVD. (a) Example videos from our database, which contains both common UGC videos and short-form UGC videos. (b) The illustration of subjective data annotation methods, including both quality scoring and quality attribute labeling processes. (c) The quality-related question-answering pairs generated by GPT-4 and revised by human annotators.
  • Figure 3: The MOS distribution of FineVD in terms of different perspectives, i.e., color, noise, artifact, blur, temporal, and overall.
  • Figure 4: The MOS distribution of "overall score" in terms of different video contents. (a) MOS distribution for on-demand videos. (b) MOS distribution for live-streaming videos.
  • Figure 5: An overview of our proposed FineVQ model. Our model consists of three feature encoders, including an image feature extractor for extracting spatial features from sparse video frames, a motion feature extractor for extracting motion features from the entire video, and a text encoder for extracting aligned text features from prompts. The extracted features are then aligned through projectors and fed into a pre-trained LLM to generate the output results. LoRA weights are introduced to the pre-trained image encoder and the large language model to adapt the models to the quality assessment task.
  • ...and 5 more figures
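
The data flow described in the Figure 5 caption can be sketched in a few lines of dependency-free Python. This is a toy illustration under stated assumptions, not the authors' implementation: random matrices stand in for the outputs of the real encoders, a single linear layer stands in for the pre-trained LLM, and all sizes, names (`lora_linear`, `img_proj`, `mot_proj`), the token ordering, and the `alpha` value are hypothetical. The LoRA part follows the standard formulation, in which a frozen weight is augmented by a scaled low-rank product whose second factor starts at zero.

```python
import random

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Small random matrix standing in for learned weights or encoder outputs."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lora_linear(x, weight, lora_down, lora_up, alpha):
    """Compute x @ W + (alpha / r) * (x @ down) @ up with W kept frozen."""
    rank = len(lora_up)
    base = matmul(x, weight)
    update = matmul(matmul(x, lora_down), lora_up)
    s = alpha / rank
    return [[b + s * u for b, u in zip(rb, ru)] for rb, ru in zip(base, update)]


# Hypothetical toy sizes: feature widths for each encoder and the LLM.
D_IMG, D_MOT, D_LLM, RANK = 8, 6, 4, 2

image_feats = rand_matrix(3, D_IMG)   # spatial features from 3 sparse frames
motion_feats = rand_matrix(1, D_MOT)  # one motion feature for the whole video
text_tokens = rand_matrix(2, D_LLM)   # prompt features, already LLM-sized

# Projectors align image and motion features with the LLM token width.
img_proj = rand_matrix(D_IMG, D_LLM)
mot_proj = rand_matrix(D_MOT, D_LLM)

# Concatenate along the token axis (the ordering here is an assumption).
sequence = matmul(image_feats, img_proj) + matmul(motion_feats, mot_proj) + text_tokens

# One LoRA-adapted linear layer as a stand-in for a layer of the LLM.
weight = rand_matrix(D_LLM, D_LLM)    # frozen pre-trained weight
lora_down = rand_matrix(D_LLM, RANK)  # trainable, random initialisation
lora_up = zeros(RANK, D_LLM)          # trainable, zero initialisation

out = lora_linear(sequence, weight, lora_down, lora_up, alpha=4.0)
base_out = matmul(sequence, weight)
print(len(sequence), "tokens of width", len(sequence[0]))
print("matches frozen layer at initialisation:", out == base_out)
```

Because the up-projection starts at zero, the adapted layer initially reproduces the frozen layer exactly, so fine-tuning begins from the pre-trained behaviour and only the two small low-rank matrices need to be trained.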