FVQ: A Large-Scale Dataset and an LMM-based Method for Face Video Quality Assessment

Sijing Wu, Yunhao Li, Ziwen Xu, Yixuan Gao, Huiyu Duan, Wei Sun, Guangtao Zhai

TL;DR

This work tackles face video quality assessment (FVQA) by introducing FVQ-20K, the first large-scale in-the-wild FVQA dataset with 20,000 face videos and MOS annotations, collected from TikTok and YouTube to capture diverse distortions and attributes. It also presents FVQ-Rater, a novel large multimodal model-based FVQA method that fuses spatial, temporal, portrait, and face-embedding cues within an LLM and uses LoRA-based instruction tuning to predict MOS and quality levels without reference data. The paper demonstrates that FVQ-Rater surpasses state-of-the-art VQA, FIQA, and LMM baselines on FVQ-20K and CFVQA, and exhibits strong cross-dataset generalization, underscoring the potential of LMMs for FVQA. These contributions provide a valuable resource and a scalable methodology for objective FVQA aligned with human perceptual quality, with practical impact for social platforms and face-centric video restoration and analysis.

Abstract

Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and the human visual system (HVS) is particularly sensitive to human faces. However, FVQA remains rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with their mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring of face videos, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we extract multi-dimensional features, including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings), to provide comprehensive visual information, and use LoRA-based instruction tuning for quality-specific fine-tuning. FVQ-Rater achieves superior performance on both the FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and the FVQ-Rater method in promoting the development of FVQA.
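
For readers unfamiliar with MOS annotations, the following is a minimal sketch of how a mean opinion score is commonly computed from per-subject ratings and mapped to a coarse quality level. The 1-to-5 rating scale, the level names, and the equal-width binning are assumptions made for illustration; this excerpt does not state how FVQ-20K derives its levels.

```python
from statistics import mean

# Hypothetical level names: the figure list mentions five quality levels,
# but this excerpt does not name them.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def mean_opinion_score(ratings):
    """Average the per-subject ratings collected for one video."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(ratings)


def quality_level(mos, lo=1.0, hi=5.0):
    """Map a MOS in [lo, hi] to one of five equal-width bins (assumed scheme)."""
    if not lo <= mos <= hi:
        raise ValueError("MOS lies outside the assumed rating scale")
    width = (hi - lo) / len(LEVELS)
    # Clamp so that mos == hi falls into the top bin instead of overflowing.
    index = min(int((mos - lo) / width), len(LEVELS) - 1)
    return LEVELS[index]


ratings = [4, 5, 3, 4, 4]  # made-up ratings from five hypothetical subjects
mos = mean_opinion_score(ratings)
print(mos, quality_level(mos))
```

Real subjective studies usually also screen out unreliable subjects and normalize scores before averaging; those steps are omitted here.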

Paper Structure

This paper contains 24 sections, 13 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Sample video frames and corresponding MOSs from five quality levels of the proposed FVQ-20K dataset. We encourage readers to zoom in for details.
  • Figure 2: The bar charts of the resolution distribution (Left) and FPS distribution (Right) of the face videos in FVQ-20K.
  • Figure 3: The MOS distributions in terms of different video source categories: (a) two video platforms, (b) eleven categories of TikTok videos, and (c) nine categories of YouTube videos.
  • Figure 4: The MOS distributions in terms of different face attributes: (a) gender, (b) race, (c) age, and (d) emotion.
  • Figure 5: Overview of our FVQ-Rater method. Given a face video and a text prompt, FVQ-Rater first extracts multi-dimensional features through the vision encoder, face encoder, and temporal encoder, and then projects the features into the LLM input space through four different projectors. The feature embeddings are then combined with text embeddings and fed into the pre-trained LLM for further processing. The output features of the LLM are either decoded to text output or partially fed into the quality regression module to predict the quality score. FVQ-Rater is trained in two stages: quality-aware pre-training using quality levels described in textual form, and MOS-oriented LoRA fine-tuning using quality scores.
  • ...and 7 more figures
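
The data flow described in the Figure 5 caption (four feature streams, one projector each, concatenation with text embeddings, then a regression head) can be sketched roughly as below. This is a toy illustration, not the authors' implementation: the dimensions, the random weights, the bias-free linear projectors, the identity stand-in for the LLM, and the mean-pooling regression head are all assumptions, and every name is hypothetical.

```python
import random

LLM_DIM = 8  # toy hidden size (assumption)
# Toy per-stream feature sizes (assumptions); the four streams follow the TL;DR.
FEATURE_DIMS = {"spatial": 6, "temporal": 4, "portrait": 6, "face_embedding": 5}


def random_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(tokens, weight):
    """Project each token vector with an (in_dim x out_dim) weight matrix."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns]
            for tok in tokens]


def build_llm_input(features, text_embeddings, projectors):
    """Project every feature stream to LLM_DIM and append the text tokens."""
    sequence = []
    for name in FEATURE_DIMS:
        sequence.extend(linear(features[name], projectors[name]))
    sequence.extend(text_embeddings)
    return sequence


def regress_quality(hidden_states, head):
    """Toy head: mean-pool the hidden states, then take a dot product."""
    pooled = [sum(col) / len(hidden_states) for col in zip(*hidden_states)]
    return sum(p * h for p, h in zip(pooled, head))


rng = random.Random(0)
projectors = {n: random_matrix(d, LLM_DIM, rng) for n, d in FEATURE_DIMS.items()}
features = {n: random_matrix(3, d, rng) for n, d in FEATURE_DIMS.items()}  # 3 tokens each
text = random_matrix(2, LLM_DIM, rng)  # 2 text tokens

llm_input = build_llm_input(features, text, projectors)
hidden = llm_input  # identity stand-in for the pre-trained LLM
head = [rng.uniform(-1.0, 1.0) for _ in range(LLM_DIM)]
score = regress_quality(hidden, head)
print(len(llm_input), len(llm_input[0]))
```

In the described method only the adapter weights would be updated during LoRA fine-tuning while the pre-trained LLM stays frozen; the sketch leaves training out entirely.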