Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing

Renyu Yang, Jian Jin, Lili Meng, Meiqin Liu, Yilin Wang, Balu Adsumilli, Weisi Lin

TL;DR

This work designs a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. It also extends the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content.

Abstract

Audio-visual quality assessment (AVQA) research has been stalled by the limitations of existing datasets: they are typically small in scale, lack diversity in content and quality, and are annotated only with overall scores. As a result, they offer limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, we employ a systematic data preparation strategy to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ

Paper Structure (21 sections, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Distributions of mean opinion scores (MOS) in our AVQA dataset. From left to right: overall audio-visual quality (AVQA), video-only quality (AV_VQA), and audio-only quality (AV_AQA). (Mean and standard deviation: AVQA: $\mu=3.47$, $\sigma=0.72$; AV_VQA: $\mu=3.49$, $\sigma=0.77$; AV_AQA: $\mu=3.44$, $\sigma=0.64$)
  • Figure 2: Distribution of average SROCC values across submissions in different crowdsourcing stages. Submission quality improves progressively from pretest to formal evaluation, validating the effectiveness of our multi-stage filtering pipeline.
  • Figure 3: Scatter plots of MOS comparisons: (a) AVQA vs AV_VQA, (b) AVQA vs AV_AQA, and (c) AV_VQA vs AV_AQA. All three pairs show high correlations, revealing strong monotonic relationships among the score types.
  • Figure 4: Screenshot of the Environment Preparation Confirmation Page
  • Figure 5: Screenshot of the A/V Sequence Rating Interface
  • ...and 7 more figures
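
The captions above rely on two standard quantities: the mean opinion score (MOS) of a sequence and the Spearman rank-order correlation coefficient (SROCC) between two sets of scores. The following is a minimal sketch of how both could be computed, not the authors' pipeline. The function names and the toy ratings are hypothetical, and how the paper aggregates SROCC per submission is not specified in this listing.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mos(ratings_by_sequence):
    """Mean opinion score per sequence from a dict of raw ratings."""
    return {seq: mean(r) for seq, r in ratings_by_sequence.items()}


# Hypothetical toy ratings on a 1-5 scale (not taken from YT-NTU-AVQ).
overall = mos({"seq_a": [4, 5, 4], "seq_b": [2, 3, 2], "seq_c": [3, 3, 4]})
video_only = mos({"seq_a": [5, 4, 4], "seq_b": [2, 2, 3], "seq_c": [3, 4, 4]})

keys = sorted(overall)
rho = srocc([overall[k] for k in keys], [video_only[k] for k in keys])
print(f"SROCC between overall and video-only MOS: {rho:.3f}")
```

Because SROCC depends only on rank order, it measures the monotonic agreement that the scatter plots in Figure 3 describe, regardless of whether the relationship between score types is linear.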