Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline

Qizhi Xie, Kun Yuan, Yunpeng Qu, Ming Sun, Chao Zhou, Jihong Zhu

Abstract

Accurately estimating humans' subjective feedback on video fluency, e.g., motion consistency and frame continuity, is crucial for applications such as streaming and gaming. Yet it has long been overlooked: prior works have addressed it only within the video quality assessment (VQA) task, merely as a sub-dimension of overall quality. In this work, we conduct pilot experiments and reveal that current VQA predictions largely underrepresent fluency, thereby limiting their applicability. Motivated by this, we pioneer Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension. To advance VFA research, 1) we construct a fluency-oriented dataset, FluVid, comprising 4,606 in-the-wild videos with a balanced fluency distribution and featuring the first-ever scoring criteria and human study for VFA. 2) We develop a large-scale benchmark of 23 methods on FluVid, the most comprehensive to date, gathering insights for VFA-tailored model designs. 3) We propose a baseline model, FluNet, which deploys temporal permuted self-attention (T-PSA) to enrich input fluency information and enhance long-range inter-frame interactions. Our work not only achieves state-of-the-art performance but, more importantly, offers the community a roadmap for exploring solutions to VFA.
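
The abstract names temporal permuted self-attention (T-PSA) as FluNet's core mechanism for modeling long-range inter-frame interactions. As a concrete illustration only, below is a minimal, hypothetical PyTorch sketch of what such a block could look like: it assumes a per-frame feature tensor of shape (batch, frames, channels, height, width), pools the spatial dimensions, and runs multi-head self-attention across the resulting frame tokens. The module name, tensor layout, and pooling choice are assumptions for illustration; the actual FluNet design may differ.

    import torch
    import torch.nn as nn

    class TemporalPermutedSelfAttention(nn.Module):
        """Hypothetical sketch of a temporal permuted self-attention (T-PSA) block.

        Assumption: the spatial dimensions of a (B, T, C, H, W) feature tensor are
        compressed by average pooling, and self-attention then runs across the
        T frame tokens. This is an illustration, not the authors' implementation.
        """

        def __init__(self, channels: int, num_heads: int = 4):
            super().__init__()
            self.norm = nn.LayerNorm(channels)
            self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, T, C, H, W) -> pool spatial dims -> frame tokens of shape (B, T, C)
            tokens = self.norm(x.mean(dim=(-2, -1)))
            # Self-attention across frame tokens models long-range inter-frame interactions.
            out, _ = self.attn(tokens, tokens, tokens)
            # Residual connection, broadcast back over the spatial positions.
            return x + out.unsqueeze(-1).unsqueeze(-1)

    if __name__ == "__main__":
        feats = torch.randn(2, 16, 64, 7, 7)   # (batch, frames, channels, H, W)
        block = TemporalPermutedSelfAttention(64)
        print(block(feats).shape)              # torch.Size([2, 16, 64, 7, 7])

Other readings of "permuted" are possible, e.g., moving spatial patches into the batch dimension so that attention runs per patch along time; without the paper's implementation details, the pooling-based version above is only one plausible interpretation.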

Figures (6)

  • Figure 1: Motivation of video fluency assessment: The prevailing VQA paradigm is highly sensitive to spatial distortions, whereas VFA concentrates exclusively on underrepresented temporal factors (e.g., camera motion, playback stuttering).
  • Figure 1: Annotation GUI of human study.
  • Figure 2: Construction pipeline of the FluVid dataset. First, FluVid selects raw videos based on two guiding principles (a, b) to prevent selection bias and mitigate long-tail distribution effects. Second, leveraging the raw videos, we incorporate semantic information to remove duplicates, segment the content to ensure a balanced distribution, and finally filter out segments exhibiting severe spatial distortions (c). Subsequently, we engage 20 visual assessment experts to evaluate video fluency in a controlled laboratory setting (d), resulting in a dataset balanced across both content categories and fluency levels.
  • Figure 3: Statistical distribution of the FluVid dataset in terms of fluency, resolution, and duration.
  • Figure 4: Framework of our proposed FluNet. By incorporating the T-PSA module for dimensional compression, FluNet achieves enhanced temporal perception capabilities.
  • ...and 1 more figure