Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs

Zicheng Zhang, Ziheng Jia, Haoning Wu, Chunyi Li, Zijian Chen, Yingjie Zhou, Wei Sun, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai

TL;DR

Q-Bench-Video tackles the neglected problem of evaluating video quality understanding in Large Multi-modal Models (LMMs). The authors design a diverse benchmark with videos from natural scenes, AIGC, and CG, multiple question types, and a video pair comparison task, along with an expanded distortion taxonomy that includes AIGC distortions. They annotate 2,378 question-answer pairs over 1,800 videos and evaluate 17 LMMs (12 open-source and 5 proprietary), finding that while models show basic video quality perception, they lag behind human performance, especially on open-ended and AIGC distortion questions. The benchmark provides a framework to drive progress in video quality understanding for LMMs and has implications for video compression, generation, and perception tasks.

Abstract

With the rising interest in research on Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities, neglecting the systematic exploration into video quality understanding. To address this oversight, we introduce Q-Bench-Video in this paper, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice questions format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate the video pair quality comparison question to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we have expanded our evaluation aspects to include the dimension of AIGC distortions, which addresses the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source & 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of LMMs to close the gap in video quality understanding.
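To make the benchmark layout described above concrete, the following is a minimal sketch of how per-category accuracy could be tallied for the multiple-choice questions. The `QAItem` record, its field names, and the `accuracy_by` helper are hypothetical and introduced only for illustration; the abstract does not specify the benchmark's file format or scoring code, and Open-ended answers would need a separate scoring procedure that exact matching does not cover.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAItem:
    """Hypothetical record for one multiple-choice question (not the official schema)."""
    question_type: str  # e.g. "yes-or-no" or "what-how"
    concern: str        # e.g. "technical", "aesthetic", "temporal", "aigc"
    answer: str         # correct option letter
    prediction: str     # option letter chosen by the model


def accuracy_by(items, key):
    """Group items by the attribute named `key` and return accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for item in items:
        group = getattr(item, key)
        totals[group] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if item.prediction.strip().upper() == item.answer.strip().upper():
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Toy data, invented purely to exercise the helper.
items = [
    QAItem("yes-or-no", "technical", "A", "A"),
    QAItem("yes-or-no", "temporal", "B", "A"),
    QAItem("what-how", "aigc", "C", "c "),
    QAItem("what-how", "aesthetic", "D", "D"),
]

by_type = accuracy_by(items, "question_type")
by_concern = accuracy_by(items, "concern")
```

Grouping by an attribute name keeps the same helper usable for question types, quality concerns, or any other field (such as video source) that a record might carry.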

Paper Structure (40 sections, 8 equations, 8 figures, 7 tables)

This paper contains 40 sections, 8 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: The construction overview of the proposed Q-Bench-Video. To ensure diversity in video content, we collect natural scenes, AIGC, and CG videos from video quality assessment datasets as depicted in (a). To achieve a balanced quality distribution among the sampled videos, we employ uniform sampling for quality control. As indicated in (c-1) and (c-2), we utilize three types of questions (Yes-or-No, What-How, Open-ended) and address a comprehensive range of quality concerns including Technical, Aesthetic, Temporal, and AIGC distortions. Additionally, we incorporate the video pairs comparison task to enhance the comprehensiveness of the benchmark.
  • Figure 2: The visualization samples from Q-Bench-Video, with the question-answer content most representative of each subcategory being underlined. It is important to note that, regarding quality concerns, a single question-answer annotation may not only focus on one distortion dimension. Therefore, the distortion visualization examples shown in (b) primarily highlight instances that are most closely aligned with the mentioned distortion types.
  • Figure 3: Illustration of the annotation GUIs for Q-Bench-Video. (a) shows the interface for annotating single videos, where the annotator can select the question type and play the videos using the Video Play button. The annotator can also switch to the next and previous annotation with the Next and Previous buttons. (b) presents the interface for annotating video pairs. When the annotator presses the Video Play button, the video pairs are played sequentially, with a five-second gray screen serving as an interval between the two videos.
  • Figure 4: A concise summary of the LMMs' performance on Q-Bench-Video. (a) provides a comparison detailing the overall performance of humans and 17 selected LMMs, including both proprietary and open-source models. (b) illustrates a radar chart that outlines the performance of the top-2 proprietary LMMs (GPT-4o & Gemini 1.5 Pro) and top-2 open-source LMMs (mPLUG-Owl3 & LLaVA-OneVision) across various subcategories within Q-Bench-Video.
  • Figure 5: Illustration of the annotation GUIs for Q-Bench-Video. (a) shows the interface for annotating single videos, where the annotator can select the question type and play the videos using the Video Play button. The annotator can also switch to the next and previous annotation with the Next and Previous buttons. (b) presents the interface for annotating video pairs. When the annotator presses the Video Play button, the video pairs are played sequentially, with a five-second gray screen serving as an interval between the two videos.
  • ...and 3 more figures