CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, Limin Wang

TL;DR

CG-Bench introduces a large-scale, clue-grounded benchmark for long-video question answering to address the credibility gaps in MCQ-based evaluations. It combines a richly annotated long-video dataset (1,219 videos, 12,129 QAC triplets) with clue-grounding (white-box and black-box) and clue-aided open-ended QA evaluation, enabling robust assessment of whether models ground answers in relevant video clues. Experimental results show current MLLMs struggle with long-video understanding and grounding, with significant drops when credibility constraints are applied, highlighting substantial room for improvement. The benchmark and evaluation toolkit aim to drive the development of more trustworthy, capable multimodal LLMs for long-context video tasks.

Abstract

Most existing video understanding benchmarks for multimodal large language models (MLLMs) focus only on short videos. The few benchmarks that target long video understanding often rely solely on multiple-choice questions (MCQs). However, because of the inherent limitations of MCQ-based evaluation and the increasing reasoning ability of MLLMs, models can give the correct answer purely by combining short video understanding with elimination, without genuinely understanding the video content. To address this gap, we introduce CG-Bench, a novel benchmark designed for clue-grounded question answering in long videos. CG-Bench emphasizes the model's ability to retrieve relevant clues for questions, enhancing evaluation credibility. It features 1,219 manually curated videos categorized by a granular system with 14 primary categories, 171 secondary categories, and 638 tertiary categories, making it the largest benchmark for long video analysis. The benchmark includes 12,129 QA pairs in three major question types: perception, reasoning, and hallucination. To compensate for the drawbacks of pure MCQ-based evaluation, we design two novel clue-based evaluation methods, clue-grounded white-box and black-box evaluations, to assess whether the model generates answers based on a correct understanding of the video. We evaluate multiple closed-source and open-source MLLMs on CG-Bench. Results indicate that current models significantly underperform in understanding long videos compared to short ones, and a significant gap exists between open-source and commercial models. We hope CG-Bench can advance the development of more trustworthy and capable MLLMs for long video understanding. All annotations and video data are released at https://cg-bench.github.io/leaderboard/.
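
The abstract does not define how a retrieved clue is scored against the annotated one. As an illustration only, a common way to compare a predicted set of time intervals with annotated clue intervals is temporal intersection-over-union (IoU). The sketch below assumes clues are given as (start, end) pairs in seconds; the function names and the interval format are hypothetical and are not taken from the CG-Bench toolkit.

```python
from typing import List, Tuple

# Assumption: a clue is a (start_sec, end_sec) pair; this format is illustrative.
Interval = Tuple[float, float]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> float:
    """Total duration covered by a list of non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def intersection_length(a: List[Interval], b: List[Interval]) -> float:
    """Overlap duration between two merged, sorted interval lists."""
    i = j = 0
    overlap = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            overlap += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap


def temporal_iou(pred: List[Interval], gold: List[Interval]) -> float:
    """Temporal IoU between predicted and annotated clue intervals."""
    p, g = merge_intervals(pred), merge_intervals(gold)
    inter = intersection_length(p, g)
    union = total_length(p) + total_length(g) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # A prediction of 10-20 s against an annotated clue of 15-25 s.
    print(temporal_iou([(10.0, 20.0)], [(15.0, 25.0)]))
```

Under this sketch, a model that selects the right option but points at an unrelated part of the video would receive a low grounding score, which is the kind of credibility check the abstract motivates.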

Paper Structure

This paper contains 16 sections, 2 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Left: examples of CG-Bench's clue-grounded annotation. To correctly answer the questions, models need to ground their reasoning into the correct clue. Right: CG-Bench provides an evaluation suite with two novel credibility evaluation criteria while supporting both MCQ and open-ended evaluations.
  • Figure 2: Distribution of video root categories, displaying the number of videos within each category.
  • Figure 3: Distribution of question root types, illustrating the frequency of different question types.
  • Figure 4: Video duration distribution, showing the number of videos for different duration intervals.
  • Figure 5: Clue time coverage, illustrating the frequency of clues across different time bins.
  • ...and 4 more figures