CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie

TL;DR

CAPability tackles the mismatch between modern multimodal captioning capabilities and outdated benchmarks by introducing a holistic, multi-view visual-caption benchmark for images and videos. It combines 12 dimensions across 6 views, a dual evaluation of correctness and thoroughness via precision and hit, and a QA-based $K\bar{T}$ metric to quantify gaps between QA understanding and captioning. The authors collect ~11K annotated samples, provide CAPability-QA, and show distinct model strengths and weaknesses across dimensions, such as counting, camera-angle, and action understanding, highlighting concrete avenues for improvement. By open-sourcing the data and evaluation protocol, CAPability aims to drive the development of more accurate and comprehensive captioning systems for real-world multimodal understanding.

Abstract

Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete visual element coverage. In this paper, we introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions. CAPability stably assesses both the correctness and thoroughness of captions with \textit{precision} and \textit{hit} metrics. By converting annotations to QA pairs, we further introduce a heuristic metric, \textit{know but cannot tell} ($K\bar{T}$), indicating a significant performance gap between QA and caption capabilities. Our work provides a holistic analysis of MLLMs' captioning abilities, as we identify their strengths and weaknesses across various dimensions, guiding future research to enhance specific aspects of their capabilities.
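The abstract names the precision, hit, and $K\bar{T}$ metrics without giving formulas, so the following is only a minimal sketch of how such metrics might be computed. It assumes that each annotated visual element is judged against a caption as "correct", "incorrect", or "miss" (not mentioned), that precision is the share of correct judgments among mentioned elements, that hit is the share of correct judgments among all elements, and that $K\bar{T}$ is the share of QA-correct samples whose caption does not state the element correctly. The label names and both function names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def precision_and_hit(judgments):
    """Return (precision, hit) for a list of per-element judgments.

    Assumed labels (hypothetical): "correct", "incorrect", "miss".
    precision = correct / (correct + incorrect)  -> correctness
    hit       = correct / all annotated elements -> thoroughness
    """
    counts = Counter(judgments)
    correct = counts["correct"]
    mentioned = correct + counts["incorrect"]
    total = len(judgments)
    precision = correct / mentioned if mentioned else 0.0
    hit = correct / total if total else 0.0
    return precision, hit


def know_but_cannot_tell(qa_correct, caption_correct):
    """Assumed K-bar-T: among samples the model answers correctly in QA form,
    the fraction whose caption does not state the element correctly."""
    known = [cap for qa, cap in zip(qa_correct, caption_correct) if qa]
    if not known:
        return 0.0
    return sum(1 for cap in known if not cap) / len(known)


if __name__ == "__main__":
    p, h = precision_and_hit(["correct", "correct", "incorrect", "miss"])
    print(f"precision={p:.3f} hit={h:.3f}")
    kbt = know_but_cannot_tell([True, True, False, True],
                               [True, False, False, False])
    print(f"KbarT={kbt:.3f}")
```

Under these assumptions, a model that describes few elements but describes them accurately would score high on precision and low on hit, which is the kind of correctness versus thoroughness split the benchmark is meant to expose.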

Paper Structure

This paper contains 27 sections, 5 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: An example of the image captioning (left) and video captioning (right) tasks. By analyzing the components of captions, we identify 12 dimensions (9 static dimensions and 4 dynamic dimensions, with object number shared by both groups), all of which contribute to a detailed and comprehensive caption. The static dimensions apply to both images and videos. Video data has additional dynamic dimensions that must be judged using temporal relations.
  • Figure 2: The development of visual caption benchmarks. Many works compare the ground-truth with generated sentences, which is vague. CompreCap uses a scene graph to evaluate only object-related information. Our CAPability considers multiple views with a comprehensive evaluation.
  • Figure 3: Precision and hit comparison of SOTA MLLMs on our CAPability. Models perform more variably on the hit metric, which evaluates thoroughness. GPT-4o performs the best on precision, and Gemini-1.5-pro performs the best on hit.
  • Figure 4: The pipeline of our data annotation for each dimension.
  • Figure 5: The annotation distribution of each dimension. We compute statistics differently for different types of dimensions. For object category, character identification, and action, we count frequencies, since most descriptions appear only once. For spatial relation, we summarize 4 categories and count their numbers. For style, camera angle, and camera movement, we count the samples of each category. For the others, we plot bar charts showing the counts of the most frequent samples.
  • ...and 13 more figures