CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu

TL;DR

CaptionQA introduces a utility-driven benchmark that evaluates how well image captions preserve downstream utility across four real-world domains. By pairing domain-specific taxonomies with a deterministic QA-on-caption protocol and dense multiple-choice questions, it directly measures the information a caption preserves for downstream LLM reasoning. The study reveals substantial gaps between image utility and caption utility, with gaps varying by domain and model type, and shows that longer or more complex prompts often do not improve, and can even hurt, caption usefulness. The authors provide an open-source, extensible pipeline for domain expansion and offer practical guidance on caption design and evaluation for industry and research applications. Overall, CaptionQA offers a principled, scalable framework for aligning captioning systems with real-world downstream tasks and decision-making needs.

Abstract

Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering four domains (Natural, Document, E-commerce, and Embodied AI), each with fine-grained taxonomies (25 top-level categories and 69 subcategories in total) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the utility of an image and that of its caption. Notably, models that are nearly identical on traditional image-QA benchmarks can drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
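
The QA-on-caption protocol described in the abstract can be illustrated with a minimal sketch. This is not the released CaptionQA code: the names `MCQuestion`, `caption_accuracy`, and `keyword_answerer` are hypothetical, and the toy keyword matcher only stands in for the text-only LLM that would answer the questions in practice. Returning `-1` when the caption gives no evidence is also an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQuestion:
    """One multiple-choice question about an image (hypothetical schema)."""
    question: str
    choices: Sequence[str]
    answer_index: int  # index of the correct choice


# An answerer sees only the caption text, never the image.
Answerer = Callable[[str, MCQuestion], int]


def caption_accuracy(caption: str,
                     questions: Sequence[MCQuestion],
                     answer_fn: Answerer) -> float:
    """Fraction of questions answered correctly from the caption alone."""
    if not questions:
        raise ValueError("at least one question is required")
    correct = sum(1 for q in questions
                  if answer_fn(caption, q) == q.answer_index)
    return correct / len(questions)


def keyword_answerer(caption: str, q: MCQuestion) -> int:
    """Toy stand-in for a text-only LLM (assumption, not the paper's model).

    Picks the first choice whose text appears in the caption and returns -1
    when the caption offers no evidence, so the question counts as missed.
    """
    text = caption.lower()
    for i, choice in enumerate(q.choices):
        if choice.lower() in text:
            return i
    return -1


caption = "A red mug on a wooden table"
questions = [
    MCQuestion("What color is the mug?", ["blue", "red", "green"], 1),
    MCQuestion("What is the mug resting on?", ["wooden table", "sofa"], 0),
    MCQuestion("How many mugs are visible?", ["two", "three"], 0),
]
score = caption_accuracy(caption, questions, keyword_answerer)
```

In this toy case the caption supports two of the three questions; the count question is missed because the caption never states how many mugs are present. A real run would replace `keyword_answerer` with a call to a text-only LLM.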

Paper Structure

This paper contains 72 sections, 2 equations, 30 figures, 26 tables.

Figures (30)

  • Figure 1: CaptionQA taxonomies across four domains, defining the visual information that captions must carry to be useful for downstream tasks. The Natural domain (6 top-level, 22 subcategories) emphasizes object properties, spatial relations, and hallucination; the Document domain (6, 15) targets layout, content, and document-specific structure; the E-commerce domain (7, 16) focuses on product attributes and presentation; and the Embodied AI domain (6, 16) captures perception, spatial understanding, and task-relevant cues for robotics.
  • Figure 2: Comparison of text-only QA LLMs (GPT-5, Gemini 2.5 Pro, DeepSeek-R1 Llama 70B, Qwen2.5 72B) along four axes: faithfulness, efficiency (QPS), stability, and performance.
  • Figure 3: Benchmark construction pipeline. Starting from a human-designed taxonomy and curated images for each domain, we use multiple generators to produce a large pool of taxonomy-guided questions. This pool is then refined by (1) embedding-based deduplication, (2) a text-only blind test to remove questions answerable from priors, (3) dual-VLM quality control to flag ungrounded or reasoning-heavy items, and (4) final human refinement, yielding high-quality, utility-focused QA pairs.
  • Figure 4: Overall gap between QA-on-image and QA-on-caption for GPT-5, Gemini-2.5-Pro, Qwen3-VL-30B-A3B, GLM-4.1V-9B, InternVL3.5-38B, Claude-Sonnet-4.5, and LLaVA-OV-7B. Each bar shows the difference in CaptionQA Acc., averaged over the four domains.
  • Figure 5: Qualitative example of a caption produced under a complex prompt (Taxonomy-Hinted). Although GPT-5 is instructed to describe the image and focus on the provided aspects, it responds in a fill-in-the-blank style and provides much less information than under the Long prompt.
  • ...and 25 more figures
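
The first two refinement stages named in the Figure 3 caption (embedding-based deduplication and the text-only blind test) can be sketched as follows. This is an illustrative approximation, not the authors' pipeline: the bag-of-words `embed` function stands in for a real sentence-embedding model, the `0.8` threshold is an arbitrary assumption, and `answerable_without_image` is a hypothetical callback that would wrap a text-only LLM in practice.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(questions: List[str], threshold: float = 0.8) -> List[str]:
    """Greedily keep a question only if it is not too similar to any kept one.

    The threshold value is an assumption chosen for illustration.
    """
    kept: List[str] = []
    kept_vecs: List[Counter] = []
    for q in questions:
        vec = embed(q)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(q)
            kept_vecs.append(vec)
    return kept


def blind_filter(questions: List[str],
                 answerable_without_image: Callable[[str], bool]) -> List[str]:
    """Drop questions that a text-only model can answer from priors alone."""
    return [q for q in questions if not answerable_without_image(q)]


pool = [
    "What color is the mug?",
    "what color is the mug?",          # near-duplicate of the first question
    "How many chairs are visible?",
    "What color is the sky on a clear day?",  # answerable from priors
]
unique = deduplicate(pool)
# Hypothetical blind test: flag questions that mention "clear day".
grounded = blind_filter(unique, lambda q: "clear day" in q)
```

The later stages (dual-VLM quality control and human refinement) require model calls and manual review, so they are omitted from this sketch.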