Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan

TL;DR

This work tackles the challenge of evaluating and improving detailed image captions produced by modern vision-language models (VLMs). It introduces DCScore, a granular metric that splits captions into primitive information units and assesses precision, recall, and overall fidelity through three steps: decomposition, matching, and verification. Complementing this, DeCapBench is a high-detail captioning benchmark whose ground-truth granularity and correlation with human judgment surpass those of existing benchmarks. To optimize model behavior, the authors propose FeedQuill, a fine-grained feedback collection and PPO-based preference optimization framework that uses per-unit verification signals from multiple VLMs. It substantially reduces hallucinations and improves detailed captioning performance, surpassing GPT-4o on certain tasks. Together, DCScore, DeCapBench, and FeedQuill offer a scalable, human-aligned path to improving both the evaluation and the generation of highly detailed captions in VLMs, with strong generalization across model families.

Abstract

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.
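
The unit-level scoring described above (decompose, match, verify, then report precision and recall) can be illustrated with a minimal sketch. It is not the paper's exact DCScore definition: exact string equality stands in for the matching step, the `verified` set stands in for the VLM's per-unit judgments, and the names `UnitScores` and `score_units` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitScores:
    precision: float
    recall: float
    f1: float


def score_units(generated, reference, verified):
    """Score a caption from its primitive information units.

    generated: units extracted from the model caption
    reference: units extracted from the human-written caption
    verified:  subset of `generated` judged correct against the image
               (assumed to come from a separate verification step)
    """
    generated, reference, verified = set(generated), set(reference), set(verified)
    if not verified <= generated:
        raise ValueError("verified units must be a subset of generated units")

    # Precision: share of generated units that survive verification.
    precision = len(verified) / len(generated) if generated else 0.0

    # Recall: share of reference units also present in the generated caption.
    # Assumption: exact string equality replaces semantic matching.
    matched = generated & reference
    recall = len(matched) / len(reference) if reference else 0.0

    # Harmonic mean of precision and recall as a single summary score.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return UnitScores(precision, recall, f1)


scores = score_units(
    generated={"a dog", "a red ball", "a cat"},
    reference={"a dog", "a red ball", "green grass", "a fence"},
    verified={"a dog", "a red ball"},
)
print(scores)
```

In this toy input, one of three generated units is unverified (a hallucination lowers precision) and two of four reference units are missing (omissions lower recall), which is the separation the metric is meant to expose.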

Paper Structure

This paper contains 44 sections, 2 equations, 7 figures, 16 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of the proposed DCScore for evaluating detailed image captioning. (1) Given the image and prompt, model-generated responses and human-written responses are decomposed into sets of primitive information units. (2) The primitive information units of the generated response, $\mathcal{P}$, are matched against those of the human-written response, $\mathcal{O}$. (3) Each primitive information unit in $\mathcal{P}$ is verified individually by a VLM given the content of the image.
  • Figure 2: (Left) Comparison of four sources for ground-truth captions in terms of correlation between DCScore and human judgments. All p-values are less than $0.001$. (Right) DeCapBench achieves the highest correlation with Arena Elo, with a Spearman's correlation of 0.90 among different VLM benchmarks.
  • Figure 3: Impact of the preference dataset size in terms of downstream performance.
  • Figure 4: Qualitative results of FeedQuill-7B compared with LLaVA-Onevision-7B (Li et al., 2024) in terms of image captioning.
  • Figure 5: Qualitative results of FeedQuill-7B compared with LLaVA-Onevision-7B (Li et al., 2024) in terms of image captioning.
  • ...and 2 more figures
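
The caption of Figure 2 reports agreement between benchmarks and Arena Elo as a Spearman's correlation. As a rough illustration of how such a rank correlation is computed, the sketch below ranks both score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names and the input numbers are hypothetical and are not taken from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four models (illustrative only).
benchmark_scores = [55.0, 60.0, 65.0, 70.0]
arena_elo = [1000, 1100, 1050, 1200]
rho = spearman(benchmark_scores, arena_elo)
print(round(rho, 3))
```

Because only ranks enter the computation, the result reflects whether a benchmark orders models the same way the arena does, regardless of the scale of either score.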