Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan
TL;DR
This work addresses how to evaluate and improve the detailed image captions produced by modern vision-language models (VLMs). It introduces DCScore, a fine-grained metric that breaks a caption into primitive information units and measures precision, recall, and overall fidelity through a three-step process of decomposition, matching, and verification. Complementing the metric, DeCapBench is a detailed captioning benchmark whose annotation granularity and correlation with human judgment surpass those of existing benchmarks. To improve model behavior, the authors propose FeedQuill, a framework that collects fine-grained feedback and applies PPO-based preference optimization, using per-unit verification signals from multiple VLMs. It substantially reduces hallucinations and improves detailed captioning performance, surpassing GPT-4o on certain tasks. Together, DCScore, DeCapBench, and FeedQuill offer a scalable, human-aligned path to improving both the evaluation and the generation of highly detailed captions in VLMs, with strong generalization across model families.
Abstract
Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detailed captioning performance while surpassing GPT-4o.
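To make the unit-level scoring idea concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the decomposition, matching, and verification steps have already been carried out by some model, so each primitive information unit arrives with a boolean verdict. The function name `unit_scores`, the input layout, and the use of a harmonic mean as the overall score are illustrative assumptions rather than details confirmed by the text above.

```python
from typing import Dict


def unit_scores(
    caption_units: Dict[str, bool],
    reference_units: Dict[str, bool],
) -> Dict[str, float]:
    """Score a caption from per-unit verdicts (hypothetical helper).

    caption_units:   unit from the generated caption -> True if it was
                     verified as correct (i.e. not hallucinated).
    reference_units: unit from the reference description -> True if the
                     generated caption was judged to cover it.
    """
    # Precision: share of generated units that are verified as correct.
    precision = (
        sum(caption_units.values()) / len(caption_units)
        if caption_units else 0.0
    )
    # Recall: share of reference units that the caption covers.
    recall = (
        sum(reference_units.values()) / len(reference_units)
        if reference_units else 0.0
    )
    # Overall score: harmonic mean of the two (an assumption in this sketch).
    total = precision + recall
    overall = 2 * precision * recall / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "overall": overall}


# Toy verdicts, made up purely for illustration.
caption_units = {
    "a dog is lying on the grass": True,
    "the dog is brown": True,
    "a red ball is next to the dog": False,  # treated as a hallucination
    "the sky is cloudy": True,
}
reference_units = {
    "a dog is lying on the grass": True,
    "the dog is brown": True,
    "the sky is cloudy": True,
    "a fence stands in the background": False,  # missed by the caption
    "the dog wears a blue collar": False,       # missed by the caption
}

scores = unit_scores(caption_units, reference_units)
print(scores)
```

In this toy case, three of the four generated units are verified and three of the five reference units are covered, so precision and recall differ. Keeping the two numbers separate shows whether a caption loses points for hallucinating or for omitting detail.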
