ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin

TL;DR

ScaleCap tackles biases in open LVLM image captioning by combining a heuristic question answering loop with offline contrastive sentence rating to progressively enrich and calibrate captions within a controllable inference budget. It introduces ScaleCap-450k, a large, high-quality caption dataset, and demonstrates consistent gains across 11 benchmarks and downstream tasks, including VQA and image reconstruction, when used for LVLM pretraining. Within the Prism framework, ScaleCap-based captions are more informative than captions from larger LVLMs, indicating stronger semantic coverage and more faithful visual grounding. The approach offers a practical, scalable path to high-quality captioning by guiding existing models with structured prompts and offline evaluation, enhancing accessibility, retrieval, and multimodal learning.

Abstract

This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias results in imbalanced descriptive granularity, with detailed accounts of some elements and only cursory mentions of others, while linguistic bias leads to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy that continuously enriches and calibrates the caption as the inference budget increases. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, ScaleCap raises more heuristic questions to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap demonstrates the richness and fidelity of its generated captions on two additional tasks: replacing images with captions in a VQA task, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.
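The two components described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the ScaleCap codebase: the function names `rate_sentences` and `scale_caption`, the callables `ask`, `answer`, `logp_with_image`, and `logp_text_only`, and the scoring rule itself. The sketch assumes that a sentence is kept when its log-likelihood is higher with the image visible than without it, which is one plausible reading of sentence-level offline contrastive rating. The abstract does not specify the actual scoring function or threshold.

```python
from typing import Callable, List, Optional


def rate_sentences(
    sentences: List[str],
    logp_with_image: Callable[[str], float],
    logp_text_only: Callable[[str], float],
    threshold: float = 0.0,
) -> List[str]:
    """Keep sentences that the image makes more likely (assumed scoring rule).

    A sentence whose score does not improve when the image is visible is
    treated as driven by language priors and is dropped.
    """
    kept = []
    for sentence in sentences:
        gain = logp_with_image(sentence) - logp_text_only(sentence)
        if gain > threshold:
            kept.append(sentence)
    return kept


def scale_caption(
    initial_sentences: List[str],
    ask: Callable[[List[str]], Optional[str]],
    answer: Callable[[str], str],
    logp_with_image: Callable[[str], float],
    logp_text_only: Callable[[str], float],
    budget: int,
    threshold: float = 0.0,
) -> List[str]:
    """Enrich a caption with up to `budget` heuristic question/answer rounds.

    `ask` is a hypothetical question generator (an LLM in the paper's setting)
    that returns None when it has nothing further to ask. `answer` is a
    hypothetical LVLM call that answers a question about the image.
    """
    # Filter the initial caption before asking anything.
    caption = rate_sentences(
        initial_sentences, logp_with_image, logp_text_only, threshold
    )
    for _ in range(budget):
        question = ask(caption)
        if question is None:
            break
        candidate = answer(question)
        # Each new answer is rated before it is added to the caption.
        caption.extend(
            rate_sentences([candidate], logp_with_image, logp_text_only, threshold)
        )
    return caption
```

In this sketch, `budget` is the only knob that scales inference cost: a larger value allows more question/answer rounds and therefore more model calls, which mirrors the abstract's claim that more heuristic questions are raised as inference cost increases.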

Paper Structure

This paper contains 21 sections, 7 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: Comparison between the captions generated by our ScaleCap and those produced by other advanced VLMs. The parts of the caption that are bolded refer to the detailed descriptions of the object, while the parts that do not mention the target object are included in the ellipsis.
  • Figure 2: Certain object details are omitted from LVLM captions mainly because guiding heuristic questions are absent, not because perceptual capability is insufficient. We also observe that 7B and 72B LVLMs exhibit similar perceptual capabilities.
  • Figure 3: Overview of ScaleCap. ScaleCap is composed of two synergistic parts: heuristic question answering and contrastive sentence rating. The first module uses a general-purpose LLM to create guiding questions, and the second module addresses hallucinations with an offline contrastive strategy.
  • Figure 4: Data processing and analysis. During the image collecting and processing stage, we primarily focus on the diversity and richness of image content. In the resulting ScaleCap-450k, the captions are significantly longer than those in other datasets.
  • Figure 5: Benchmark performance under different amounts of pretraining data.
  • ...and 4 more figures