Generating Accurate and Detailed Captions for High-Resolution Images

Hankyeol Lee, Gawon Seo, Kyounggyu Lee, Dogun Kim, Kyungwoo Song, Jiyoung Jung

TL;DR

The paper tackles the difficulty of generating accurate captions for high-resolution images by addressing the resolution mismatch in vision-language models. It introduces a training-free pipeline that combines a VLM, an LLM, and open-vocabulary detectors to progressively refine captions, verify object presence, and generate region-specific details, followed by a rephrasing step to ensure coherence. Through pairwise and POPE-based evaluations on a high-resolution Objects365 subset, the approach yields more detailed and reliable captions and reduces hallucinations across multiple captioners. The work showcases the practical value of cross-modal verification and zoom-in captioning for high-resolution scenes, while highlighting computational latency and detector-dependence as avenues for future improvement.

Abstract

Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.
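
The stages described in the abstract can be sketched as a small orchestration function. This is only an illustrative outline, not the authors' implementation: the `Models` container, its field names, the `refine_caption` function, and the toy stand-ins below are all hypothetical, and the actual prompts, detectors, and thresholds are not given in this summary. Each model is passed in as a plain callable so that any captioner, LLM, or open-vocabulary detector could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); hypothetical convention


@dataclass
class Models:
    """Hypothetical bundle of the three model roles named in the abstract."""
    caption: Callable[[object, Optional[Box]], str]          # VLM; box=None means whole image
    extract_objects: Callable[[str], List[str]]              # LLM: key objects in a caption
    propose_cooccurring: Callable[[List[str]], List[str]]    # LLM: likely co-occurring objects
    detect: Callable[[object, List[str]], Dict[str, Box]]    # detector: label -> box if found
    rephrase: Callable[[str, List[str], List[str]], str]     # LLM: merge details, drop unverified


def refine_caption(image: object, m: Models) -> str:
    # 1. Initial caption from the VLM on the (downscaled) full image.
    initial = m.caption(image, None)
    # 2. Key objects mentioned in the caption, plus objects likely to co-occur.
    mentioned = m.extract_objects(initial)
    proposed = [o for o in m.propose_cooccurring(mentioned) if o not in mentioned]
    # 3. Verify every candidate with the detector.
    found = m.detect(image, mentioned + proposed)
    undetected = [o for o in mentioned if o not in found]   # possible hallucinations
    new_objects = [o for o in proposed if o in found]       # missed by the initial caption
    # 4. Region-specific ("zoom-in") captions for newly detected objects.
    region_captions = [m.caption(image, found[o]) for o in new_objects]
    # 5. Rephrase into one coherent caption without the undetected objects.
    return m.rephrase(initial, region_captions, undetected)


def toy_models() -> Models:
    """Deterministic stand-ins so the sketch runs without any real model."""
    scene = {"dog": (10, 20, 200, 220), "leash": (210, 180, 300, 230)}

    def caption(image, box):
        if box is None:
            return "A dog sits on the grass. A frisbee lies nearby."
        return "A red leash lies by the dog."

    def extract_objects(text):
        return [w for w in ("dog", "frisbee", "leash", "ball") if w in text.lower()]

    def propose_cooccurring(objects):
        return ["leash", "ball"]

    def detect(image, labels):
        return {label: scene[label] for label in labels if label in scene}

    def rephrase(initial, region_caps, undetected):
        kept = [s for s in initial.split(". ")
                if not any(o in s.lower() for o in undetected)]
        return " ".join(s.rstrip(".") + "." for s in kept + region_caps)

    return Models(caption, extract_objects, propose_cooccurring, detect, rephrase)


if __name__ == "__main__":
    print(refine_caption(None, toy_models()))
```

In the toy run, the unverified "frisbee" sentence is dropped and the detected "leash" gains a region-level description, which mirrors the two effects the abstract claims: added detail and fewer hallucinated objects.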

Paper Structure

This paper contains 18 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of the caption refinement pipeline. The process begins with initial caption generation using a captioner (VLM). Next, potentially co-occurring objects are identified with the help of a large language model, followed by verifying the existence of objects using a detector. Subsequently, detailed captioning is performed to incorporate newly detected objects. Finally, the enhanced caption is generated by rephrasing with the LLM. The pipeline enhances the image caption to improve its accuracy and descriptive detail.
  • Figure 2: Winning rates for pairwise caption comparison.
  • Figure 3: Qualitative comparison of initial and enhanced captions. Green text marks newly added information, and red text marks hallucinated or incorrect information. Our method generates detailed and reliable captions.