OPCap: Object-aware Prompting Captioning

Feiyang Huang

TL;DR

OPCap addresses object hallucination in image captioning by incorporating explicit object-level information through an object detector and attribute predictor. It avoids external language models and large-scale data, instead embedding object tokens into a Transformer-based decoder. Across COCO and nocaps, OPCap reduces hallucinations (CHAIRs/CHAIRi) and improves caption quality, though performance depends on detector quality and encoder choice. This approach provides a resource-efficient, fine-grained representation that aids robust captioning in diverse and unseen scenarios.
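For readers unfamiliar with the CHAIRs/CHAIRi metrics mentioned above, the following is a minimal sketch of how they are commonly defined: CHAIRs is the fraction of captions that mention at least one object not present in the image, and CHAIRi is the fraction of all mentioned objects that are not present. The sketch assumes the object mentions have already been extracted from each caption and mapped to the same category vocabulary as the ground truth; the function name `chair_scores` and its arguments are illustrative and not taken from the paper.

```python
from typing import Sequence, Set, Tuple


def chair_scores(
    mentioned_objects: Sequence[Set[str]],
    ground_truth_objects: Sequence[Set[str]],
) -> Tuple[float, float]:
    """Return (CHAIRs, CHAIRi) for a batch of captions.

    mentioned_objects[k]    -- object categories named in caption k (assumed pre-extracted)
    ground_truth_objects[k] -- object categories actually present in image k
    """
    if len(mentioned_objects) != len(ground_truth_objects):
        raise ValueError("need one ground-truth set per caption")

    hallucinated_captions = 0  # captions with at least one hallucinated object
    hallucinated_mentions = 0  # hallucinated object mentions, summed over captions
    total_mentions = 0         # all object mentions, summed over captions

    for mentioned, present in zip(mentioned_objects, ground_truth_objects):
        wrong = mentioned - present  # mentioned but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        if wrong:
            hallucinated_captions += 1

    chair_s = hallucinated_captions / len(mentioned_objects) if mentioned_objects else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i


# Toy example: the first caption hallucinates "frisbee", the second is clean.
captions = [{"dog", "frisbee"}, {"cat", "sofa"}]
truth = [{"dog"}, {"cat", "sofa", "tv"}]
chair_s, chair_i = chair_scores(captions, truth)
```

In the toy example one of two captions contains a hallucinated object and one of four object mentions is hallucinated. Lower values are better for both metrics.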

Abstract

In image captioning, object bias (or hallucination) refers to captions that describe an image using objects that are missing from it or do not exist. To mitigate this issue, we propose a target-aware prompting strategy. This method first extracts object labels and their spatial information from the image using an object detector. Then, an attribute predictor further refines the semantic features of the objects. These refined features are subsequently integrated and fed into the decoder, enhancing the model's understanding of the image context. Experimental results on the COCO and nocaps datasets demonstrate that OPCap effectively mitigates hallucination and significantly improves the quality of generated captions.
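The detect, refine, and integrate flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Detection` record, the `build_object_prompt` helper, the confidence threshold, and the cap on the number of objects are all hypothetical choices made for illustration, and a real system would embed the resulting tokens (together with the box coordinates) before passing them to the decoder.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Detection:
    """One detector output, enriched by an attribute predictor (hypothetical schema)."""
    label: str                              # object category, e.g. "dog"
    attribute: str                          # predicted attribute, e.g. "brown" ("" if none)
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2); carried along, unused here
    score: float                            # detector confidence in [0, 1]


def build_object_prompt(
    detections: Sequence[Detection],
    min_score: float = 0.5,   # assumed threshold, not from the paper
    max_objects: int = 5,     # assumed cap, not from the paper
) -> List[str]:
    """Turn detections into an ordered list of object phrases for the decoder prompt."""
    # Drop low-confidence detections, then order the rest by confidence.
    kept = sorted(
        (d for d in detections if d.score >= min_score),
        key=lambda d: d.score,
        reverse=True,
    )
    phrases: List[str] = []
    for det in kept:
        phrase = f"{det.attribute} {det.label}".strip()
        if phrase not in phrases:  # skip duplicate attribute-label pairs
            phrases.append(phrase)
        if len(phrases) == max_objects:
            break
    return phrases


detections = [
    Detection("dog", "brown", (10.0, 20.0, 110.0, 180.0), 0.9),
    Detection("frisbee", "red", (120.0, 30.0, 160.0, 60.0), 0.7),
    Detection("cat", "", (0.0, 0.0, 5.0, 5.0), 0.3),             # below threshold
    Detection("dog", "brown", (12.0, 22.0, 108.0, 178.0), 0.8),  # duplicate phrase
]
prompt = build_object_prompt(detections)
```

Filtering by confidence reflects the TL;DR's caveat that performance depends on detector quality: a noisy detection that reaches the prompt could itself introduce a hallucinated object.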

Paper Structure

This paper contains 13 sections, 7 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Examples of hallucination
  • Figure 2: OPCap Architecture: The architecture consists of three modules, including the image encoder, object detector + attribute predictor, and text decoder.