No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Manu Gaur, Darshan Singh, Makarand Tapaswi

TL;DR

The paper tackles the difficulty of producing fine-grained, faithful image captions by addressing data quality, evaluation, and training biases. It introduces Visual Caption Boosting (VCB) to densify captions grounded in human annotations, and TrueMatch to rigorously evaluate fine-grained captioning via self-retrieval. A curriculum-based self-retrieval training regime (BagCurri) with joint CLIP/GPT-2 fine-tuning and optional CIDEr augmentation yields substantial gains over vanilla SR, even beating larger models on TrueMatch. The work highlights the limitations of standard metrics for fine-grained captioning and provides a practical, scalable recipe for improving discriminative captions while preserving faithfulness, with broad implications for captioning benchmarks and evaluation.

Abstract

Image captioning systems are unable to generate fine-grained captions because they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training, which encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that better leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023) and by +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval through the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g., +4.8% to +7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
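To make the self-retrieval setup concrete, the following is a minimal sketch of how recall@1 against random distractors could be computed. It is not the authors' implementation. It assumes caption and image embeddings already live in a shared space (for example, from a CLIP-style encoder), the function names `cosine` and `self_retrieval_recall_at_1` are hypothetical, and counting ties as misses is an assumption made here for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def self_retrieval_recall_at_1(caption_embs, image_embs, num_distractors=99, seed=0):
    """Fraction of captions that retrieve their own image from a bag made of
    the target image plus randomly sampled distractor images.

    Hypothetical helper: with num_distractors=99 this mirrors the RD100
    setting (1 target + 99 random distractors) described in the abstract.
    """
    rng = random.Random(seed)
    n = len(image_embs)
    hits = 0
    for i, cap in enumerate(caption_embs):
        others = [j for j in range(n) if j != i]
        distractors = rng.sample(others, min(num_distractors, len(others)))
        target_score = cosine(cap, image_embs[i])
        # Assumption: a caption is a hit only if its own image scores
        # strictly higher than every distractor, so ties count as misses.
        if all(target_score > cosine(cap, image_embs[j]) for j in distractors):
            hits += 1
    return hits / n


# Toy embeddings: three clearly distinct images.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Captions that each point toward their own image.
specific = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# One generic caption reused for every image.
generic = [[1.0, 1.0, 1.0]] * 3

print(self_retrieval_recall_at_1(specific, image_embs, num_distractors=2))  # 1.0
print(self_retrieval_recall_at_1(generic, image_embs, num_distractors=2))   # 0.0
```

The toy data illustrates why the metric rewards discriminative captions: a generic caption that fits every image equally well cannot single out its own image, so it scores zero, while image-specific captions retrieve their targets.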

Paper Structure (75 sections, 2 equations, 10 figures, 15 tables, 2 algorithms)

Figures (10)

  • Figure 1: For similar images, captioning systems struggle to generate meaningful captions that uniquely describe each image. In this example, COCO MLE: A model trained on COCO with MLE generates the same generic caption for the first two images. COCO SR (Dessi et al., 2023): While the self-retrieval (SR) objective may help, the COCO captions are not rich enough to generate salient visual details. OUR SR: Our improved data and training recipe results in fine-grained, and therefore discriminative, captions.
  • Figure 2: Example of Visual Caption Boosting transforming the original human-annotated captions (left) into a Holistic Caption. First, an LLM blends the human annotations to create a Blended Caption. Next, an MLLM generates a dense visual caption that may be noisy. Finally, we create a Holistic Caption by instructing the LLM to combine fine-grained details from the Visual Caption with the Blended Caption, while staying anchored in human annotations in case of conflicts. Specific prompts are shared in the appendix. The colors indicate various concepts extracted from the human annotations or the visual caption. The red underlined text (marked by us for ease of understanding), which indicates hallucinations or verbose text, is ignored in the Holistic Caption as we anchor it to human annotations.
  • Figure 3: RD100 R@1 continually increases while CIDEr degrades when fine-tuning ClipCap with SR-L on COCO for 100 epochs.
  • Figure 4: Histogram of number of words for COCO, BlendCap, and HolisticCap. HolisticCap is more descriptive with 41.5 words per caption on average, while COCO only has 10.5 words on average.
  • Figure 5: Our curriculum over bag sizes during training.
  • ...and 5 more figures