Fluent and Accurate Image Captioning with a Self-Trained Reward Model

Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR

Self-Cap addresses the limitations of CIDEr-only and CLIP-based rewards in image captioning by learning a discriminative reward model trained with self-generated negatives. It uses a two-stage approach: first, fine-tune a CLIP-based discriminator on hard negatives generated by the captioner itself; then, optimize the captioner with this learned reward via an SCST-like policy gradient. Experiments on COCO and on zero-shot/out-of-domain datasets (nocaps, VizWiz, CC3M) show more descriptive captions, better grammaticality, and less repetition than training with CLIP-S or PAC-S rewards. The method is competitive with state-of-the-art approaches on standard supervised metrics, generalizes robustly, and reduces fine-tuning time relative to CIDEr-only optimization. This work contributes a practical framework for grammar-aware, semantically rich captioning using self-supervised rewards.

Abstract

Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. Conversely, recent attempts to employ image-text models like CLIP as a reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which not only significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.
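
To make the reward-driven fine-tuning step concrete, the following is a minimal sketch of an SCST-like policy-gradient loss in which the reward comes from a learned discriminator rather than from CIDEr. It is an illustration under stated assumptions, not the authors' implementation: the function name `scst_loss`, the use of the mean reward over sampled captions as the baseline, and the plain-list inputs are all hypothetical choices made for readability.

```python
def scst_loss(log_probs, rewards):
    """SCST-like loss for one image (illustrative sketch).

    log_probs: summed token log-probabilities, one value per sampled caption.
    rewards:   scalar discriminator scores for the same captions
               (assumed to be image-caption consistency scores).
    """
    if len(log_probs) != len(rewards) or not rewards:
        raise ValueError("need one reward per sampled caption")

    # Assumption: the baseline is the mean reward of the sampled captions.
    baseline = sum(rewards) / len(rewards)

    # Captions scoring above the baseline get a positive advantage,
    # so minimizing the loss raises their log-probability.
    advantages = [r - baseline for r in rewards]
    weighted = sum(a * lp for a, lp in zip(advantages, log_probs))
    return -weighted / len(rewards)


# Two sampled captions: the first is rated higher by the discriminator.
loss = scst_loss(log_probs=[-1.0, -2.0], rewards=[1.0, 0.0])
```

In an actual training loop the log-probabilities would come from the captioner being optimized and would carry gradients; here they are plain floats so that only the arithmetic of the objective is shown.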

Paper Structure (14 sections, 3 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Overview of our approach. On the left, the training strategy of the captioner model is shown. The model acts as an agent that receives rewards from a discriminator trained with textual negatives derived directly from the model itself (right).
  • Figure 2: Overview of our self-discriminator approach, in which both CLIP encoders are fine-tuned with low-rank adaptation (LoRA) using additional textual negatives.
  • Figure 3: Qualitative results on COCO sample images, comparing Self-Cap with a model trained using PAC-S as reward.
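
The discriminator training summarized in Figure 2 can be illustrated with a generic image-to-text contrastive loss in which each image sees one extra textual negative in addition to the other captions in the batch. This is a sketch under assumptions, not the paper's exact objective: the function name `image_to_text_loss`, the single negative per image, the temperature value of 0.07, and the pure-Python vector helpers are all hypothetical, and LoRA fine-tuning of the encoders is not shown.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def image_to_text_loss(image_embs, text_embs, negative_embs, temperature=0.07):
    """InfoNCE-style loss with one additional textual negative per image.

    image_embs[i] and text_embs[i] form a matching pair; negative_embs[i]
    is the embedding of a caption assumed to be generated by a frozen
    captioner for image i and treated as a hard negative.
    """
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    negatives = [_normalize(v) for v in negative_embs]

    total = 0.0
    for i, img in enumerate(images):
        # Similarities to every caption in the batch, then the extra negative.
        logits = [_dot(img, t) / temperature for t in texts]
        logits.append(_dot(img, negatives[i]) / temperature)

        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))

        # Cross-entropy with the matching caption (index i) as the target.
        total += log_z - logits[i]
    return total / len(images)


images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]
self_negatives = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = image_to_text_loss(images, captions, self_negatives)
loss_swapped = image_to_text_loss(images, captions[::-1], self_negatives)
```

With these toy embeddings the loss is lower when each caption is aligned with its own image than when the captions are swapped, which is the behavior a discriminator used as a reward needs to learn.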