Fluent and Accurate Image Captioning with a Self-Trained Reward Model
Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
TL;DR
Self-Cap addresses the limitations of CIDEr-only and CLIP-based rewards in image captioning by learning a discriminative reward model trained with self-generated negatives. It uses a two-stage approach: first, fine-tune a CLIP-based discriminator on hard negatives generated by a captioner; then, optimize the captioner with this learnable reward via an SCST-like policy gradient. Experiments on COCO and on zero-shot/out-of-domain datasets (nocaps, VizWiz, CC3M) show stronger descriptive quality, improved grammaticality, and reduced repetition compared with CLIP-S or PAC-S rewards. The method achieves results competitive with the state of the art on supervised metrics, generalizes robustly to out-of-domain data, and reduces fine-tuning time relative to CIDEr-only optimization. This work contributes a practical framework for grammar-aware, semantically rich captioning with a self-trained reward.
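To make the second stage concrete, the following is a minimal sketch of an SCST-like loss driven by a learned reward. It is an illustration under stated assumptions, not the authors' implementation: the function name `scst_loss`, the use of a greedy-decoding baseline (as in the original SCST formulation), and the toy numbers are all hypothetical. In practice the rewards would come from the trained discriminator and the log-probabilities from the captioner.

```python
import numpy as np

def scst_loss(sample_logprobs, sample_rewards, baseline_rewards):
    """SCST-style policy-gradient loss (hypothetical sketch).

    sample_logprobs:  (B,) summed token log-probabilities of each sampled caption
    sample_rewards:   (B,) reward-model scores of the sampled captions
    baseline_rewards: (B,) scores of baseline captions (assumed: greedy decoding)
    """
    sample_logprobs = np.asarray(sample_logprobs, dtype=float)
    # The advantage is treated as a constant: no gradient flows through the reward.
    advantage = np.asarray(sample_rewards, dtype=float) - np.asarray(baseline_rewards, dtype=float)
    # Minimizing this raises the likelihood of captions that beat the baseline
    # and lowers the likelihood of captions that fall below it.
    return float(-(advantage * sample_logprobs).mean())

# Toy batch of two sampled captions.
loss = scst_loss(
    sample_logprobs=[-2.0, -3.0],
    sample_rewards=[0.8, 0.4],
    baseline_rewards=[0.5, 0.5],
)
```

In an autodiff framework the same expression would be applied to the captioner's log-probabilities, with the rewards detached from the computation graph.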
Abstract
Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness, and it tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. Conversely, recent attempts to employ image-text models like CLIP as a reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically occur when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which not only significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time compared with using the CIDEr score as the sole optimization objective. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.
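As a rough illustration of how negatives from a frozen captioner could enter a contrastive objective, the sketch below computes an InfoNCE-style loss for one image, one ground-truth caption, and K generated negatives. The exact loss used by Self-Cap is not specified in this abstract, so the formulation, the function name `hard_negative_contrastive_loss`, and the temperature value are assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def hard_negative_contrastive_loss(img, pos, negs, temperature=0.07):
    """InfoNCE-style loss for a single image (hypothetical sketch).

    img:  (D,)   image embedding
    pos:  (D,)   embedding of the ground-truth caption
    negs: (K, D) embeddings of captions produced by a frozen captioner
    """
    img = l2_normalize(np.asarray(img, dtype=float))
    pos = l2_normalize(np.asarray(pos, dtype=float))
    negs = l2_normalize(np.asarray(negs, dtype=float))
    # Index 0 holds the positive pair; the remaining entries are the negatives.
    logits = np.concatenate([[img @ pos], negs @ img]) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_pos)

# Toy 2-D embeddings: the positive matches the image, the negative is orthogonal.
easy = hard_negative_contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# A negative identical to the positive cannot be separated from it.
tied = hard_negative_contrastive_loss([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]])
```

The loss is near zero when the ground-truth caption scores well above the generated negatives, and it grows as the negatives become harder to tell apart from the positive, which is what pushes the discriminator to penalize the captioner's own failure modes.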
