Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning

Antoine Chaffin, Ewa Kijak, Vincent Claveau

TL;DR

This work addresses the demand for highly distinctive image captions by enriching CLIP-guided reinforcement learning with ground-truth (GT) captions. It introduces three GT-based contributions: a CLIP-space discriminator that regularizes generation, reward-weighted teacher forcing that grounds exploration in human-written captions, and a bidirectional contrastive reward that uses the strongest baselines in the batch in both the image-to-text and text-to-image directions. Experiments on MS-COCO show that these components improve retrieval-oriented distinctiveness while preserving writing quality, with GT trajectories helping to combat reward hacking and vocabulary collapse. The approach provides a framework for producing informative captions suitable for retrieval and accessibility, and suggests jointly training CLIP models and captioners as future work.

Abstract

Training image captioning models using teacher forcing results in very generic samples, whereas more distinctive captions can be very useful in retrieval applications or to produce alternative texts describing images for accessibility. Reinforcement Learning (RL) makes it possible to use the cross-modal retrieval similarity score between the generated caption and the input image as a reward to guide training, leading to more distinctive captions. Recent studies show that pre-trained cross-modal retrieval models can be used to provide this reward, completely eliminating the need for reference captions. However, we argue in this paper that Ground Truth (GT) captions can still be useful in this RL framework. We propose a new training strategy for image captioning models that makes use of GT captions in three ways. Firstly, they can be used to train a simple MLP discriminator that serves as a regularizer to prevent reward hacking and ensure the fluency of generated captions, resulting in a textual GAN setup extended to multimodal inputs. Secondly, they can serve as additional trajectories in the RL strategy, resulting in a teacher forcing loss weighted by the similarity of the GT to the image. This objective acts as an additional learning signal grounded in the distribution of the GT captions. Thirdly, they can serve as strong baselines when added to the pool of captions used to compute the proposed contrastive reward, reducing the variance of the gradient estimate. Experiments on MS-COCO demonstrate the benefit of the proposed training strategy for producing highly distinctive captions while maintaining high writing quality.
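To make the third use of GT captions more concrete, here is a minimal sketch of what a bidirectional contrastive reward with GT captions in the candidate pool could look like. It is not taken from the paper: the InfoNCE-style log-softmax form, the equal averaging of the two directions, the function name `bidirectional_contrastive_reward`, and the `temperature` value are all assumptions made for illustration, and the embeddings are plain arrays standing in for CLIP outputs.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def bidirectional_contrastive_reward(image_emb, gen_emb, gt_emb, temperature=0.07):
    """Hypothetical reward; row i of each (B, D) array belongs to image i.

    image_emb: image embeddings
    gen_emb:   embeddings of the generated captions
    gt_emb:    embeddings of the ground-truth captions (used as extra baselines)
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    img, gen, gt = normalize(image_emb), normalize(gen_emb), normalize(gt_emb)
    idx = np.arange(img.shape[0])

    # Image-to-text: each image is compared with every generated caption
    # and every GT caption in the batch (assumed pooling scheme).
    pool = np.concatenate([gen, gt], axis=0)            # (2B, D)
    i2t = img @ pool.T / temperature                    # (B, 2B)
    i2t_reward = log_softmax(i2t, axis=1)[idx, idx]     # own generated caption

    # Text-to-image: each generated caption is compared with every image.
    t2i = gen @ img.T / temperature                     # (B, B)
    t2i_reward = log_softmax(t2i, axis=1)[idx, idx]     # own image

    # Equal weighting of both directions is an assumption of this sketch.
    return 0.5 * (i2t_reward + t2i_reward)
```

Under these assumptions the reward is a log-probability, so it is never positive, and a caption only scores close to zero when it matches its own image better than every other caption in the pool, including the human-written ones.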

Paper Structure (22 sections, 5 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Examples of images with an overly generic ground truth caption, a caption generated by a model without regularization (leading to reward hacking), and the caption generated by our approach (well-written and distinctive).
  • Figure 2: Overview of the proposed captioning model training. Generated and ground-truth captions, as well as input and mined similar images, are projected into the CLIP embedding space. These representations are used to compute the reward, composed of a discriminator score (Section \ref{ssec:discriminator}) and a CLIP-based bidirectional contrastive similarity score (Section \ref{ssec:contrastive_reward}), for beam search and ground-truth samples (Section \ref{ssec:wtf}) (in blue in the reward computation block).
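
The reward-weighted teacher forcing described in the abstract, where GT captions act as additional RL trajectories, can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the function name `weighted_teacher_forcing_loss`, the `pad_id` convention, the summing of token log-probabilities per caption, and the batch averaging are assumptions, and the reward is treated as a constant with respect to the model parameters.

```python
import numpy as np


def weighted_teacher_forcing_loss(token_logits, gt_tokens, gt_reward, pad_id=0):
    """Hypothetical reward-weighted teacher forcing loss.

    token_logits: (B, T, V) logits obtained by feeding the GT caption to the model
    gt_tokens:    (B, T) integer ids of the GT caption tokens
    gt_reward:    (B,) reward of each GT caption for its image (held constant)
    """
    # Stable log-softmax over the vocabulary.
    shifted = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Log-probability the model assigns to each GT token.
    token_lp = np.take_along_axis(log_probs, gt_tokens[..., None], axis=-1)
    token_lp = token_lp.squeeze(-1)                       # (B, T)

    # Ignore padding positions (assumed to use pad_id).
    mask = (gt_tokens != pad_id).astype(log_probs.dtype)
    seq_lp = (token_lp * mask).sum(axis=1)                # (B,)

    # Policy-gradient form with the GT caption as the sampled trajectory:
    # a standard teacher forcing loss scaled by the caption's reward.
    return float(-(gt_reward * seq_lp).mean())
```

With a reward of 1 for every caption this reduces to the usual teacher forcing negative log-likelihood, so the weighting only changes how strongly each GT caption pulls on the model.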