Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR

This work tackles the instability that arises when optimizing modern multimodal captioning metrics. It introduces Direct CLIP-Based Optimization (DiCO), which distills a reward model from a learnable captioning evaluator and integrates it directly into the captioner, with a KL regularizer that prevents divergence from the pre-trained model. The method derives a PPO-inspired objective with a distilled reward and trains the reward model via a contrastive winner-vs-losers objective, which allows gradient-based fine-tuning without reinforcement learning. Empirically, DiCO achieves state-of-the-art performance on CLIP-based metrics and retrieval measures while remaining competitive on traditional metrics and training stably across diverse backbones and datasets. Overall, DiCO offers a practical, human-aligned, and robust alternative to SCST and RLHF for modern image captioning, with strong generalization to out-of-domain data.
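To make the winner-vs-losers idea concrete, here is a minimal NumPy sketch of one plausible form of such a loss for a single image. It is an illustration under stated assumptions, not the paper's exact objective: the function name `winner_vs_losers_loss`, the softmax weighting of losers by score gap, and the temperature `tau` are hypothetical. The implicit reward is taken to be the `beta`-scaled log-ratio between the fine-tuned captioner and the frozen pre-trained one, which is what keeps the model close to its starting point.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def winner_vs_losers_loss(logp_policy, logp_ref, scores, beta=0.2, tau=1.0):
    """Hypothetical winner-vs-losers loss for the k candidate captions of one image.

    logp_policy, logp_ref: sequence log-probabilities of each caption under the
        captioner being fine-tuned and under the frozen pre-trained captioner.
    scores: evaluator scores (e.g. from a CLIP-based metric) for the same captions.
    beta: strength of the pull towards the pre-trained captioner.
    tau: assumed temperature for the loser weights (not taken from the source).
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Implicit reward: beta-scaled log-ratio against the reference model.
    implicit = beta * (logp_policy - logp_ref)

    # The winner is the caption the evaluator scores highest; the rest are losers.
    w = int(np.argmax(scores))
    losers = np.delete(np.arange(len(scores)), w)

    # Assumed weighting: softmax over the score gap between winner and each loser.
    gaps = (scores[w] - scores[losers]) / tau
    weights = np.exp(gaps - gaps.max())
    weights /= weights.sum()

    # Weighted binary classification: the winner should out-reward every loser.
    margins = implicit[w] - implicit[losers]
    return float(-(weights * log_sigmoid(margins)).sum())
```

When the fine-tuned captioner equals the pre-trained one, every margin is zero and the loss is log 2; raising the winner's log-probability relative to the reference lowers it.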

Abstract

The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.
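For readers unfamiliar with the metrics named above, CLIP-Score is commonly defined as a rescaled, clipped cosine similarity between the CLIP image and caption embeddings, with rescaling factor w = 2.5. The sketch below shows only that formula; the embeddings are toy vectors standing in for real CLIP outputs, and the function name `clip_score` is hypothetical.

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIP-Score-style value: w * max(cos(image, text), 0).

    image_emb, text_emb: 1-D embedding vectors (toy stand-ins here, not
        produced by an actual CLIP model).
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    cos = image_emb @ text_emb / (
        np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    )
    # Negative similarities are clipped to zero before rescaling.
    return float(w * max(cos, 0.0))
```

Because the score depends only on the image and the candidate caption, it is reference-free, which is part of why optimizing it directly can drift away from fluent text unless the captioner is kept close to its pre-trained version.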

Paper Structure

This paper contains 13 sections, 9 equations, 15 figures, and 10 tables.

Figures (15)

  • Figure 1: Comparison between SCST (Rennie et al., 2017) and our Direct CLIP-Based Optimization (DiCO). DiCO distills a reward model from a learnable CLIP-based captioning evaluator without requiring reinforcement learning, while preventing reward hacking and divergence.
  • Figure 2: Overview of our approach. Given an image and candidate generations, the figure shows the process for captioner fine-tuning by distilling from a CLIP-based evaluator.
  • Figure 3: Qualitative results on COCO sample images, using PAC-S as reward.
  • Figure 4: Metric curves when optimizing CLIP-S (top) and PAC-S (bottom) scores with DiCO and SCST. The red dot indicates the early stopping point we employ.
  • Figure 5: CIDEr, CLIP-S, and PAC-S scores when changing the $\beta$ parameter using ViT-L/14 as backbone. Higher $\beta$ values prevent the model from deviating from the pre-trained captioner, while penalizing reference-free metrics. The best trade-off is given by $\beta=0.2$.
  • ...and 10 more figures