VLRM: Vision-Language Models act as Reward Models for Image Captioning

Maksim Dzabraev, Alexander Kunitsyn, Andrei Ivaniuta

TL;DR

Captions produced by standard image captioning models are often shallow because of dataset biases. This paper introduces VLRM, an unsupervised reinforcement-learning fine-tuning framework that uses vision-language models (e.g., CLIP, BLIP2-ITM) as the reward signal to reshape caption outputs without human-labeled data. The method uses a three-step training loop with a learnable value head and an A2C-style policy update, adds penalties for bad phrases and repetition, and can include a reference model to keep captions natural. Empirically, VLRM and its retrieval-specialized variant achieve substantial gains in CLIP Recall and R@1 on the MS-COCO Karpathy test split, producing longer, richer captions that mention more colors while maintaining grammatical quality. The work demonstrates a practical, low-overhead way to improve image captioning and points toward broader use of vision-language rewards in multimodal systems.
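The reward shaping and A2C-style update mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the penalty weights, and the way repetition is counted are all assumptions made for illustration, and a real setup would take the similarity score from a reward model such as CLIP and compute gradients with an autograd framework.

```python
def shaped_reward(similarity, caption, bad_phrases,
                  bad_penalty=1.0, repeat_penalty=0.5):
    """Reward-model score minus penalties.

    `similarity` stands in for an image-text score from a reward model.
    The penalty weights and the repetition count are hypothetical choices.
    """
    text = caption.lower()
    words = text.split()
    n_bad = sum(text.count(phrase) for phrase in bad_phrases)
    n_repeat = len(words) - len(set(words))  # number of repeated words
    return similarity - bad_penalty * n_bad - repeat_penalty * n_repeat


def a2c_losses(log_probs, value, reward):
    """A2C-style losses for one sampled caption.

    log_probs: log-probabilities of the sampled tokens under the policy.
    value:     value-head estimate of the expected reward.
    The advantage is treated as a constant in the policy loss.
    """
    advantage = reward - value
    policy_loss = -advantage * sum(log_probs)
    value_loss = advantage ** 2
    return policy_loss, value_loss


reward = shaped_reward(0.8, "a red red car", bad_phrases=["an image of"])
policy_loss, value_loss = a2c_losses([-0.1, -0.2], value=0.1, reward=reward)
```

A caption that scores above the value head's estimate gets a positive advantage, so minimizing the policy loss raises the probability of its tokens; the penalties lower the reward of captions that repeat words or contain unwanted phrases.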

Abstract

In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning and vision-language models like CLIP and BLIP2-ITM as reward models. The RL-tuned model is able to generate longer and more comprehensive descriptions. Our model reaches an impressive 0.90 R@1 CLIP Recall score on the MS-COCO Karpathy test split. Weights are available at https://huggingface.co/sashakunitsyn/vlrm-blip2-opt-2.7b.
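For readers unfamiliar with the metric, R@1 can be sketched as follows. This is the generic recall-at-1 definition for caption-to-image retrieval; the exact evaluation protocol behind the reported score (which CLIP variant, which candidate pool) is not given in this summary, so the setup below is an assumption, and the similarity values are made up.

```python
def recall_at_1(sim):
    """Fraction of captions whose top-ranked image is the correct one.

    sim[i][j] is the similarity between caption i and image j.
    Caption i is assumed to describe image i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)


# Toy similarity matrix: captions 0 and 2 retrieve their own image,
# caption 1 retrieves image 2 instead of image 1.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.2, 0.7],
]
score = recall_at_1(toy_sim)
```

Under this definition, a score of 0.90 would mean that 90% of generated captions rank their own source image first among the candidates.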

Paper Structure (22 sections, 6 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: The illustration of the training iteration.
  • Figure 2: The architecture of the value head.