VLRM: Vision-Language Models act as Reward Models for Image Captioning
Maksim Dzabraev, Alexander Kunitsyn, Andrei Ivaniuta
TL;DR
Captions produced by standard image captioning models are often shallow because of dataset biases. This paper introduces VLRM, an unsupervised reinforcement-learning fine-tuning framework that uses vision-language models (e.g., CLIP, BLIP2-ITM) as reward models to reshape caption outputs without human-labeled data. The method employs a three-step training loop with a learnable value head and an A2C-style policy update, augmented by penalties for bad phrases and repetition, and can include a reference model for naturalness. Empirically, VLRM and its retrieval-specialized variant achieve substantial gains in CLIP Recall and R@1 on the MS-COCO Karpathy test split, producing longer, richer, and more color-rich captions while maintaining grammatical quality. The work demonstrates a practical, low-overhead way to improve image captioning and points toward broader applicability of vision-language rewards in multimodal systems.
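To make the reward shaping and the A2C-style update more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the penalty weights, the example bad phrase, and the bigram-based repetition count are all assumptions chosen for illustration. The vision-language score (e.g., a CLIP or BLIP2-ITM similarity) is taken as a given number.

```python
def shaped_reward(vl_score, caption, bad_phrases=("an image of",),
                  bad_phrase_penalty=1.0, repetition_penalty=0.5):
    """Combine a vision-language score with penalties (all weights hypothetical)."""
    text = caption.lower()
    words = text.split()
    reward = vl_score
    # Penalize each listed bad phrase that occurs in the caption.
    reward -= bad_phrase_penalty * sum(1 for p in bad_phrases if p in text)
    # Penalize repetition, counted here as the number of repeated bigrams.
    bigrams = list(zip(words, words[1:]))
    reward -= repetition_penalty * (len(bigrams) - len(set(bigrams)))
    return reward


def a2c_losses(log_prob, value, reward):
    """Single-sample advantage actor-critic losses.

    log_prob: log-probability of the sampled caption under the policy.
    value:    value-head estimate of the expected reward (the baseline).
    reward:   shaped reward for the sampled caption.
    """
    advantage = reward - value           # how much better than expected
    policy_loss = -advantage * log_prob  # advantage is treated as a constant
    value_loss = advantage ** 2          # regress the value head onto the reward
    return policy_loss, value_loss
```

In a real training loop the advantage would be detached from the computation graph before the policy loss is computed, and both losses would be averaged over a batch of sampled captions.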
Abstract
In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning and vision-language models like CLIP and BLIP2-ITM as reward models. The RL-tuned model is able to generate longer and more comprehensive descriptions. Our model reaches an impressive 0.90 R@1 CLIP Recall score on the MS-COCO Karpathy test split. Weights are available at https://huggingface.co/sashakunitsyn/vlrm-blip2-opt-2.7b.
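For readers unfamiliar with the metric, the following sketch shows one common way to compute R@1 for caption-to-image retrieval from a similarity matrix. It assumes that caption i was generated for image i and that the similarities (for example, CLIP cosine similarities) are already computed; the function name and the toy matrix are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def recall_at_1(sim):
    """Fraction of captions whose top-scoring image is their own image.

    sim[i][j] is the similarity between caption i and image j.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Index of the image with the highest similarity to caption i.
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)


# Toy example: captions 0 and 1 retrieve their own images, caption 2 does not.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.4],
]
score = recall_at_1(toy_sim)
```

Under this definition, a more detailed caption tends to score higher because it is easier to distinguish its image from the other images in the test set.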
