RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, Manjot Bilkhu

TL;DR

RubiCap is a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations.

Abstract

Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
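
The rubric-to-reward step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how per-criterion judgments are aggregated, so the weighted-fraction formula, the `Criterion` fields, the `rubric_reward` function, and the keyword-matching `toy_judge` (a stand-in for the LLM judge) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not from the paper)."""
    description: str  # what the caption should get right
    weight: float     # assumed importance of this criterion


# A judge decides whether a caption satisfies a single criterion.
Judge = Callable[[str, Criterion], bool]


def rubric_reward(caption: str, rubric: List[Criterion], judge: Judge) -> float:
    """Return the weighted fraction of rubric criteria the caption satisfies.

    Assumed aggregation: a sample-specific rubric yields a reward in [0, 1]
    built from per-criterion verdicts rather than one holistic score.
    """
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in rubric if judge(caption, c))
    return earned / total


def toy_judge(caption: str, criterion: Criterion) -> bool:
    """Stand-in for an LLM judge: a plain case-insensitive substring check."""
    return criterion.description.lower() in caption.lower()


# Hypothetical rubric for a single image.
rubric = [
    Criterion("red umbrella", 2.0),
    Criterion("wet pavement", 1.0),
    Criterion("two pedestrians", 1.0),
]
caption = "A woman holds a red umbrella on wet pavement."
reward = rubric_reward(caption, rubric, toy_judge)
print(reward)  # 0.75: criteria worth 3.0 of 4.0 total weight are satisfied
```

In an RL loop, a score of this kind would be computed for each caption rollout and used as its reward. Keeping the per-criterion verdicts alongside the scalar would retain the interpretability that the rubric is meant to provide.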

Paper Structure (41 sections, 5 equations, 9 figures, 6 tables, 1 algorithm)

This paper contains 41 sections, 5 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of the RubiCap framework. A committee of VLMs first produces diverse candidate captions and consolidates them into a consensus. An LLM rubric writer then diagnoses the student's specific deficiencies and transforms them into fine-grained, interpretable evaluation criteria. An LLM judge applies these rubrics to assess caption rollouts, replacing coarse scalar rewards with structured, multi-dimensional signals that alleviate the verification bottleneck in RL.
  • Figure 2: Evaluation of RubiCap-7B models trained on PixMoCap and DenseFusion datasets. The top row shows results for PixMoCap, and the bottom row for DenseFusion. For each setting: Left: CapArena win rates against the base model across training steps, demonstrating consistent self-improvement that is stronger than that of the SFT variants and RL baselines. Middle: CapArena win rates against high-quality human-expert annotations (PixMoCap) and GPT-4V-augmented captions (DenseFusion), showing that RubiCap surpasses professional and proprietary labeling systems. Right: Radar plots over 10 VLM benchmarks, showing that RubiCap preserves pretrained capabilities better than supervised distillation. Average performance scores are provided in the boxes.
  • Figure 3: Performance of RubiCap-3B across PixMoCap and DenseFusion settings. Left: Results under PixMoCap. Right: Results under DenseFusion. In both settings, RubiCap-3B achieves the highest CapArena win rates among all compared methods.
  • Figure 4: Left: Rank Distribution per Model. Using the PixMoCap setting as an example, RubiCap achieves a substantially higher proportion of rank-1 assignments, demonstrating consistently preferred captions despite its smaller scale. Right: Sub-metric Breakdown per Model. Scores across four evaluation dimensions show that RubiCap achieves a lower hallucination penalty and stronger accuracy and clarity, while matching the 72B model overall.
  • Figure 5: Left: Comparison with CapRL-3B in CapArena Evaluation. RubiCap variants achieve higher win rates across both 3B and 2B model sizes while using 25% less training data. Right: Comparison between rubric-augmented SFT and RubiCap across model scales. Even when provided identical rubrics, RubiCap consistently achieves higher win rates.
  • ...and 4 more figures