
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou, Anxiang Zeng, Jianqiang Huang

TL;DR

CCCaption is introduced: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria.

Abstract

Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.
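The two rewards described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the function names, the keyword-matching judge, the lookup-based verifier, and the equal-weight combination are hypothetical stand-ins, since the abstract does not specify how the LVLM judges are called or how the two rewards are weighted. In the sketch, completeness is the fraction of visual queries the caption answers, and correctness is the fraction of sub-caption claims that are verified against the image.

```python
from typing import Callable, Sequence


def completeness_reward(
    caption: str,
    queries: Sequence[str],
    answers: Callable[[str, str], bool],
) -> float:
    """Fraction of visual queries that the caption answers."""
    if not queries:
        return 0.0
    return sum(1 for q in queries if answers(caption, q)) / len(queries)


def correctness_reward(
    sub_claims: Sequence[str],
    is_supported: Callable[[str], bool],
) -> float:
    """Fraction of sub-caption claims verified as true for the image."""
    if not sub_claims:
        # Assumption: an empty caption earns no correctness credit.
        return 0.0
    return sum(1 for c in sub_claims if is_supported(c)) / len(sub_claims)


def dual_reward(completeness: float, correctness: float, weight: float = 0.5) -> float:
    """Assumed symmetric combination: equal weights by default."""
    return weight * completeness + (1.0 - weight) * correctness


# Toy stand-ins for the LVLM judges (hypothetical, for illustration only).
def keyword_judge(caption: str, query: str) -> bool:
    return query in caption.lower()


caption = "A dog catches a red frisbee on the grass."
queries = ["dog", "frisbee", "grass", "tree"]
sub_claims = [
    "there is a dog",
    "the frisbee is red",
    "the dog is on grass",
    "the sky is cloudy",
]
verified = {"there is a dog", "the dog is on grass"}

c = completeness_reward(caption, queries, keyword_judge)
r = correctness_reward(sub_claims, lambda claim: claim in verified)
print(c, r, dual_reward(c, r))  # 0.75 0.5 0.625
```

The toy caption answers three of four queries but has only two of four claims verified, so a longer caption that adds unverified detail would raise the first term while lowering the second. That tension is what a joint objective of this kind is meant to balance.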

Paper Structure (12 sections, 8 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: (a) Comparison of incomplete and incorrect captions generated by MLLMs: an incomplete caption with missing information versus an incorrect caption with hallucinated details. (b) Reward and correctness curves during model training, illustrating the progression of reward, correctness (as defined in the Method section), and caption length across training steps.
  • Figure 2: Illustrations of our Complete and Correct Captioning (CCCaption) framework and Complete Caption dataset (CCaption) generation pipeline. (a) The CCCaption reinforcement learning framework. The green and red parts represent the processes for computing the correctness and completeness rewards, respectively. We combine these two rewards using the GRPO algorithm [guo2025deepseek] to train the captioning model with reinforcement learning, while employing the dynamic query sampling strategy to improve training efficiency. (b) The CCaption dataset generation pipeline. By iterating through the entire process, a diverse, complete query set is generated, ensuring thorough coverage of all the image information.
  • Figure 3: Case analysis across different captioning models. From left to right, the captioning models are Qwen3-VL-2B [qwen3], CapRL-3B [caprl], Qwen3-VL-32B [qwen3], and CCCaption-2B (Ours), with image caption outputs for different queries labeled for hallucination, forgetfulness, and nitpicking. Both cases are derived from the MMBench dataset [mmbench].
  • Figure 4: Scatter plots of query and image embeddings across different datasets. "Vanilla" refers to queries generated by a single MLLM, using the CapRL method [caprl], while "CCaption-44k" represents our approach, which utilizes multiple MLLMs for generation and employs a diversity metric to measure query diversity. The embedding model used is Ops-MM-embedding [ops_mm_embedding_v1_7B], with dimensionality reduction performed using t-SNE [maaten2008visualizing].
  • Figure 5: Performance with and without the dynamic query sampling strategy during training. "w Dynamic" denotes the use of the strategy, while "w/o Dynamic" indicates its absence. "Acc." refers to the accuracy under the Prism evaluation [qiao2024prism].
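The Figure 2 caption states that the two rewards are combined using GRPO. As a rough illustration of what that involves, the sketch below shows the group-relative normalization step commonly associated with GRPO: several captions are sampled for the same image, and each caption's reward is standardized against the group's mean and standard deviation to obtain its advantage. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for this sketch and are not taken from the paper.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against its sampling group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    # Assumption: population standard deviation; implementations differ here.
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical combined rewards for four captions sampled for one image.
group_rewards = [0.625, 0.5, 0.875, 0.5]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])
```

Captions scoring above the group mean receive positive advantages and those below receive negative ones, so the policy update depends on relative quality within the group rather than on the absolute reward scale.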