RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun

TL;DR

RICO tackles hallucination and incompleteness in image recaptioning by introducing visual reconstruction to expose cross-modal misalignments and guide caption refinement. The methodology alternates between reconstructing a caption into a reference image and refining the caption based on discrepancies, enabling more faithful and comprehensive descriptions. An efficient variant, RICO-Flash, leverages Direct Preference Optimization to mimic iterative refinements in a single end-to-end model, reducing computational cost. Empirical results show substantial improvements over baselines on CapsBench and CompreCap, with strong generalization across initial captions and prompts, and improved downstream text-to-image fidelity. The work advances semantic alignment between images and captions, with practical implications for high-quality synthetic data for multimodal tasks.
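The alternation between reconstruction and refinement described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `reconstruct` stands in for a text-to-image model, `revise` stands in for the MLLM reviser, and the names, the default iteration count, and the early stop when the caption no longer changes are all hypothetical. The toy stand-ins model an image as a set of visible facts so the loop can run without any model.

```python
from typing import Callable, List, Set


def refine_caption(
    image,
    initial_caption: str,
    reconstruct: Callable,  # hypothetical: caption -> reconstructed image
    revise: Callable,       # hypothetical: (image, reconstruction, caption) -> caption
    num_iterations: int = 2,  # assumed default, not taken from the paper
) -> List[str]:
    """Return the caption history: the initial caption, then each refinement."""
    captions = [initial_caption]
    for _ in range(num_iterations):
        current = captions[-1]
        # Step 1: turn the current caption back into an image.
        reconstructed = reconstruct(current)
        # Step 2: compare original and reconstruction, then rewrite the caption.
        revised = revise(image, reconstructed, current)
        if revised == current:
            # Assumption: stop early once the reviser changes nothing.
            break
        captions.append(revised)
    return captions


# Toy stand-ins: an "image" is a set of facts, a caption is a string of words.
def toy_reconstruct(caption: str) -> Set[str]:
    return set(caption.split())


def toy_revise(original: Set[str], reconstructed: Set[str], caption: str) -> str:
    # Drop words the original image does not support (hallucinations).
    kept = [word for word in caption.split() if word in original]
    # Add one detail that the reconstruction failed to show (incompleteness).
    missing = sorted(original - reconstructed)
    return " ".join(kept + missing[:1])


history = refine_caption(
    image={"two", "red", "buses"},
    initial_caption="three buses",
    reconstruct=toy_reconstruct,
    revise=toy_revise,
    num_iterations=5,
)
print(history)
```

In this toy run the unsupported word is removed in the first pass and one missing detail is added per pass, which mirrors the two failure modes the method targets: hallucination and incompleteness.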

Abstract

Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using Direct Preference Optimization (DPO). Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code is released at https://github.com/wangyuchi369/RICO.
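For readers unfamiliar with DPO, the standard per-example DPO loss can be written in a few lines. This is a generic sketch of the published DPO objective, not code from the RICO repository. The pairing shown here (a RICO-refined caption as the preferred response, the initial caption as the rejected one) is an assumption made for illustration, as are the function name and the value of `beta`. Inputs are sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,    # log pi_theta(chosen caption | image)
    policy_logp_rejected: float,  # log pi_theta(rejected caption | image)
    ref_logp_chosen: float,       # log pi_ref(chosen caption | image)
    ref_logp_rejected: float,     # log pi_ref(rejected caption | image)
    beta: float = 0.1,            # assumed value; controls deviation from the reference
) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed pairing for illustration: chosen = refined caption, rejected = initial caption.
loss_untrained = dpo_loss(-40.0, -35.0, -40.0, -35.0)  # policy equals reference
loss_improved = dpo_loss(-30.0, -45.0, -40.0, -35.0)   # policy now favors the chosen caption
print(loss_untrained, loss_improved)
```

When the policy matches the reference the margin is zero and the loss equals log 2; the loss falls as the policy raises the relative likelihood of the preferred caption.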

Paper Structure

This paper contains 35 sections, 6 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Analysis of image captions generated by Qwen2-VL and its recaptioned variants. Despite the advanced capabilities of Qwen2-VL, the generated captions still contain incorrect or ambiguous information (for example, misidentifying the number of buses), a mistake that remains uncorrected even by GPT-4o. Furthermore, both GPT-4o and human-generated recaptions often overlook fine-grained details, such as attributes and spatial relationships, which are accurately captured by our model. By reconstructing images from captions, it becomes evident that our model better preserves such details, resulting in reconstructions that more closely resemble the original image.
  • Figure 2: Illustration of the motivation for introducing the visual reconstruction mechanism. Conventional recaptioning methods typically map images directly to text without explicitly aligning the semantic spaces of the two modalities, often leading to information loss in the generated captions. In contrast, our approach incorporates visual reconstruction to make this loss more observable. By identifying discrepancies between the original and reconstructed images through the reviser, we refine the caption to produce a more semantically aligned and comprehensive description.
  • Figure 3: Illustration of the iterative process of RICO. After the initial captioning step, a reconstruction procedure is applied to generate an image from the candidate caption. The caption is then refined by comparing the original image with the reconstructed image.
  • Figure 4: Performance of the RICO pipeline under different numbers of refinement iterations.
  • Figure 5: An example demonstrating the iterative refinement process performed by our model, where red text indicates added or corrected information.
  • ...and 1 more figure