CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

Marios Krestenitis, Christos Tzelepis, Konstantinos Ioannidis, Stefanos Vrochidis, Ioannis Kompatsiaris, Georgios Tzimiropoulos, Shaogang Gong, Ioannis Patras

Abstract

Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this either via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme that improves image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on the fly. Unlike previous work that uses a cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions. Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
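The reward loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the image is stood in for by a toy feature vector, `toy_generate` is a hypothetical lookup that replaces the frozen text-to-image model, cosine similarity is an assumed choice (the abstract does not name the similarity measure), and the advantage normalization follows the standard GRPO formulation (reward minus group mean, divided by group standard deviation), which may differ in detail from what CycleCap uses.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (assumed reward metric)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cycle_rewards(
    image: Vector,
    captions: Sequence[str],
    generate: Callable[[str], Vector],
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
) -> List[float]:
    """Reward each caption by how well its reconstruction matches the image."""
    return [similarity(image, generate(caption)) for caption in captions]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style normalization of rewards within one group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical stand-in for the frozen text-to-image model: each caption
# maps to the feature vector of the image it would "reconstruct".
TOY_RECONSTRUCTIONS = {
    "a red ball on grass": [1.0, 0.0],  # faithful caption
    "a ball": [1.0, 1.0],               # generic caption
    "a dog": [0.0, 1.0],                # hallucinated caption
}


def toy_generate(caption: str) -> Vector:
    return TOY_RECONSTRUCTIONS[caption]


if __name__ == "__main__":
    original = [1.0, 0.0]
    captions = list(TOY_RECONSTRUCTIONS)
    rewards = cycle_rewards(original, captions, toy_generate)
    advantages = group_relative_advantages(rewards)
    for caption, r, a in zip(captions, rewards, advantages):
        print(f"{caption!r}: reward={r:.3f}, advantage={a:+.3f}")
```

In this toy group the faithful caption receives the highest advantage and the hallucinated one the lowest, which is the direction in which a GRPO update would shift the captioner. No ground-truth caption enters the computation, which is what makes the signal self-supervised.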

Paper Structure (27 sections, 3 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Overview of the CycleCap framework. The Visual-Language Model generates multiple captions $\{y_i\}_{i=1}^{n}$ for an image $x$. A frozen Image Generation Model $G$ reconstructs an image $x'_i = G(y_i)$ from each caption, and the similarity between $x$ and $x'_i$ gives the cycle consistency reward $R_i$, $i=1,\ldots,n$. These rewards guide fine-tuning of the Visual-Language Model via GRPO, encouraging captions that better reflect the visual content in a self-supervised manner.
  • Figure 2: Win rates (%) on the CompreCap [lu2025benchmarking] benchmark of (a) CycleCap fine-tuned models versus their corresponding baselines and (b) Qwen2-VL-7B [wang2024qwen2] fine-tuned with different methods versus the baseline.
  • Figure 3: Qualitative comparison of captions generated by the baseline Qwen2-VL-7B [wang2024qwen2] and by the proposed CycleCap on samples from the CapsBench [liu2024playground] dataset. Newly added information is shown in bold. Our method produces more detailed, better organized, and more accurate descriptions than the baseline.
  • Figure 4: Qualitative comparison of image reconstructions from captions generated by SOTA methods and by our method (CycleCap), applied to Qwen2-VL-7B. The model fine-tuned with CycleCap captures more structural details and object attributes, leading to reconstructions closer to the original image. The generated captions are provided in the Appendix.
  • Figure A1: The designed prompt of the CycleCap framework used to generate image captions during training.
  • ...and 4 more figures