Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

Hyojin Bahng, Caroline Chan, Fredo Durand, Phillip Isola

TL;DR

The paper introduces CycleReward, a self-supervised reward model for image-text alignment learned from cycle-consistency signals without human labels. By mapping text back into image space (and vice versa), it derives pairwise preferences, collected in CyclePrefDB (866K comparisons), that train a fast, differentiable reward model. CycleReward achieves state-of-the-art results on detailed captioning and competitive performance on text-to-image generation, while enabling effective Best-of-N sampling and direct preference optimization (DPO) without human supervision. The work demonstrates that cycle-consistency signals scale to dense, long-form multimodal descriptions and can improve a broad range of vision-language and diffusion-based generation tasks.

Abstract

Measuring alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset, CycleReward, outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling, while maintaining speed and differentiability. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are publicly released at https://cyclereward.github.io.
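
The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes a generic text-to-image function and an image similarity function are supplied by the caller; the names `cycle_preference_pairs`, `toy_text_to_image`, and `toy_similarity` are hypothetical and not taken from the released code. The toy stand-ins treat an "image" as a set of words so the example runs without any model.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def cycle_preference_pairs(
    image,
    captions: List[str],
    text_to_image: Callable,
    image_similarity: Callable,
) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) caption pairs ranked by cycle consistency."""
    # Cycle consistency score: similarity between the original image and
    # the reconstruction generated from each candidate caption.
    scores = [image_similarity(image, text_to_image(c)) for c in captions]
    pairs = []
    for i, j in combinations(range(len(captions)), 2):
        if scores[i] == scores[j]:
            continue  # assumption of this sketch: ties give no preference
        if scores[i] > scores[j]:
            pairs.append((captions[i], captions[j]))
        else:
            pairs.append((captions[j], captions[i]))
    return pairs


# Toy stand-ins (assumptions for illustration only): an "image" is a set of
# words, "generation" is tokenization, and similarity is Jaccard overlap.
def toy_text_to_image(caption: str) -> frozenset:
    return frozenset(caption.lower().split())


def toy_similarity(a: frozenset, b: frozenset) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


image = frozenset({"a", "brown", "dog", "on", "grass"})
captions = ["a dog", "a brown dog on grass", "a cat"]
pairs = cycle_preference_pairs(image, captions, toy_text_to_image, toy_similarity)
for preferred, rejected in pairs:
    print(f"preferred: {preferred!r}  rejected: {rejected!r}")
```

In this toy setup the most detailed caption reconstructs the "image" best, so it is preferred over both shorter candidates. The text-to-image direction would follow the same pattern with a captioning model and a text similarity function swapped in.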

Paper Structure

This paper contains 51 sections, 11 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Method overview. (1) Given an input image $x$, we generate multiple candidate captions $F_i(x)$, $F_j(x)$ using different captioning models. Each caption is mapped back to the image domain via a text-to-image model $G$, and compared against the original image. Captions whose reconstructions $G(F(x))$ are more similar to the original image are preferred; those with low similarity are rejected. (2) These comparison pairs are used to train a reward model, which learns to assign higher scores to preferred captions. We apply the same process for text-to-image generation. (3) At test time, the trained reward model outputs alignment scores for arbitrary image-text pairs.
  • Figure 2: What do cycle consistency preferences look like? We visualize comparison pairs from our dataset, where cycle consistency determines preferences. Preferred samples are in blue and rejected samples are in red. Image-to-text generation (left): The preferred caption provides a fine-grained description resulting in a faithful reconstruction of the original image, whereas the rejected caption is short and vague, producing a reconstruction far from the original image. Text-to-image generation (right): Images that capture fine-grained details of the input prompt produce better text reconstructions, resulting in higher cycle consistency. See the appendix for more examples.
  • Figure 3: Best-of-$N$ relative performance gain. From left to right: LLaVA-W, DeCapBench, T2I-CompBench (mean of 6 categories), and PartiPrompts (complex). In each plot, we show the relative performance gain from BoN sampling with different metrics. Feedback from our reward model leads to the greatest overall improvement for detailed captioning tasks, while we maintain competitive text-to-image generation performance with VQAScore and ImageReward.
  • Figure 4: DPO results using CyclePrefDB. (Left) Using CyclePrefDB-I2T for DPO improves Qwen-VL-Chat, yielding denser captions that describe fine-grained details of the input image. (Right) Using CyclePrefDB-T2I for Diffusion DPO improves Stable Diffusion 1.5, producing images that better capture the details in the input prompt.
  • Figure 5: Raw cycle consistency performance with increasing number of samples. We plot DetailCaps-4870 benchmark performance (Pairwise Accuracy) for raw cycle consistency calculated over multiple samples (random seed sampling). Despite the increasing number of seeds, raw cycle consistency performance does not come close to reward model performance.
  • ...and 11 more figures
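
Steps (2) and (3) of Figure 1 and the Best-of-$N$ setup of Figure 3 can be sketched as follows. The captions above do not state the training objective, so the pairwise logistic (Bradley-Terry style) loss here is an assumption, chosen because it is a common way to make a reward model score preferred samples higher. The function names and the word-overlap reward are hypothetical stand-ins, not the CycleReward model.

```python
import math
from typing import Callable, Sequence


def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Assumed objective: -log(sigmoid(r_preferred - r_rejected))."""
    margin = r_preferred - r_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def best_of_n(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward score."""
    return max(candidates, key=reward)


# Toy reward (assumption for illustration): word overlap with a reference
# description stands in for a learned image-text alignment score.
reference = {"a", "brown", "dog", "on", "grass"}


def toy_reward(caption: str) -> float:
    return float(len(reference & set(caption.lower().split())))


candidates = ["a dog", "a cat", "a brown dog on grass"]
best = best_of_n(candidates, toy_reward)
print("selected:", best)

# The loss is small when the preferred sample already scores higher,
# and large when the ordering is reversed.
print(pairwise_preference_loss(toy_reward(best), toy_reward("a cat")))
print(pairwise_preference_loss(toy_reward("a cat"), toy_reward(best)))
```

Because selection only needs a scalar score per candidate, any alignment metric can be dropped in as `reward`, which is how the different metrics in Figure 3 can be compared under the same sampling budget.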