Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

Hyojin Bahng; Caroline Chan; Fredo Durand; Phillip Isola

TL;DR

The paper introduces CycleReward, a reward model for image-text alignment trained on cycle-consistency signals instead of human preference labels. For captioning, generated text is mapped back into image space and compared with the original image; for text-to-image generation, the generated image is mapped back into text and compared with the input caption. Ranking candidates by this score yields CyclePrefDB, a dataset of 866K pairwise comparisons, which trains a fast, differentiable reward model. CycleReward outperforms state-of-the-art alignment metrics on detailed captioning and is competitive on text-to-image generation. It also serves as a verifier for Best-of-N sampling, and the dataset supports direct preference optimization (DPO) and Diffusion DPO without human supervision. The results suggest that cycle-consistency signals scale to dense, long-form multimodal descriptions and can improve a range of vision-language and diffusion-based generation tasks.

Abstract

Measuring alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset, CycleReward, outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling, while maintaining speed and differentiability. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are publicly released at https://cyclereward.github.io.
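The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cycle_scores`, `preference_pairs`), the use of cosine similarity, the all-pairs construction, and the toy lookup table standing in for a text-to-image model are all assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import Any, Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cycle_scores(
    source: Any,
    candidates: Sequence[Any],
    reconstruct: Callable[[Any], Any],
    embed: Callable[[Any], Sequence[float]],
) -> List[float]:
    """Score each candidate by how well its reconstruction matches the source.

    `reconstruct` is a hypothetical stand-in for the backward model
    (e.g. text-to-image when the candidates are captions), and `embed`
    is a hypothetical feature extractor for the source modality.
    """
    source_vec = embed(source)
    return [cosine_similarity(source_vec, embed(reconstruct(c))) for c in candidates]


def preference_pairs(
    candidates: Sequence[Any], scores: Sequence[float]
) -> List[Tuple[Any, Any]]:
    """Turn scores into (preferred, rejected) pairs; ties are skipped.

    Comparing every pair of candidates is an assumption of this sketch.
    """
    pairs = []
    for i, j in combinations(range(len(candidates)), 2):
        if scores[i] == scores[j]:
            continue
        win, lose = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append((candidates[win], candidates[lose]))
    return pairs


# Toy data: the "image" is a 2-D feature vector, and a lookup table
# plays the role of a text-to-image model (purely illustrative).
image = [1.0, 0.0]
fake_text_to_image = {
    "detailed caption": [0.9, 0.1],
    "vague caption": [0.5, 0.5],
    "wrong caption": [0.0, 1.0],
}
captions = list(fake_text_to_image)

scores = cycle_scores(image, captions, fake_text_to_image.__getitem__, lambda v: v)
pairs = preference_pairs(captions, scores)
```

In this toy setup the caption whose reconstruction lies closest to the original image is preferred in every pair it appears in. The same two functions cover the text-to-image direction if `reconstruct` is swapped for a captioning model and `embed` for a text encoder.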
