
R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao

Abstract

Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce R-C$^2$, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.

Paper Structure

This paper contains 17 sections, 6 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: Gap in Multimodal Reasoning. Multimodal large language models (MLLMs) frequently fail the test of modal-invariance. For example, they produce conflicting answers for the same webpage when presented as a screenshot versus its raw HTML source. We introduce a cycle-consistency framework that directly targets this modality gap, leveraging the inconsistency itself as a signal to jointly improve reasoning and alignment.
  • Figure 2: Failure of multimodal voting. Left: Consistent Conflict — both text and image modalities produce self-consistent predictions (mode-stable) but disagree with each other, and only one modality aligns with the ground truth. Right: Unstable Recovery — within a single modality, some rollouts yield the correct answer, but the majority vote remains wrong, reflecting intra-modal instability. Using multimodal voting can amplify biases or lose correct signals.
  • Figure 3: Overview of multimodal cycle consistency. Starting from a potential answer candidate $a_{\text{orig}}$, the model performs backward inference to reconstruct two latent queries, $\hat{q}_T$ from the text view $x_T$ and $\hat{q}_I$ from the image view $x_I$. Each reconstructed query is then used for forward inference across both modalities, resulting in four reconstructed answers $\{a_{tt}, a_{ti}, a_{it}, a_{ii}\}$ generated via the paths $T{\rightarrow}T$, $T{\rightarrow}I$, $I{\rightarrow}T$, and $I{\rightarrow}I$. Cycle consistency is measured by whether the reconstructed answers remain consistent with the original $a_{\text{orig}}$, forming a full 4-way cross-modal reasoning cycle.
  • Figure 4: Examples of the backward-inference (Answer$\to$Query) step. Given a Candidate Answer, the model generates distinct, semantically-grounded queries for both the text and image modalities. This demonstrates the viability of the first step of our cycle-consistency reward, enabling the model to check its answer in the alternate modality.
  • Figure 5: Visual comparison among the base model, voting baseline, and R-C$^2$ (ours). For each example, we display the original multimodal observation, the query, and the predictions from text and image modalities under different methods. The base model and voting baseline often yield either conflicting answers across modalities or spurious agreements on incorrect predictions, reflecting the modality gap and unreliable consensus confidence. In contrast, R-C$^2$ enforces cycle-based alignment, producing answers that are both correct and consistent across modalities.
  • ...and 4 more figures
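The 4-way cycle described in the Figure 3 caption can be sketched in code. The snippet below is a minimal illustration, not the paper's implementation: `backward_infer` and `forward_infer` are hypothetical stand-ins for the model's Answer→Query and Query→Answer passes (here replaced by toy stubs so the reward logic runs end to end), and the reward is taken as the fraction of the four reconstruction paths that reproduce the original answer.

```python
# Toy sketch of the 4-way cycle-consistency reward (paths T->T, T->I, I->T, I->I).
# backward_infer / forward_infer are hypothetical stand-ins for the model's
# backward (Answer -> Query) and forward (Query -> Answer) inference passes.

def backward_infer(answer, view):
    # Answer -> Query: reconstruct a latent query in modality "T" or "I".
    # Toy stub: embed the answer in the query string.
    return f"query_from_{view}({answer})"

def forward_infer(query, view):
    # Query -> Answer in the given modality.
    # Toy stub: recover the answer embedded by backward_infer.
    return query.split("(")[1].rstrip(")")

def cycle_consistency_reward(a_orig):
    # Backward inference: one reconstructed query per modality.
    q_hat = {m: backward_infer(a_orig, m) for m in ("T", "I")}
    # Forward inference: four reconstructed answers across both modalities.
    recon = {
        (src, dst): forward_infer(q_hat[src], dst)
        for src in ("T", "I")
        for dst in ("T", "I")
    }
    # Dense, label-free reward: fraction of paths consistent with a_orig.
    matches = sum(a == a_orig for a in recon.values())
    return matches / len(recon)
```

With the perfect toy stubs every path closes the cycle, so the reward is 1.0; in training, disagreement along any of the four paths lowers the reward and penalizes cross-modal inconsistency.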