CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang

TL;DR

CLIPO incorporates a Contrastive Learning mechanism into Policy Optimization to generalize the RLVR process, providing a more robust cross-trajectory regularization than the original single-path supervision in RLVR.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR relies solely on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.

Paper Structure (45 sections, 15 equations, 6 figures, 19 tables)

Figures (6)

  • Figure 1: Intuition of CLIPO. Standard RLVR relies only on outcomes, neglecting the quality of intermediate reasoning steps. CLIPO addresses this limitation by maximizing similarity between successful reasoning trajectories. By aligning multiple positive rollouts, CLIPO identifies the invariant reasoning structure, i.e., the “overlap” of successful paths, implicitly eliminating incorrect and hallucinatory reasoning steps.
  • Figure 2: Framework of CLIPO. For each input prompt ${\bm{x}}$, a policy optimization method generates a group of rollouts $\{{\bm{y}}_1,{\bm{y}}_2,\dots,{\bm{y}}_G\}$ and then computes the corresponding RLVR rewards $\{r_1,r_2,\dots,r_G\}$. CLIPO applies a contrastive head on top of the last hidden states $\{{\bm{h}}_1,{\bm{h}}_2,\dots,{\bm{h}}_G\}$ of the rollout group and outputs trajectory-level semantic embeddings $\{{\bm{e}}_1,{\bm{e}}_2,\dots,{\bm{e}}_G\}$. The contrastive rewards $\{r_1^\text{CL},r_2^\text{CL},\dots,r_G^\text{CL}\}$ are computed across the semantic embedding group to measure the similarity of successful and failed trajectories. The final CLIPO reward for the $i$-th rollout is $r'_i=r_i+ r^\text{CL}_i$.
  • Figure 3: Performance Gain Across Different Losses.
  • Figure 4: Comparison among contrastive loss variants. Here, $s_{ia}$ denotes the similarity between $i$ and $a$, while $P$ is the set of positive examples for $i$ and $p^*$ is a single sampled positive example.
  • Figure 5: The t-SNE visualization of semantic embeddings produced by the contrastive head at the start of training (left) and after $3$ epochs of training (right). Green points represent embeddings from correct rollouts, while red points correspond to incorrect ones. After training, correct responses cluster closely together, forming more distinct group clusters. Within these clusters, correct and incorrect responses also exhibit some separation.
  • ...and 1 more figure
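
The reward composition described in the Figure 2 caption, $r'_i=r_i+r^\text{CL}_i$, can be illustrated with a minimal sketch. The exact form of $r^\text{CL}_i$ is not given in this summary, so the code below assumes a SupCon-style (supervised contrastive) term: for each successful rollout, the mean log-probability of its fellow successful rollouts under a softmax over cosine similarities within the group. The names `contrastive_rewards`, `clipo_rewards`, `tau`, and `weight` are hypothetical. The embeddings are passed in directly rather than produced by a contrastive head, and giving a contrastive term only to successful rollouts is also an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def contrastive_rewards(embeddings, rewards, tau=0.5, weight=0.1):
    """Assumed SupCon-style contrastive reward for one rollout group.

    embeddings: trajectory-level embeddings e_1..e_G (lists of floats)
    rewards:    RLVR outcome rewards r_1..r_G (positive means success)
    tau:        softmax temperature (hypothetical hyperparameter)
    weight:     scale of the contrastive term (hypothetical hyperparameter)
    """
    group = range(len(embeddings))
    positives = [i for i in group if rewards[i] > 0]
    r_cl = [0.0] * len(embeddings)
    if len(positives) < 2:
        return r_cl  # no pair of successful rollouts to contrast
    for i in positives:
        # Scaled similarity s_ia / tau to every other rollout a in the group.
        logits = {a: cosine(embeddings[i], embeddings[a]) / tau
                  for a in group if a != i}
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(s - top) for s in logits.values()))
        # Mean log-probability assigned to the other successful rollouts (<= 0).
        log_probs = [logits[p] - log_denom for p in positives if p != i]
        r_cl[i] = weight * sum(log_probs) / len(log_probs)
    return r_cl


def clipo_rewards(embeddings, rewards, tau=0.5, weight=0.1):
    """Final reward r'_i = r_i + r^CL_i, as in the Figure 2 caption."""
    r_cl = contrastive_rewards(embeddings, rewards, tau=tau, weight=weight)
    return [r + c for r, c in zip(rewards, r_cl)]


if __name__ == "__main__":
    # Toy group: two successful rollouts and one failed rollout.
    aligned = clipo_rewards([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]], [1.0, 1.0, 0.0])
    scattered = clipo_rewards([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], [1.0, 1.0, 0.0])
    print(aligned, scattered)
```

Under these assumptions, a successful rollout whose embedding matches the other successful rollouts keeps a final reward close to its outcome reward. A successful rollout whose embedding is far from the other successes is penalized more, which is one way to read the cross-trajectory regularization described in the abstract.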