GRACE: Generative Representation Learning via Contrastive Policy Optimization

Jiashuo Sun, Shixuan Liu, Zhaochen Su, Xianrui Zhong, Pengcheng Jiang, Bowen Jin, Peiran Li, Weijia Shi, Jiawei Han

TL;DR

GRACE reframes contrastive learning for text representations by turning contrastive signals into rewards that guide a rationale-generating policy. The LLM becomes an interpretable agent that outputs explicit rationales, which are pooled to form embeddings, yielding both strong semantic representations and inspectable reasoning. Empirically, GRACE improves results on the MTEB benchmark across multiple backbones in both supervised and unsupervised settings while preserving general-domain capabilities, unlike naive contrastive fine-tuning. The work bridges generation and representation learning with a scalable framework that produces interpretable embeddings and robust downstream performance. Code, data, and models are released to support reproducibility and further research.

Abstract

Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black-box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized, but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy that produces explicit, human-interpretable rationales, that is, structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query-positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On the MTEB benchmark, GRACE yields broad cross-category gains: averaged over four backbones, the supervised setting improves overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent rationales. The model, data, and code are available at https://github.com/GasolSun36/GRACE.
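To make the abstract's pipeline concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a rationale has already been generated and encoded into per-token vectors, mean-pools them into an embedding, and scores a query against one positive and several negatives. The function names (`mean_pool`, `cosine`, `contrastive_reward`) are hypothetical, and the single-term reward (positive similarity minus average negative similarity) is a simplified stand-in for the multi-component reward the paper describes.

```python
import math


def mean_pool(token_vectors):
    """Average equal-length token vectors into a single embedding."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_reward(query_emb, positive_emb, negative_embs):
    """Assumed reward shape: high when the query is close to the positive
    and far from the negatives. Not the paper's exact formula."""
    pos_sim = cosine(query_emb, positive_emb)
    neg_sim = sum(cosine(query_emb, neg) for neg in negative_embs) / len(negative_embs)
    return pos_sim - neg_sim


# Toy per-token vectors standing in for encoded rationales.
query = mean_pool([[1.0, 0.0], [1.0, 0.0]])
positive = mean_pool([[2.0, 0.0], [4.0, 0.0]])
negative = mean_pool([[0.0, 1.0], [0.0, 3.0]])

reward = contrastive_reward(query, positive, [negative])
print(reward)  # 1.0: identical direction to the positive, orthogonal to the negative
```

In a policy gradient setup, a scalar like `reward` would be attached to the sampled rationales rather than backpropagated as a loss, which is the distinction the abstract draws between rewards and contrastive losses.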

Paper Structure

This paper contains 48 sections, 28 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 2: Comparison of standard contrastive learning (top) and our RL-based method (bottom). Given a query $q$ with positive ($D^+$) and negative ($D^-$) documents, our policy model generates rationales for $q$, $D^+$, and $D^-$, concatenates them to obtain the final representation, and is optimized with rewards that increase the similarity between $q$ and $D^+$ while decreasing the similarity between $q$ and $D^-$.
  • Figure 3: Reward function ablation study for Grace-3B showing performance across different combinations of the consistency weight ($\lambda_1$) and the hard negative mining weight ($\lambda_2$). Left: supervised training; Right: unsupervised training. The heat intensity indicates performance levels, with darker red representing higher scores.
  • Figure 4: Efficiency comparison of different embedding approaches.
  • Figure 5: Training progression of Grace models. Left: accuracy on subtasks steadily improves with more training steps. Right: response length also increases, reflecting enhanced information density and richer reasoning chains.
  • Figure 6: Comparison of different token representation methods across Grace model variants. Mean pooling from either the last layer (LL) or the penultimate layer (PL) consistently outperforms the EOS token and max pooling approaches in both supervised and unsupervised settings.
  • ...and 1 more figure
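The token representation methods compared in Figure 6 can be illustrated with a small sketch. This is an illustration under stated assumptions rather than the paper's code: `layers` is a hypothetical list of per-layer hidden states (each a list of token vectors), the EOS token is assumed to sit at the final sequence position, and the helper name `pool` is invented for this example.

```python
def pool(layers, strategy="mean", layer="LL"):
    """Reduce per-token hidden states to one vector.

    layers:   list of layers, each a list of equal-length token vectors
    strategy: "mean", "max", or "eos"
    layer:    "LL" (last layer) or "PL" (penultimate layer)
    """
    hidden = layers[-1] if layer == "LL" else layers[-2]
    dim = len(hidden[0])
    if strategy == "mean":
        return [sum(tok[i] for tok in hidden) / len(hidden) for i in range(dim)]
    if strategy == "max":
        return [max(tok[i] for tok in hidden) for i in range(dim)]
    if strategy == "eos":
        # Assumption: the EOS token is the last position in the sequence.
        return list(hidden[-1])
    raise ValueError(f"unknown strategy: {strategy}")


# Toy hidden states: three layers, two tokens, two dimensions.
layers = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0]],  # penultimate layer (PL)
    [[2.0, 8.0], [4.0, 6.0]],  # last layer (LL)
]

print(pool(layers, "mean", "LL"))  # [3.0, 7.0]
print(pool(layers, "mean", "PL"))  # [2.0, 3.0]
print(pool(layers, "max", "LL"))   # [4.0, 8.0]
print(pool(layers, "eos", "LL"))   # [4.0, 6.0]
```

Mean pooling uses every token of the generated rationale, whereas the EOS and max variants each keep only part of the signal. That is one plausible reading of why the figure reports mean pooling ahead, though the caption itself does not state a cause.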