Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement

Yu-Che Tsai, Kuan-Yu Chen, Yuan-Chi Li, Yuan-Hao Chen, Ching-Yu Tsai, Shou-De Lin

TL;DR

The paper addresses a limitation of encoder-only LLM embeddings by introducing GIRCSE, a generative embedding framework that uses autoregressive soft-token generation to progressively refine semantic representations. An Iterative Contrastive Refinement (ICR) objective supervises each generation step with a stepwise contrastive loss and a refinement regularization term, enabling end-to-end differentiable training. Empirically, GIRCSE achieves strong performance on MTEB and instruction-following benchmarks with only 0.2M training samples and exhibits test-time scaling: generating more refinement tokens at inference improves embedding quality. The approach offers a new paradigm in which generation drives representation learning, balancing generic tasks and instruction following while staying efficient through differentiable soft-token generation and caching. In practice, this points toward scalable embeddings that capture richer, instruction-aware semantics.

Abstract

Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under a contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield better representations. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the MTEB benchmark and instruction-following tasks. Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.
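
To make the two ideas in the abstract concrete, here is a minimal NumPy sketch of what a soft token and a stepwise contrastive objective could look like. It is an illustration under stated assumptions, not the authors' implementation. The function names (`soft_token`, `info_nce`, `icr_style_loss`), the temperature of 0.05, the in-batch InfoNCE form, and the hinge-style penalty standing in for the "refinement regularization term" are all hypothetical choices made here. The LLM forward pass and the pooling of per-step hidden states are omitted.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def soft_token(logits, token_embeddings):
    # Instead of sampling one discrete token, mix all token embeddings
    # by their predicted probabilities, which keeps the step differentiable.
    probs = np.exp(log_softmax(logits))
    return probs @ token_embeddings

def info_nce(queries, positives, temperature=0.05):
    # In-batch contrastive loss: row i of `queries` should match row i of
    # `positives`; other rows act as negatives. Temperature is an assumption.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = (q @ p.T) / temperature
    return float(-np.mean(np.diag(log_softmax(sims, axis=1))))

def icr_style_loss(query_steps, positive_steps, lam=1.0):
    # query_steps / positive_steps: arrays of shape (K, B, D) holding the
    # embedding obtained after each of K generation steps.
    step_losses = np.array(
        [info_nce(q, p) for q, p in zip(query_steps, positive_steps)]
    )
    # Hypothetical regularizer: penalize any step whose contrastive loss is
    # higher than the previous step's, so later steps are pushed to improve.
    regressions = np.maximum(0.0, step_losses[1:] - step_losses[:-1])
    total = step_losses.mean() + lam * regressions.sum()
    return float(total), step_losses
```

In a real training loop these operations would run inside an autodiff framework, so that gradients flow through every soft token back into the model. The NumPy version only shows the forward computation.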

Paper Structure

This paper contains 25 sections, 10 equations, 5 figures, 10 tables, 2 algorithms.

Figures (5)

  • Figure 1: Top: Comparison between embedding LLMs that extract static representations and generative LLMs that can iteratively refine through reasoning. Bottom: Overview of GIRCSE. Our framework combines Soft Token Generation and Iterative Contrastive Refinement to enable end-to-end generative training.
  • Figure 2: Effect of generation length at inference. Top: GIRCSE consistently improves with longer generations (10–20 tokens) despite being trained on only 5 tokens. Bottom: Baseline models show degraded or fluctuating performance across generation lengths. Gray area indicates configurations beyond training length.
  • Figure 3: Comparison of average MTEB scores (%) between GIRCSE and two fair baselines across three backbone LLMs and varying training sample sizes. GIRCSE consistently delivers superior performance, especially under limited-data settings.
  • Figure 4: Training loss curve of GIRCSE.
  • Figure 5: Training gradient norm (L2) of GIRCSE, plotted with the top 2% of outliers removed for clarity.