LLM2Vec-Gen: Generative Embeddings from Large Language Models

Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy

TL;DR

LLM2Vec-Gen is a self-supervised approach that, rather than encoding the input, learns to represent the model's potential response. It achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher.

Abstract

LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to a 43.2% reduction in harmful content retrieval and a 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.
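The abstract describes two training signals: the LLM's own completion and a teacher embedding used as a distillation target. A minimal sketch of how they might be combined is given below. The abstract does not state the functional form of either term, so everything here is an assumption made for illustration: mean pooling over the special tokens' hidden states, cosine distance for alignment, mean negative log-likelihood for reconstruction, the weight align_weight, and all function names.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector (assumed pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def reconstruction_loss(response_log_probs):
    """Mean negative log-likelihood of the response tokens (assumed form)."""
    return -sum(response_log_probs) / len(response_log_probs)


def total_loss(special_token_states, teacher_embedding, response_log_probs,
               align_weight=1.0):
    """Combine both signals; align_weight is a hypothetical hyperparameter."""
    student_embedding = mean_pool(special_token_states)
    l_align = cosine_distance(student_embedding, teacher_embedding)
    l_recon = reconstruction_loss(response_log_probs)
    return l_recon + align_weight * l_align
```

In a real setup only the special tokens' parameters would receive gradients from this loss, since the backbone stays frozen; the sketch covers only the scalar objective.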

Paper Structure (37 sections, 2 equations, 6 figures, 13 tables)

Figures (6)

  • Figure 1: Illustration of the input-output gap. Semantically distinct queries $q_1$ and $q_2$ belong to the same category (anger). Input-centric encoders place them far apart (yellow), but their LLM responses are more similar, yielding closer representations (green). LLM2Vec-Gen encodes the response rather than the input.
  • Figure 2: Overview of LLM2Vec-Gen. Left: Data generation -- given unlabeled queries, the LLM generates responses which are embedded by an unsupervised teacher. Right: Trainable thought and compression tokens are appended to queries. The compression tokens' hidden states are optimized via reconstruction loss $\mathcal{L}_{\text{recon}}$ (reconstruct the response from soft prompts) and alignment loss $\mathcal{L}_{\text{align}}$ (match the teacher's response embedding). The LLM backbone remains frozen throughout training.
  • Figure 3: MTEB average score as a function of model size across three model families. LLM2Vec-Gen consistently outperforms LLM2Vec across all model sizes and architectures.
  • Figure 4: Impact of special token count on MTEB-Lite performance. Performance improves from 2 to 20 tokens (* our default setting) but shows marginal gains thereafter. In each setting, half are thought tokens and half are compression tokens.
  • Figure 5: The prompt used for the generation phase of generate-then-encode baseline evaluation.
  • ...and 1 more figures
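
The token layout described in the captions for Figures 2 and 4 (special tokens appended to the query, split evenly between thought and compression tokens, 20 by default) can be sketched as follows. The token strings, the function names, and the ordering of thought tokens before compression tokens are assumptions for illustration, not the authors' implementation.

```python
def build_special_tokens(total=20):
    """Split `total` special tokens evenly into thought and compression tokens."""
    if total <= 0 or total % 2 != 0:
        raise ValueError("total must be a positive even number")
    half = total // 2
    # Hypothetical token names; the source does not give the real ones.
    thought = [f"<thought_{i}>" for i in range(half)]
    compression = [f"<compress_{i}>" for i in range(half)]
    return thought, compression


def append_special_tokens(query_tokens, total=20):
    """Append special tokens to a tokenized query.

    Returns the full sequence and the positions of the compression tokens,
    whose hidden states are the ones the Figure 2 caption says are optimized.
    """
    thought, compression = build_special_tokens(total)
    sequence = list(query_tokens) + thought + compression
    start = len(sequence) - len(compression)
    compression_positions = list(range(start, len(sequence)))
    return sequence, compression_positions
```

For example, append_special_tokens(["how", "are", "you"], total=4) yields a seven-token sequence whose last two positions hold the compression tokens.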