Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment

Jingcheng Deng, Zhongtao Jiang, Liang Pang, Liwei Chen, Kun Xu, Zihao Wei, Huawei Shen, Xueqi Cheng

TL;DR

This paper addresses the mismatch between generative next-token objectives and discriminative contrastive learning for text embeddings. It introduces AutoRegEmbed, a two-task framework that combines information compression with conditional distribution alignment, so that embeddings follow the autoregressive nature of the LLM, capture global semantics, and remain aligned and uniform. Empirical results show that AutoRegEmbed outperforms traditional contrastive methods given the same number of training samples and achieves performance comparable to state-of-the-art models when trained on the same amount of data, across semantic similarity and retrieval benchmarks. The approach offers a scalable, efficient path to high-quality LLM embeddings with broad applicability to retrieval, similarity assessment, and downstream tasks, though it requires careful data curation to mitigate safety and bias concerns.

Abstract

A new trend uses LLMs as dense text encoders via contrastive learning. However, since LLM embeddings predict the probability distribution of the next token, they are inherently generative and distributive, conflicting with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. This discrepancy hinders the full utilization of LLMs' pre-training capabilities, resulting in inefficient learning. In response to this issue, we propose AutoRegEmbed, a new contrastive learning method built on embedding conditional probability distributions, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task aligns text embeddings with the embeddings of positive samples by leveraging the conditional distribution of the embeddings, while simultaneously reducing the likelihood of generating negative samples from the text embeddings, thereby achieving embedding alignment and uniformity. Experimental results demonstrate that our method significantly outperforms traditional contrastive learning approaches and achieves performance comparable to state-of-the-art models when using the same amount of data.
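
For orientation, the traditional contrastive objective that the abstract contrasts against can be sketched as an InfoNCE loss over cosine similarities. This is a generic illustration of that baseline, not AutoRegEmbed itself and not the authors' code: the function names, the temperature value, and the toy two-dimensional vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for a single query (hypothetical helper).

    The loss is the negative log-probability of the positive under a
    softmax over temperature-scaled cosine similarities to the positive
    and all negatives. The temperature of 0.05 is an assumed value.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[0]


# Toy embeddings: the positive points almost the same way as the query.
q = [1.0, 0.0]
d_pos = [0.9, 0.1]
d_neg = [[0.0, 1.0], [-1.0, 0.0]]

loss_good = info_nce(q, d_pos, d_neg)               # correct pairing
loss_bad = info_nce(q, d_neg[0], [d_pos, d_neg[1]])  # mismatched pairing
```

The loss is small when the query is closest in angle to its positive and grows when a negative is closer. According to the abstract, AutoRegEmbed replaces this direct cosine comparison with an alignment of the conditional distributions induced by the embeddings.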

Paper Structure

This paper contains 39 sections, 12 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Comparison of the Pareto fronts of AutoRegEmbed and other methods. The horizontal axis represents the number of training samples, while the vertical axis indicates the average performance across 10 STS datasets. The upper left corner represents the region with the highest learning efficiency.
  • Figure 2: Overall framework of AutoRegEmbed. First, we perform the information compression task to inject key information from the context and instruction into the compressed tokens. Then, we optimize the conditional probability distribution of these tokens: $S_1(q,d^+)$ aligns the distributions of $e_{q,I_{\mathrm{next}}}$ and $e_{d^+,I_{\mathrm{self}}}$ as closely as possible, while $S_2(d^+,d^-;q)$ increases the probability that $e_{q,I_{\mathrm{next}}}$ generates positive samples and reduces the probability that it generates negative samples. The encoder and decoder share the same structure.
  • Figure 3: We evaluate the learning efficiency of our method against traditional contrastive learning on 10 STS datasets, comparing their performance under the same number of samples. Further details are provided in Appendix \ref{app:details}.