Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao

TL;DR

This work introduces DiffEmbed, a diffusion-language-model–based approach to text embeddings designed to overcome the bidirectionality gap in autoregressive embeddings. Starting from diffusion LMs pre-trained with bidirectional denoising, DiffEmbed mean-pools the final-layer representations and trains them with a contrastive objective. It delivers superior long-document encoding and reasoning performance, surpassing LLM-based embeddings on the LongEmbed and BRIGHT benchmarks while remaining competitive on general embedding tasks. A key contribution is ReasonAug, a large reasoning-focused dataset generated via LLM prompts to train embeddings for logical retrieval, revealing the value of bidirectional context for complex theories and algorithms. Overall, the results highlight diffusion embeddings as a promising direction for global-context text representations, with potential for further gains as data and model scale increase.

Abstract

Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
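
The contrast between unidirectional and bidirectional attention described in the abstract can be illustrated with a small sketch. This is not the paper's code: the function names `causal_mask`, `bidirectional_mask`, and `visible_fraction` are hypothetical, and the masks are plain Python lists rather than tensors from any particular framework.

```python
def causal_mask(n):
    """Autoregressive attention: token i may attend only to tokens j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional attention: every token may attend to every token."""
    return [[True] * n for _ in range(n)]


def visible_fraction(mask):
    """Average share of the sequence that each token can attend to."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


# Under a causal mask the first token sees only itself, so early-token
# representations carry no information about the rest of the text.
print(causal_mask(3)[0])         # [True, False, False]
print(bidirectional_mask(3)[0])  # [True, True, True]
```

For a four-token sequence, `visible_fraction(causal_mask(4))` is 10/16 = 0.625, while the bidirectional mask gives 1.0. This is one simple way to see why pooling over causally masked token states may under-represent global context.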

Paper Structure

This paper contains 45 sections, 2 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: (a) Unidirectional attention in Autoregressive LM. (b) Bidirectional attention in Diffusion LM, i.e., Dream. (c) Retrieval performance comparison between the diffusion embedding model and the LLM embedding model enhanced with LLM2Vec adaptation.
  • Figure 2: Overview of DiffEmbed. Final-layer token representations from the backbone diffusion LM are mean-pooled to obtain text embeddings.
  • Figure 3: Left: data augmentation pipeline. Right: qualitative examples of seed concepts, their definitions, and associated question–solution pairs. A question-to-question retrieval sample can be constructed using a query question $Q_A$, a positive document $(Q_B, S_B)$ generated from the same concept $X$, and a hard negative document $(Q_C, S_C)$ generated from a different concept $X'$. A question-to-concept retrieval sample can consist of a query question $Q_B$, a positive document $D$ (the definition of the relevant concept), and a hard negative document $D'$ (the definition of a different concept $X'$).
  • Figure 4: Retrieval performance on TheoQ. for Dream and Qwen2.5 models trained with varying amounts of ReasonAug data.
  • Figure 5: t-SNE visualization of document embeddings from ReasonAug. The documents are grouped and color-coded by concept (see the color-mapping legend). Mathematical theorems include Vieta’s Formulas, Pigeonhole Principle, Euler’s Identity, and Central Limit Theorem. Algorithmic concepts include Two Pointers, N-Queens Problem, Sweep Line Algorithm, and Kahn’s Algorithm.
  • ...and 1 more figure
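
The captions of Figures 2 and 3 describe two mechanics: mean-pooling final-layer token representations into a text embedding, and building training triples from a query, a positive document, and a hard negative. A minimal sketch of how these could fit together is given below. It rests on assumptions not stated in the source: the contrastive objective is taken to be an InfoNCE-style loss over cosine similarities, the temperature of 0.05 is an arbitrary placeholder, and the names `mean_pool`, `cosine`, and `info_nce` are hypothetical. Plain Python lists stand in for model outputs.

```python
import math


def mean_pool(token_vecs, attention_mask):
    """Average token vectors, skipping positions whose mask entry is 0 (padding)."""
    kept = [v for v, m in zip(token_vecs, attention_mask) if m]
    dim = len(token_vecs[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.05):
    """Assumed InfoNCE-style loss: negative log-softmax score of the positive."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denom - logits[0]


# Toy "final-layer" token vectors; the last row is padding and is ignored.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
embedding = mean_pool(tokens, [1, 1, 0])
print(embedding)  # [0.5, 0.5]

# Query close to the positive and far from the hard negative: loss near zero.
print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

In the question-to-question setting of Figure 3, `query` would be the embedding of $Q_A$, `positive` the embedding of $(Q_B, S_B)$, and `negatives` would contain the embedding of $(Q_C, S_C)$. The loss grows when the query sits closer to the hard negative than to the positive.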