Text2Token: Unsupervised Text Representation Learning with Token Target Prediction

Ruize An, Richong Zhang, Zhijie Nie, Zhanyu Wu, Yanzhao Zhang, Dingkun Long

TL;DR

Text2Token reframes unsupervised text representation learning (TRL) as token target prediction: an LLM decoder maps embeddings to vocabulary outputs, and training is guided by a precomputed token target distribution. Two unsupervised token-target constructors, data-driven and model-derived, supply both surface-text tokens and semantically related tokens, and the model is trained with a two-stage KL-divergence objective that aligns vocabulary space with representation space. Experiments on MTEB v2 show Text2Token performing competitively with LLM2Vec, the state-of-the-art unsupervised contrastive embedder, across multiple tasks and pooling/attention settings. The results suggest that jointly optimizing token-target distributions and representation spaces is a promising direction for generative TRL, with practical implications for retrieval and clustering.

Abstract

Unsupervised text representation learning (TRL) is a fundamental task in natural language processing that can improve search and recommendation using the web's unlabeled text. A recent empirical study finds that high-quality representations align with the key tokens of the input text, revealing a potential connection between representation space and vocabulary space. Inspired by this finding, we revisit generative tasks and develop Text2Token, an unsupervised generative framework for TRL. The framework is based on a token target prediction task that uses a carefully constructed target token distribution as the supervisory signal. To construct a high-quality target token distribution, we analyze the token-alignment properties of advanced embedders and identify two essential categories of key tokens: (1) the meaningful tokens in the text and (2) semantically derived tokens beyond the text. Based on these insights, we propose two methods -- data-driven and model-derived -- to construct synthetic token targets from the data or from the LLM backbone. Experiments on the MTEB v2 benchmark demonstrate that Text2Token achieves performance competitive with LLM2Vec, the state-of-the-art embedder trained with unsupervised contrastive learning. Our analysis further shows that the vocabulary and representation spaces are optimized jointly and move toward the optimal solution during training, providing new ideas and insights for future work.
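The token target prediction objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the frequency-weighted target over the text's own non-stop tokens, and the toy vocabulary size are all assumptions made for illustration. The only parts taken from the text are that a target distribution over the vocabulary is precomputed and that the predicted vocabulary distribution is trained against it with a KL-divergence loss.

```python
import math
from collections import Counter


def data_driven_target(token_ids, vocab_size, stop_ids=frozenset()):
    """Hypothetical data-driven target: a frequency-weighted distribution
    over the text's own tokens, with stop tokens masked out.
    (An assumed stand-in, not the paper's exact constructor.)"""
    counts = Counter(t for t in token_ids if t not in stop_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no meaningful tokens left after filtering")
    target = [0.0] * vocab_size
    for token_id, count in counts.items():
        target[token_id] = count / total
    return target


def softmax(logits):
    """Numerically stable softmax over vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), skipping zero-probability target entries."""
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(target, predicted)
        if p > 0.0
    )


# Toy example: vocabulary of 5 tokens, token id 0 treated as a stop token.
target = data_driven_target([2, 2, 3, 0], vocab_size=5, stop_ids={0})
predicted = softmax([0.0, 0.0, 0.0, 0.0, 0.0])  # untrained, uniform prediction
loss = kl_divergence(target, predicted)
```

In a real training loop the logits would come from decoding the text embedding through the LLM's output head, and the loss would be minimized by gradient descent; the pure-Python version here only shows how the target and the loss fit together.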

Paper Structure

This paper contains 45 sections, 20 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Comparison between (a) traditional discriminative contrastive learning and (b) our proposed generative unsupervised framework: Text2Token.
  • Figure 2: The relation between the findings of Nie et al. (2025) (left) and the training method newly proposed in this work (right).
  • Figure 3: The overview of our generative framework, Text2Token, for unsupervised text representation learning.
  • Figure 4: Ablation results of single-stage training.
  • Figure 5: The result variation with the hyperparameter.
  • ...and 3 more figures