LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

Parishad BehnamGhader; Vaibhav Adlakha; Marius Mosbach; Dzmitry Bahdanau; Nicolas Chapados; Siva Reddy

TL;DR

This work presents LLM2Vec, a lightweight, unsupervised method that converts decoder-only large language models into universal text encoders in three steps: enabling bidirectional attention, training with masked next token prediction, and applying unsupervised contrastive learning. The approach is evaluated on four decoder-only models (1.3B–8B parameters) and yields state-of-the-art unsupervised embedding performance on MTEB, with additional gains when combined with supervised contrastive training. The authors also analyze how each step changes the models' representations, including surprising behavior in Mistral models suggesting that they may have been exposed to some form of bidirectional attention during pretraining. The method is parameter- and data-efficient, outperforms encoder-only baselines on word-level tasks, and offers a practical path to high-quality text embeddings from decoder-only LLMs without synthetic GPT-4 data.

Abstract

Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 4 popular LLMs ranging from 1.3B to 8B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data (as of May 24, 2024). Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.
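The first two steps named in the abstract can be illustrated with a small, library-free sketch. It is not the authors' implementation: `attention_mask` and `mntp_example` are hypothetical helper names, the masking probability and mask token id are arbitrary, and the choice to score a masked token at position i using the logits at position i - 1 is an assumption based on the "next token" framing of the objective.

```python
import random


def attention_mask(seq_len, bidirectional):
    """Return a square mask; mask[i][j] is True when position i may attend to j.

    A causal (decoder-style) mask only allows j <= i. Enabling bidirectional
    attention replaces it with an all-True mask.
    """
    return [[bidirectional or j <= i for j in range(seq_len)]
            for i in range(seq_len)]


def mntp_example(token_ids, mask_id, mask_prob, rng):
    """Build one masked next token prediction example (illustrative only).

    Returns (inputs, targets). `targets` maps the position whose logits are
    scored to the original token id that was masked.
    Assumption: a token masked at position i is predicted from position i - 1.
    """
    inputs = list(token_ids)
    targets = {}
    # Position 0 has no previous position, so it is never masked here.
    for i in range(1, len(token_ids)):
        if rng.random() < mask_prob:
            inputs[i] = mask_id
            targets[i - 1] = token_ids[i]
    return inputs, targets


causal = attention_mask(4, bidirectional=False)
full = attention_mask(4, bidirectional=True)

tokens = [11, 12, 13, 14, 15]  # made-up token ids
inputs, targets = mntp_example(tokens, mask_id=0, mask_prob=0.5,
                               rng=random.Random(0))
```

In a real model the mask would be passed to the attention layers and the targets would feed a cross-entropy loss; the sketch only shows which positions see which tokens and where the loss would be read off.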

Paper Structure (69 sections, 5 equations, 9 figures, 12 tables)

Introduction
LLM2Vec
Three simple ingredients
Enabling bidirectional attention
Masked next token prediction
Unsupervised contrastive learning
Transforming decoder-only LLMs with LLM2Vec
Models
Training data
Masked next token prediction
Unsupervised contrastive learning
LLM2Vec-transformed models are strong unsupervised text embedders
Evaluation on word-level tasks
Setup
Results
...and 54 more sections

Figures (9)

Figure 1: The 3 steps of LLM2Vec. First, we enable bidirectional attention to overcome the restrictions of causal attention (Bi). Second, we adapt the model to use bidirectional attention by masked next token prediction training (MNTP). Third, we apply unsupervised contrastive learning with mean pooling to learn better sequence representations (SimCSE).
Figure 2: Evaluation of LLM2Vec-transformed models on word-level tasks. Solid and dashed horizontal lines show the performance of Uni and DeBERTa-v3-large, respectively.
Figure 3: Unsupervised results on our 15 task subset of the MTEB dataset. We ablate three different pooling choices: EOS, mean pooling, and weighted mean pooling. LLM2Vec is compatible with all three approaches and works best with mean pooling.
Figure 4: Cosine similarity between query ($q$) and negative ($s^-$) as well as positive examples ($s^+$). Plots for LLaMA-2-7B and other approaches are shown in the appendix.
Figure 5: Cosine similarities at different token positions and layers when comparing representations constructed with causal attention to those constructed with bidirectional attention (without training). Additional plots are shown in the appendix.
...and 4 more figures
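
The captions above refer to three pooling choices (EOS, mean, weighted mean), to cosine similarity between a query and positive or negative examples, and to contrastive learning. The following is a minimal sketch of those operations in plain Python, not the paper's code. The function names are hypothetical, the linear position weights in `weighted_mean_pool` are an assumption about what "weighted mean pooling" means here, and the InfoNCE-style loss with temperature 0.05 is a common SimCSE-like formulation assumed for illustration.

```python
import math


def mean_pool(hidden):
    """Average the token vectors of one sequence."""
    dim = len(hidden[0])
    return [sum(vec[d] for vec in hidden) / len(hidden) for d in range(dim)]


def eos_pool(hidden):
    """Use the last token's vector as the sequence embedding."""
    return list(hidden[-1])


def weighted_mean_pool(hidden):
    """Position-weighted mean: later tokens get linearly larger weights.

    Assumption: weights are 1, 2, ..., n, normalised to sum to 1.
    """
    total = sum(range(1, len(hidden) + 1))
    dim = len(hidden[0])
    return [sum((i + 1) * vec[d] for i, vec in enumerate(hidden)) / total
            for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.05):
    """Contrastive loss: -log softmax of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Made-up hidden states for a three-token sequence with two dimensions.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mean_vec = mean_pool(hidden)
eos_vec = eos_pool(hidden)
weighted_vec = weighted_mean_pool(hidden)

# The loss is small when the positive is close to the query and large otherwise.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In unsupervised SimCSE-style training the positive is typically a second encoding of the same input under a different dropout mask, and the negatives are other sequences in the batch; the toy vectors here only stand in for those embeddings.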
