Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?

Yixuan Tang; Yi Yang

TL;DR

The paper tackles how pooling and attention choices shape LLM-based embeddings by conducting a controlled, large-scale comparison using the same data and base models. It systematically evaluates five design combinations, introduces Multi-Layers Trainable Pooling that fuses information from all hidden layers via cross-attention, and provides statistical significance tests across tasks in MTEB. Key findings show bidirectional attention and a trainable pooling layer improve semantic textual similarity and retrieval, while multi-layer pooling yields broad gains under bidirectional context; however, no single design dominates clustering or classification. The work offers a practical pooling strategy that balances performance and task dependency, and provides replication-ready code and robustness checks with an alternative base model.

Abstract

The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. These models, which employ different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, yet it remains unclear what constitutes an effective design for LLM-based embedding models. A fair comparison is difficult because the models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer perform better on text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.
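The two simpler design choices named in the abstract, EOS-last token pooling and causal versus bidirectional attention, can be illustrated with a small sketch. It is not the authors' code: the function names, the toy hidden states, and the assumption of right padding with the EOS token as the last real token are all illustrative.

```python
from typing import List

Vector = List[float]


def causal_mask(seq_len: int) -> List[List[int]]:
    """Default LLM attention: position `query` may attend only to keys <= query."""
    return [[1 if key <= query else 0 for key in range(seq_len)]
            for query in range(seq_len)]


def bidirectional_mask(seq_len: int) -> List[List[int]]:
    """Bidirectional attention: every position may attend to every position."""
    return [[1] * seq_len for _ in range(seq_len)]


def eos_last_token_pooling(hidden_states: List[Vector],
                           padding_mask: List[int]) -> Vector:
    """Use the hidden state of the last non-padding token as the embedding.

    Assumes right padding and that the last real token is the EOS token.
    """
    last_real = max(i for i, keep in enumerate(padding_mask) if keep == 1)
    return list(hidden_states[last_real])


# Toy input (made-up values): three real tokens followed by one padding position.
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
padding_mask = [1, 1, 1, 0]
embedding = eos_last_token_pooling(hidden_states, padding_mask)
```

Under the causal mask only the final row attends to every position, which is one reason last-token pooling is a natural companion to default causal attention.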

Paper Structure (22 sections, 3 figures, 13 tables)

This paper contains 22 sections, 3 figures, 13 tables.

Introduction
Commonly Used Pooling and Attention Strategies
Pooling Strategy
Attention Strategy
Fine-Tuning LLMs as Embedding Models
Multi-Layers Trainable Pooling: A New Pooling Strategy that Obtains Embedding From Multiple Layers
Last layer vs. Other layers
Multi-Layers Trainable Pooling
Pooling and Attention Experiments
Pooling and Attention Combinations
Experimental Details
Experimental Details
Empirical Analysis
What is the Optimal Pooling Strategy?
What is the Optimal Attention Strategy?
...and 7 more sections

Figures (3)

Figure 1: The correlation heatmap of EOS token hidden states across different layers. The two figures on the left are measured on Mistral-7B-v0.1 using the HotpotQA and MS MARCO datasets, while the two figures on the right are measured on Meta-Llama-3-8B. Areas shaded in blue indicate low correlation, while areas shaded in red denote high correlation. The horizontal axis represents the layer index ranging from 0 to 31, while the vertical axis represents the layer index ranging from 31 to 0.
Figure 2: Performance of different hidden layers from Mistral-7B-v0.1 (left) and Meta-Llama-3-8B (right) on the MTEB benchmark. The highest score is marked with a star. The X-axis represents the layer index ranging from 0 to 31, and the Y-axis represents the performance score.
Figure 3: The proposed pooling method: Multi-Layers Trainable Pooling. It combines the EOS token hidden states from all layers in the LLM and transforms them into the final embedding using a cross-attention network.
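The idea in the Figure 3 caption, combining per-layer EOS hidden states through cross-attention, might look roughly like the sketch below. It rests on several assumptions that the caption does not confirm: a single learned query vector, separate key and value projections, scaled dot-product attention, and random values standing in for trained parameters. All names are hypothetical, and the paper's actual network may differ.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_layer_pooling(layer_eos_states: Matrix, query: Vector,
                        w_key: Matrix, w_value: Matrix) -> Tuple[Vector, Vector]:
    """Cross-attend from one learned query over the EOS state of every layer.

    Returns the pooled embedding and the attention weight given to each layer.
    """
    keys = [matvec(w_key, state) for state in layer_eos_states]
    values = [matvec(w_value, state) for state in layer_eos_states]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    embedding = [sum(w * value[j] for w, value in zip(weights, values))
                 for j in range(len(values[0]))]
    return embedding, weights


# Toy sizes; random numbers stand in for real hidden states and trained weights.
rng = random.Random(0)
num_layers, hidden_dim, embed_dim = 4, 6, 3


def random_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


layer_eos_states = random_matrix(num_layers, hidden_dim)  # one EOS state per layer
query = [rng.gauss(0.0, 1.0) for _ in range(embed_dim)]   # assumed learned query
w_key = random_matrix(embed_dim, hidden_dim)
w_value = random_matrix(embed_dim, hidden_dim)

embedding, layer_weights = multi_layer_pooling(layer_eos_states, query, w_key, w_value)
```

The returned `layer_weights` sum to one and show how much each layer contributes to the final embedding, in contrast to pooling that reads only the last layer.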
