Extracting Sentence Embeddings from Pretrained Transformer Models

Lukas Stankevičius, Mantas Lukoševičius

TL;DR

The paper investigates how to derive high-quality sentence embeddings from pretrained transformer models without task-specific fine-tuning. It systematically evaluates token-aggregation and post-processing techniques, including prompt-based templates, averaging token representations across contexts, and combining contextual with static representations, on eight STS, six clustering, and twelve classification tasks. Representation shaping yields very large gains on unsupervised STS and clustering: with effective aggregation and post-processing, the simple Avg. baseline and even random embeddings (RE) become competitive with BERT-derived representations, while prompts offer mixed gains. The work also relates task performance to the isotropy, alignment, and uniformity of the representations, and shows that strong sentence embeddings, which matter for applications such as retrieval-augmented generation, can be built without task-specific fine-tuning.

Abstract

Pretrained transformer models shine in many natural language processing tasks and are therefore expected to carry a representation of the meaning of the input sentence or text. These sentence-level embeddings are also important in retrieval-augmented generation. But do commonly used plain averaging or prompt templates sufficiently capture and represent the underlying meaning? After providing a comprehensive review of existing sentence embedding extraction and refinement methods, we thoroughly test different combinations and our original extensions of the most promising ones on pretrained models. Namely, given the hidden representations of the 110 M parameter BERT from multiple layers and many tokens, we try diverse ways to extract optimal sentence embeddings. We test various token aggregation and representation post-processing techniques. We also test multiple ways of using a general Wikitext dataset to complement BERT's sentence embeddings. All methods are tested on eight Semantic Textual Similarity (STS), six short text clustering, and twelve classification tasks. We also evaluate our representation-shaping techniques on other, static models, including random token representations. The proposed representation extraction methods improve the performance on STS and clustering tasks for all models considered. The improvements are very large for static token-based models; random embeddings in particular almost reach the performance of BERT-derived representations on STS tasks. Our work shows that the representation-shaping techniques significantly improve sentence embeddings extracted from BERT-based and simple baseline models.
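
To make the "plain averaging" baseline mentioned in the abstract concrete, the following is a minimal NumPy sketch of mask-aware mean pooling over token vectors, followed by one generic post-processing step. The function names (`mean_pool`, `center_and_normalize`) and the choice of centering plus L2 normalization are illustrative assumptions, not the authors' implementation or their specific post-processing methods.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)       # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)        # (batch, 1), avoids division by zero
    return summed / counts

def center_and_normalize(sentence_embeddings):
    """Hypothetical post-processing stand-in: subtract the mean vector, then L2-normalize."""
    centered = sentence_embeddings - sentence_embeddings.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.clip(norms, 1e-12, None)

# Toy input: one sentence with two real tokens and one padded position.
tokens = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)  # the padded vector does not affect the average
```

In practice the token vectors would come from one or more hidden layers of a model such as BERT; here they are hard-coded so the sketch runs without any model download.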

Paper Structure (82 sections, 10 equations, 8 figures, 19 tables)

Figures (8)

  • Figure 1: Relation between average Spearman correlation for STS tasks and IsoScore of Wikitext representations for each model. Pearson correlation coefficients are shown.
  • Figure 2: Relation between average clustering accuracy and IsoScore of Wikitext representations for each model. Pearson correlation coefficients are shown.
  • Figure 3: Alignment and uniformity of representations in relation to various token pooling and post-processing techniques. Lower values are better.
  • Figure 4: Layer-wise performance of templated models T0 (subfigures a, d, g) and T4 (b, e, h), as well as BERT versus RE with no weighting or post-processing (c, f, i). The lines show the average performance on STS (a, b, c), clustering (d, e, f), and classification tasks (g, h, i), while the shaded areas correspond to the standard deviation. The first + last aggregation over layers is also shown as the last tick $1 \atop 12$ on the horizontal axis.
  • Figure 5: Performance of the BERT + Avg. model as a function of the weight $w$ of the Avg. model and of the layer from which representations are taken (for both models). To the right of the black line on the horizontal axis, average aggregation of multiple layers is also shown. Tokens are simply averaged and no post-processing is used. The horizontal line with $w=0.0$ corresponds to the regular BERT (B) model, $w=0.5$ is B + Avg., and $w=1.0$ is the Avg. model. The white $\times$ marks the maximum value.
  • ...and 3 more figures
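
The alignment and uniformity measures named in the Figure 3 caption are commonly computed as below. This is a sketch under the assumption that the widely used definitions apply (mean distance between positive pairs for alignment, log of the mean Gaussian potential over all pairs for uniformity, both on L2-normalized embeddings); the summary above does not confirm the exact variant or the hyperparameters `alpha` and `t` used in the paper. Lower is better for both, which agrees with the caption.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

def alignment(x, y, alpha=2):
    """Mean distance (to the power alpha) between embeddings of positive pairs.

    x[i] and y[i] are assumed to embed two sentences with similar meaning.
    """
    x, y = l2_normalize(x), l2_normalize(y)
    return float((np.linalg.norm(x - y, axis=1) ** alpha).mean())

def uniformity(x, t=2):
    """Log of the mean Gaussian potential over all distinct pairs of rows."""
    x = l2_normalize(x)
    i, j = np.triu_indices(len(x), k=1)                  # all unordered pairs i < j
    sq_dists = ((x[i] - x[j]) ** 2).sum(axis=1)
    return float(np.log(np.exp(-t * sq_dists).mean()))

# Toy check: identical pairs are perfectly aligned; two orthogonal unit
# vectors have squared distance 2, so uniformity is about -t * 2.
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
align_same = alignment(emb, emb)
unif_orth = uniformity(emb)
```

Embeddings that collapse into a narrow cone give a uniformity close to 0, while embeddings spread over the unit sphere give more negative values.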