Revisiting Word Embeddings in the LLM Era

Yash Mahajan, Matthew Freestone, Naman Bansal, Sathyanarayanan Aakur, Shubhra Kanti Karmaker Santu

TL;DR

This work systematically compares classic word embeddings with embeddings derived from large language models (LLMs) in both decontextualized and contextualized settings. In decontextualized mode, certain LLMs (e.g., ADA, PaLM) show strong analogy performance and clustering of related terms, rivaling some classical baselines. In contextualized mode, contrastive sentence encoders like SBERT and SimCSE often outperform LLMs on sentence-level semantics, while LLaMA-family models demonstrate strong token-level contextualization. By introducing anchor-based variational tasks and a synthetic dataset, the paper reveals a nuanced landscape: LLMs excel at word-level context, but classical models remain competitive for sentence-level semantics. These findings can guide practitioners in model selection and point to future research on interpretable, context-aware embeddings.

Abstract

Large Language Models (LLMs) have recently shown remarkable advancement in various NLP tasks. As a result, a popular trend has emerged in which NLP researchers extract word/sentence/document embeddings from these large decoder-only models and use them for various inference tasks with promising results. However, it is still unclear whether the performance improvement of LLM-induced embeddings is merely due to scale or whether the embeddings they produce differ significantly from those of classical encoding models like Word2Vec, GloVe, Sentence-BERT (SBERT), or Universal Sentence Encoder (USE). This is the central question we investigate in the paper by systematically comparing classical decontextualized and contextualized word embeddings with their LLM-induced counterparts. Our results show that LLMs cluster semantically related words more tightly and perform better on analogy tasks in decontextualized settings. However, in contextualized settings, classical models like SimCSE often outperform LLMs in sentence-level similarity assessment tasks, highlighting their continued relevance for fine-grained semantics.

Paper Structure

This paper contains 30 sections, 17 figures, and 4 tables.

Figures (17)

  • Figure 1: The distribution of cosine similarities between all pairs of words for each model.
  • Figure 2: Violin box plot showing the distribution of cosine similarities for random, morphologically related, and semantically related pairs of words for each model.
  • Figure 3: Spearman's $\rho$ for each model pair, calculated from $\sim 2.1$B randomly selected word pairs out of a total of $6.4$B word pairs from the WordNet (RQ1) corpus.
  • Figure 4: Mean-Variance plot of the difference in Word Pair Similarity Ranks for the BATS corpus. For all other model comparisons, refer to appendix figure \ref{fig:model-agreement-apx}.
  • Figure 5: For each model, the cosine similarity of each pair of related words was computed and ranked among all pairs of words. The difference in ranking between model pairs is shown here for certain BATS categories.
  • ...and 12 more figures
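
The quantities named in the captions of Figures 1 and 3 (cosine similarity over all word pairs, and Spearman's $\rho$ between two models' similarity scores) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each model's embeddings are available as a NumPy array with one row per word, in the same word order for both models. The function names and the random toy data are hypothetical, and the paper's actual sampling of word pairs is not reproduced here.

```python
import numpy as np


def pairwise_cosine(emb):
    """Cosine similarity for every unordered pair of rows in emb (n_words x dim)."""
    emb = np.asarray(emb, dtype=float)
    # Normalise each row to unit length so a dot product equals cosine similarity.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Keep only the strict upper triangle: each pair once, no self-pairs.
    i, j = np.triu_indices(len(emb), k=1)
    return sims[i, j]


def average_ranks(x):
    """1-based ranks of x; tied values share the mean of their positions."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    # Average the ranks within each group of equal values.
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def spearman_rho(a, b):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


# Toy usage with random stand-ins for two models' embeddings of the same 50 words.
# The dimensions (300 and 768) are arbitrary placeholders.
rng = np.random.default_rng(0)
model_a = rng.normal(size=(50, 300))
model_b = rng.normal(size=(50, 768))

sims_a = pairwise_cosine(model_a)  # one similarity per word pair (Figure 1 style)
sims_b = pairwise_cosine(model_b)
rho = spearman_rho(sims_a, sims_b)  # agreement between the two models (Figure 3 style)
```

Because Spearman's $\rho$ compares ranks rather than raw values, it is insensitive to differences in the overall scale or spread of cosine similarities between models, which makes it a natural choice when embedding spaces have different dimensionalities.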