Revisiting Word Embeddings in the LLM Era
Yash Mahajan, Matthew Freestone, Naman Bansal, Sathyanarayanan Aakur, Shubhra Kanti Karmaker Santu
TL;DR
This work systematically compares classic word embeddings with embeddings derived from large language models (LLMs) in both decontextualized and contextualized settings. In decontextualized mode, certain LLMs (e.g., ADA, PaLM) show strong analogy performance and cluster related terms well, rivaling some classical baselines. In contextualized mode, contrastive sentence encoders like SBERT and SimCSE often outperform LLMs on sentence-level semantics, while LLaMA-family models demonstrate strong token-level contextualization. By introducing anchor-based variational tasks and a synthetic dataset, the paper reveals a nuanced landscape in which LLMs excel at word-level context but classical models remain competitive for sentence-level semantics. These findings can guide practitioners in model selection and point to future research on interpretable, context-aware embeddings.
Abstract
Large Language Models (LLMs) have recently shown remarkable advances on various NLP tasks. As a result, a popular trend has emerged in which NLP researchers extract word/sentence/document embeddings from these large decoder-only models and use them for various inference tasks, with promising results. However, it is still unclear whether the performance improvement of LLM-induced embeddings is merely due to scale or whether the embeddings they produce differ significantly from those of classical encoding models like Word2Vec, GloVe, Sentence-BERT (SBERT), or Universal Sentence Encoder (USE). This is the central question we investigate in the paper, by systematically comparing classical embeddings with LLM-induced embeddings in both decontextualized and contextualized settings. Our results show that LLMs cluster semantically related words more tightly and perform better on analogy tasks in decontextualized settings. However, in contextualized settings, classical models like SimCSE often outperform LLMs in sentence-level similarity assessment tasks, highlighting their continued relevance for fine-grained semantics.
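For readers unfamiliar with the analogy tasks mentioned above, the following is a minimal sketch of one common way such tasks are scored: vector-offset arithmetic followed by a cosine-similarity nearest-neighbor search. It is an illustration under stated assumptions, not necessarily the exact protocol used in the paper. The function names (`cosine`, `solve_analogy`) and the three-dimensional toy vectors are hypothetical and chosen only so the example is self-contained; real experiments would load embeddings from a classical model or an LLM.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    Builds the target vector b - a + c and returns the vocabulary word
    (excluding a, b, and c) whose embedding is most cosine-similar to it.
    """
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Hypothetical toy embeddings (assumption): hand-built so that the second
# axis encodes gender and the third axis encodes royalty.
toy = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, 0.0, -1.0]),
}

answer = solve_analogy(toy, "man", "woman", "king")
print(answer)
```

Analogy accuracy over a benchmark is then the fraction of questions for which the returned word matches the gold answer, which makes it possible to compare embedding sources on equal footing.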
