A Comprehensive Survey of Sentence Representations: From the BERT Epoch to the ChatGPT Era and Beyond
Abhinav Ramesh Kashyap, Thanh-Tung Nguyen, Viktor Schlegel, Stefan Winkler, See-Kiong Ng, Soujanya Poria
TL;DR
This survey catalogs the evolution of sentence representations from traditional word- and sentence-embedding approaches to modern deep-learning and LLM-driven methods. It organizes the literature along supervised versus unsupervised paradigms, and across data, model, transform, and loss components, highlighting contrastive learning as a central thread while noting post-processing and data-centric innovations. Key findings show strong gains from simple data augmentation (e.g., dropout-based SimCSE) and from LLM-based data generation, while cross-lingual transfer, domain generalization, and the universality of representations beyond semantics remain persistent challenges. The work underscores the practical impact of sentence representations in retrieval and contextual reasoning for LLMs, and advocates advances in multilingual, multi-domain, and task-general representations to better integrate with upcoming AI systems.
Abstract
Sentence representations are a critical component in NLP applications such as retrieval, question answering, and text classification. They capture the meaning of a sentence, enabling machines to understand and reason over human language. In recent years, significant progress has been made in developing methods for learning sentence representations, including unsupervised, supervised, and transfer learning approaches. However, to date there has been no literature review of sentence representations. In this paper, we provide an overview of the different methods for sentence representation learning, focusing mostly on deep learning models. We provide a systematic organization of the literature, highlighting the key contributions and challenges in this area. Overall, our review highlights the importance of this area in natural language processing, the progress made in sentence representation learning, and the challenges that remain. We conclude with directions for future research, suggesting potential avenues for improving the quality and efficiency of sentence representations.
