From Word Vectors to Multimodal Embeddings: Techniques, Applications, and Future Directions For Large Language Models
Charles Zhang, Benji Peng, Xintian Sun, Qian Niu, Junyu Liu, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Ming Liu, Yichao Zhang, Xinyuan Song, Cheng Fei, Caitlyn Heqi Yin, Lawrence KQ Yan, Hongyang He, Tianyang Wang
TL;DR
This survey maps the trajectory from sparse word representations to rich multimodal embeddings, emphasizing contextualized models (ELMo, BERT, GPT), cross-lingual and personalized adaptations, and the integration of vision, robotics, and neuroscience. It details foundational concepts, sentence/document embedding strategies, and cross-lingual techniques, while highlighting compression, interpretability, numerical reasoning, and ethical considerations as critical research gaps. The work also discusses grounding language models in non-textual modalities, advances in robotics and cognitive science, and future directions for scalable training, bias mitigation, and adaptive learning. Overall, the paper provides a comprehensive framework for understanding embedding-based LLMs and outlines pragmatic avenues for scalable, interpretable, and groundable AI systems.
Abstract
Word embeddings and language models have transformed natural language processing (NLP) by facilitating the representation of linguistic elements in continuous vector spaces. This review revisits foundational concepts such as the distributional hypothesis and contextual similarity, tracing the evolution from sparse representations like one-hot encoding to dense embeddings including Word2Vec, GloVe, and fastText. We examine both static and contextualized embeddings, underscoring advancements in models such as ELMo, BERT, and GPT and their adaptations for cross-lingual and personalized applications. The discussion extends to sentence and document embeddings, covering aggregation methods and generative topic models, along with the application of embeddings in multimodal domains, including vision, robotics, and cognitive science. Advanced topics such as model compression, interpretability, numerical encoding, and bias mitigation are analyzed, addressing both technical challenges and ethical implications. Additionally, we identify future research directions, emphasizing the need for scalable training techniques, enhanced interpretability, and robust grounding in non-textual modalities. By synthesizing current methodologies and emerging trends, this survey offers researchers and practitioners an in-depth resource to push the boundaries of embedding-based language models.
