A Comprehensive Empirical Evaluation of Existing Word Embedding Approaches
Obaidullah Zaland, Muhammad Abulaish, Mohd. Fazil
TL;DR
The paper investigates how existing word embedding approaches compare on downstream classification tasks, distinguishing traditional co-occurrence models from neural, context-aware methods. It provides a comprehensive extrinsic evaluation across multiple datasets, analyzing factors such as window size, embedding dimension, dataset size, data balance, and preprocessing, and contrasts trained versus pre-trained vectors. Key findings show that contextual models like BERT and ELMo often yield the best performance, especially on harder tasks, but pre-trained models can be competitive on smaller corpora or when resources are limited; subword information substantially improves the handling of out-of-vocabulary (OOV) words. The work offers practical guidance on model choice and parameter settings, emphasizing the importance of data characteristics and preprocessing for achieving strong results in real-world NLP classification tasks.
Abstract
Vector-based word representations help countless Natural Language Processing (NLP) tasks capture the semantic and syntactic regularities of language. In this paper, we present the characteristics of existing word embedding approaches and analyze them on multiple classification tasks. We categorize the methods into two main groups: traditional approaches and neural-network-based approaches. Traditional approaches mostly use matrix factorization to produce word representations, and they do not capture the semantic and syntactic regularities of the language very well. Neural-network-based approaches, on the other hand, can capture sophisticated regularities of the language and preserve word relationships in the generated representations. We report experimental results on multiple classification tasks and highlight the scenarios where one approach performs better than the rest.
