A Comprehensive Empirical Evaluation of Existing Word Embedding Approaches

Obaidullah Zaland, Muhammad Abulaish, Mohd. Fazil

TL;DR

The paper investigates how existing word embedding approaches compare on downstream classification tasks, distinguishing traditional co-occurrence models from neural, context-aware methods. It provides a comprehensive extrinsic evaluation across multiple datasets, analyzing factors such as window size, embedding dimension, dataset size, data balance, and preprocessing, and contrasts trained versus pre-trained vectors. Key findings show that contextual models like BERT and ELMo often yield the best performance, especially on harder tasks, but pre-trained models can be competitive on smaller corpora or when resources are limited; subword information substantially improves OOV handling. The work offers practical guidance on model choice and parameter settings, emphasizing the importance of data characteristics and preprocessing for achieving strong results in real-world NLP classification tasks.

Abstract

Vector-based word representations help countless Natural Language Processing (NLP) tasks capture the language's semantic and syntactic regularities. In this paper, we present the characteristics of existing word embedding approaches and analyze them with regard to many classification tasks. We categorize the methods into two main groups: traditional approaches and neural-network-based approaches. Traditional approaches mostly use matrix factorization to produce word representations, and they do not capture the semantic and syntactic regularities of the language very well. Neural-network-based approaches, on the other hand, can capture sophisticated regularities of the language and preserve word relationships in the generated word representations. We report experimental results on multiple classification tasks and highlight the scenarios where one approach performs better than the rest.

Paper Structure

This paper contains 23 sections, 18 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: CBOW predicts the center word from the context words, while skip-gram predicts the context words from the center word. Figure from the original word2vec paper by Mikolov et al.
  • Figure 2: The BERT input representations are obtained by summing the token embeddings, the position embeddings, and the segmentation embeddings. Figure from the original BERT paper by Devlin et al.
  • Figure 3: The BERT training phases. Apart from the output layers, pre-training and fine-tuning use the same architecture. During fine-tuning, the model is initialized with the pre-trained parameters and then fine-tuned for the task at hand. Figure from the original BERT paper.
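
The contrast described in the Figure 1 caption can be made concrete by listing the training examples each objective would see. The following is a minimal sketch, not the word2vec implementation: the function names `skipgram_pairs` and `cbow_examples`, the toy sentence, and the symmetric window of size 1 are all assumptions chosen for illustration, and details such as subsampling and negative sampling are left out.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs: skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """Return (context words, center) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:  # skip positions with no neighbors (e.g. a one-word sentence)
            examples.append((context, center))
    return examples


# Hypothetical toy sentence, used only to show the shape of the training data.
tokens = "the cat sat on the mat".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print(sg[:3])
print(cb[:2])
```

Each CBOW example bundles a whole context window into one prediction, whereas skip-gram expands the same window into one pair per context word, which is why skip-gram produces more training examples from the same text.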