Learning Word Embedding with Better Distance Weighting and Window Size Scheduling

Chaohao Yang, Chris Ding

TL;DR

The paper addresses the lack of distance information in Word2Vec training with two distance-aware techniques: Learnable Formulated Weights (LFW) for CBOW and Epoch-based Dynamic Window Size (EDWS) for Skip-gram. LFW weights context words by their distance from the center word using a formula with only a few learnable parameters, while EDWS replaces randomly sampled window sizes with a schedule that changes the window size across epochs. Both aim to capture how proximity influences word prediction. Experiments on enwik9 and text8 report CBOW improvements of up to 15.3% with LFW and Skip-gram improvements of over 2.5% with EDWS, outperforming prior distance-informed approaches. The methods yield embeddings with better syntactic and semantic distinctions while maintaining training efficiency.

Abstract

Distributed word representation (a.k.a. word embedding) is a key focus in natural language processing (NLP). As a highly successful word embedding model, Word2Vec offers an efficient method for learning distributed word representations on large datasets. However, Word2Vec lacks consideration for distances between center and context words. We propose two novel methods, Learnable Formulated Weights (LFW) and Epoch-based Dynamic Window Size (EDWS), to incorporate distance information into two variants of Word2Vec, the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-gram (Skip-gram) model. For CBOW, LFW uses a formula with learnable parameters that best reflects the relationship of influence and distance between words to calculate distance-related weights for average pooling, providing insights for future NLP text modeling research. For Skip-gram, we improve its dynamic window size strategy to introduce distance information in a more balanced way. Experiments prove the effectiveness of LFW and EDWS in enhancing Word2Vec's performance, surpassing previous state-of-the-art methods.
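
To make the Skip-gram side more concrete, the sketch below contrasts a randomly drawn window size with an epoch-based schedule. The random draw (uniform over 1..max_window) mirrors the dynamic window of the original Word2Vec; the schedule itself is only an assumption for illustration, since the abstract does not state the exact EDWS rule. The names `epoch_scheduled_window`, `inclusion_probability`, and `scheduled_coverage` are hypothetical.

```python
import random


def random_dynamic_window(max_window, rng):
    """Random dynamic window: draw a size uniformly from 1..max_window."""
    return rng.randint(1, max_window)


def inclusion_probability(distance, max_window):
    """Chance that a context word at `distance` falls inside a uniformly drawn window."""
    return max(0, max_window - distance + 1) / max_window


def epoch_scheduled_window(epoch, num_epochs, max_window):
    """ASSUMED schedule (not taken from the paper): the window grows linearly
    from 1 to max_window over the epochs, with no randomness involved."""
    return 1 + (epoch * max_window) // num_epochs


def scheduled_coverage(distance, num_epochs, max_window):
    """Fraction of epochs whose scheduled window reaches `distance`."""
    hits = sum(
        1
        for epoch in range(num_epochs)
        if epoch_scheduled_window(epoch, num_epochs, max_window) >= distance
    )
    return hits / num_epochs


if __name__ == "__main__":
    rng = random.Random(0)
    print("random draws:", [random_dynamic_window(3, rng) for _ in range(6)])
    print("schedule:    ", [epoch_scheduled_window(e, 6, 3) for e in range(6)])
    for d in (1, 2, 3):
        print(d, inclusion_probability(d, 3), scheduled_coverage(d, 6, 3))
```

Under this assumed schedule, each distance is covered in the same fraction of epochs as its expected inclusion rate under random sampling, but the coverage is exact rather than subject to sampling noise. That is one possible reading of introducing distance information "in a more balanced way".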

Paper Structure (10 sections, 9 equations, 2 figures, 4 tables)

This paper contains 10 sections, 9 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Illustrations of two previous methods for improving Word2Vec with example window size 3. The distance-related weights method in (a) combines all context words ($w_{t+i}$) with their corresponding weights ($\lambda_i$) when averaging them up to predict the center word ($w_t$). The dynamic window size strategy in (b) uses dynamically selected window sizes (indicated by the dashed box in the figure) to sample more from nearby context words, allowing them to contribute more to the training process.
  • Figure 2: The curves of normalized weights for power law decay and exponential decay.
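
The two decay families named in Figure 2 and the weighted averaging described in Figure 1(a) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the parameter names `alpha` and `beta`, the normalization (weights sum to 1 over distances 1..window), and the helper `weighted_context_average` are all hypothetical. In an LFW-style setup, a parameter such as `alpha` would be learned during training rather than fixed by hand.

```python
import math


def power_law_weights(window, alpha):
    """Weights proportional to d ** (-alpha) for d = 1..window, normalized to sum to 1."""
    raw = [d ** (-alpha) for d in range(1, window + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def exponential_weights(window, beta):
    """Weights proportional to exp(-beta * d) for d = 1..window, normalized to sum to 1."""
    raw = [math.exp(-beta * d) for d in range(1, window + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_context_average(context, weights):
    """Distance-weighted pooling of context vectors (assumed form).

    context: list of (offset, vector) pairs, offset != 0 relative to the center word.
    weights: weights[d - 1] is the weight for distance d = |offset|.
    """
    dim = len(context[0][1])
    pooled = [0.0] * dim
    norm = 0.0
    for offset, vector in context:
        lam = weights[abs(offset) - 1]
        norm += lam
        for k in range(dim):
            pooled[k] += lam * vector[k]
    # Divide by the total weight so the result stays a weighted average.
    return [p / norm for p in pooled]


if __name__ == "__main__":
    print("power law:  ", power_law_weights(3, alpha=1.0))
    print("exponential:", exponential_weights(3, beta=1.0))
    context = [(-2, [0.0, 1.0]), (-1, [1.0, 0.0]), (1, [1.0, 0.0]), (2, [0.0, 1.0])]
    print("pooled:", weighted_context_average(context, [0.75, 0.25]))
```

Both families decrease with distance, but for comparable settings the power law keeps a heavier tail, so distant context words retain more influence than under exponential decay.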