A Fast and Simple Algorithm for Training Neural Probabilistic Language Models

Andriy Mnih, Yee Whye Teh

TL;DR

The paper tackles the slow training of neural probabilistic language models, which is caused by normalizing over the full vocabulary. It introduces noise-contrastive estimation (NCE) as a stable, sample-efficient alternative to maximum likelihood (ML) and importance sampling. NCE reframes training as binary discrimination between data samples and noise samples, which allows unnormalized language models to be learned efficiently. Empirically, NCE delivers large speedups on Penn Treebank (approximately 14× faster than ML with equivalent perplexities) and scales to large corpora (47M words, 80K-word vocabulary), achieving state-of-the-art performance on the Microsoft Research Sentence Completion Challenge. The work shows that unigram noise is effective and that diagonal context matrices can further accelerate training, suggesting that NCE applies broadly to unnormalized models and to probabilistic classifiers with many classes.

Abstract

In spite of their superior performance, neural probabilistic language models (NPLMs) remain far less widely used than n-gram models due to their notoriously long training times, which are measured in weeks even for moderately sized datasets. Training NPLMs is computationally expensive because they are explicitly normalized, which leads to having to consider all words in the vocabulary when computing the log-likelihood gradients. We propose a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions. We investigate the behaviour of the algorithm on the Penn Treebank corpus and show that it reduces the training times by more than an order of magnitude without affecting the quality of the resulting models. The algorithm is also more efficient and much more stable than importance sampling because it requires far fewer noise samples to perform well. We demonstrate the scalability of the proposed approach by training several neural language models on a 47M-word corpus with an 80K-word vocabulary, obtaining state-of-the-art results on the Microsoft Research Sentence Completion Challenge dataset.
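
To make the "binary discrimination between data and noise" idea concrete, here is a minimal NumPy sketch of an NCE-style loss for a single context. It is an illustration under stated assumptions, not the authors' implementation: the function names `log_sigmoid` and `nce_loss`, the toy vocabulary, and the random scores are all hypothetical. The point it shows is that the loss touches only the observed word and `k` noise words, so its cost does not grow with vocabulary size.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def nce_loss(score_data, score_noise, log_pn_data, log_pn_noise, k):
    """Negative NCE objective for one context (hypothetical helper).

    score_data:   unnormalized log-score of the observed word (scalar)
    score_noise:  unnormalized log-scores of the k noise words, shape (k,)
    log_pn_data:  log noise-distribution probability of the observed word
    log_pn_noise: log noise-distribution probabilities of the noise words
    k:            number of noise samples per data sample
    """
    # Logit of P(D=1 | w, h) is score(w, h) - log(k * p_n(w)).
    delta_data = score_data - (np.log(k) + log_pn_data)
    delta_noise = score_noise - (np.log(k) + log_pn_noise)
    # The observed word is labelled "data"; the sampled words are labelled "noise".
    return -(log_sigmoid(delta_data) + np.sum(log_sigmoid(-delta_noise)))

# Toy usage: all values below are made up for illustration.
rng = np.random.default_rng(0)
vocab_size, k = 10, 5
unigram = rng.dirichlet(np.ones(vocab_size))   # stand-in unigram noise distribution
scores = rng.normal(size=vocab_size)           # stand-in model scores for one context
target = 3                                     # index of the observed next word
noise_words = rng.choice(vocab_size, size=k, p=unigram)

loss = nce_loss(
    scores[target],
    scores[noise_words],
    np.log(unigram[target]),
    np.log(unigram[noise_words]),
    k,
)
```

In a real model the scores would come from the network and only the `k + 1` scores used here would need to be computed, in contrast with maximum likelihood, which requires a score for every word in the vocabulary to form the normalizer.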

Paper Structure

This paper contains 11 sections, 15 equations, 3 tables.