The Role of $n$-gram Smoothing in the Age of Neural Networks

Luca Malagutti; Andrius Buinovskij; Anej Svete; Clara Meister; Afra Amini; Ryan Cotterell

TL;DR

The paper addresses how to reintegrate classical $n$-gram smoothing into neural language modeling by establishing a formal link between add-$\lambda$ smoothing and label smoothing and proposing a general framework to convert any $n$-gram smoothing into a differentiable regularizer for neural LMs. It presents a two-step view where smoothing modifies the empirical $n$-gram distribution to $\tilde{p}_{\mathcal{D}}^n$ and neural models are trained by minimizing $D_{KL}(\tilde{p}_{\mathcal{D}}^n \| q_{\boldsymbol{\theta}})$, which is shown to be equivalent to a regularized objective $D_{KL}(p_{\mathcal{D}} \| q_{\boldsymbol{\theta}}) + \mathcal{R}(\boldsymbol{\theta})$. The framework is instantiated with four smoothing methods—Good–Turing, Jelinek–Mercer, Katz, and Kneser–Ney—deriving corresponding regularizers and demonstrating on WikiText-2 and IWSLT-14 that several regularizers outperform label smoothing and sometimes add-$\lambda$ smoothing in language modeling and machine translation. This work offers a practical pathway to incorporate classic smoothing principles into neural NLP, highlighting improvements and trade-offs in data-scarce settings and outlining scalability considerations for larger datasets.
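The regularized form can be read off directly from the definition of the KL divergence. The derivation below is a minimal sketch, assuming both distributions share a countable support and all sums are finite; the symbol $\mathrm{H}(\cdot)$ for entropy and the explicit expression for $\mathcal{R}(\boldsymbol{\theta})$ are notation introduced here for illustration, and the paper may write its regularizer in a different (equivalent) form.

```latex
\begin{align*}
D_{KL}(\tilde{p}_{\mathcal{D}}^n \,\|\, q_{\boldsymbol{\theta}})
  - D_{KL}(p_{\mathcal{D}} \,\|\, q_{\boldsymbol{\theta}})
 &= \sum_{y} \tilde{p}_{\mathcal{D}}^n(y) \log \frac{\tilde{p}_{\mathcal{D}}^n(y)}{q_{\boldsymbol{\theta}}(y)}
  - \sum_{y} p_{\mathcal{D}}(y) \log \frac{p_{\mathcal{D}}(y)}{q_{\boldsymbol{\theta}}(y)} \\
 &= \underbrace{\sum_{y} \bigl(p_{\mathcal{D}}(y) - \tilde{p}_{\mathcal{D}}^n(y)\bigr) \log q_{\boldsymbol{\theta}}(y)}_{\text{depends on } \boldsymbol{\theta}}
  + \underbrace{\mathrm{H}(p_{\mathcal{D}}) - \mathrm{H}(\tilde{p}_{\mathcal{D}}^n)}_{\text{constant in } \boldsymbol{\theta}} \\
 &=: \mathcal{R}(\boldsymbol{\theta})
\end{align*}
```

Under these assumptions, only the first term of $\mathcal{R}(\boldsymbol{\theta})$ affects the gradient: it rewards the model for placing probability mass where smoothing added mass and penalizes it where smoothing removed mass.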

Abstract

For nearly three decades, language models derived from the $n$-gram assumption held the state of the art on the task. The key to their success lay in the application of various smoothing techniques that served to combat overfitting. However, when neural language models toppled $n$-gram models as the best performers, $n$-gram smoothing techniques became less relevant. Indeed, it would hardly be an understatement to suggest that the line of inquiry into $n$-gram smoothing techniques became dormant. This paper re-opens the role classical $n$-gram smoothing techniques may play in the age of neural language models. First, we draw a formal equivalence between label smoothing, a popular regularization technique for neural language models, and add-$\lambda$ smoothing. Second, we derive a generalized framework for converting any $n$-gram smoothing technique into a regularizer compatible with neural language models. Our empirical results find that our novel regularizers are comparable to and, indeed, sometimes outperform label smoothing on language modeling and machine translation.
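The link between add-$\lambda$ smoothing and label smoothing can be checked numerically for a single context. The sketch below is illustrative only: the toy counts, the model distribution `q`, and the helper names `add_lambda` and `cross_entropy` are assumptions made here, not taken from the paper, and the paper's formal statement covers full language models rather than one next-symbol distribution.

```python
import math

def add_lambda(counts, lam):
    """Add-lambda smoothing: add lam to every count, then renormalize."""
    total = sum(counts.values()) + lam * len(counts)
    return {w: (c + lam) / total for w, c in counts.items()}

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; zero-probability terms of p are skipped."""
    return -sum(p[w] * math.log(q[w]) for w in p if p[w] > 0)

# Toy next-symbol counts observed after one fixed context (hypothetical data).
counts = {"a": 3, "b": 1, "c": 0}
lam = 0.5
n = sum(counts.values())
vocab_size = len(counts)

p_mle = {w: c / n for w, c in counts.items()}      # unsmoothed estimate
uniform = {w: 1.0 / vocab_size for w in counts}    # label-smoothing target
p_smooth = add_lambda(counts, lam)                 # add-lambda estimate

# Add-lambda is a mixture of the MLE and the uniform distribution with weight gamma.
gamma = lam * vocab_size / (n + lam * vocab_size)

# Any model distribution with full support (hypothetical values).
q = {"a": 0.6, "b": 0.3, "c": 0.1}

# Training on the smoothed target ...
lhs = cross_entropy(p_smooth, q)
# ... matches the label-smoothing loss with smoothing weight gamma.
rhs = (1 - gamma) * cross_entropy(p_mle, q) + gamma * cross_entropy(uniform, q)

print(abs(lhs - rhs) < 1e-12)
```

Note that in this per-context view the implied weight `gamma` depends on the context count `n`, so rarely seen contexts are smoothed more heavily than frequent ones.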

Paper Structure (34 sections, 8 theorems, 45 equations, 1 figure, 6 tables)

Introduction
Label Smoothing and add-$\lambda$ Smoothing
Preliminaries
Some Notation.
Maximum-likelihood Estimation.
Counting Substrings in a Dataset.
Empirical Distributions.
Prefix Probabilities.
Label Smoothing of $n$-gram LMs
Smoothing $n$-gram Counts
Good–Turing
Jelinek–Mercer
Katz
Kneser–Essen–Ney
A Generalized Framework
...and 19 more sections

Key Result

Theorem 2.2

Let $p$ and $q$ be two language models over $\Sigma$ and $\pi$ the prefix probability function of $p$. Furthermore, we assume that $\mathrm{H}(p, q) < \infty$. Then, the following equality holds

Figures (1)

Figure 1: An illustration of the introduced framework. With maximum-likelihood estimation (MLE), a language model $q_{\boldsymbol{\theta}}$ is trained to match $p_{\mathcal{D}}$, the empirical distribution induced by a dataset $\mathcal{D}$. However, we can also modify (smooth) $p_{\mathcal{D}}$ into $\tilde{p}_{\mathcal{D}}^n$ and train a language model $\tilde{q}_{\boldsymbol{\theta}}$ on $\tilde{p}_{\mathcal{D}}^n$. We show that the latter can be thought of as training $\tilde{q}_{\boldsymbol{\theta}}$ with a regularized maximum-likelihood objective.

Theorems & Definitions (17)

Definition 2.1
Theorem 2.2
proof
Corollary 2.2
proof
Theorem 2.4
proof
Theorem 4.1
proof
Theorem A.1
...and 7 more
