Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing

Richard Diehl Martinez, Zebulon Goriely, Andrew Caines, Paula Buttery, Lisa Beinborn

TL;DR

This work introduces a method for quantifying the frequency bias of a language model by assessing sentence-level perplexity with respect to token-level frequency and presents a method for reducing the frequency bias of a language model by inducing a syntactic prior over token representations during pre-training.

Abstract

Language models strongly rely on frequency information because they maximize the likelihood of tokens during pre-training. As a consequence, language models tend not to generalize well to tokens that are seldom seen during training. Moreover, maximum likelihood training has been found to give rise to anisotropy: representations of tokens in a model tend to cluster tightly in a high-dimensional cone, rather than spreading out over their representational capacity. Our work introduces a method for quantifying the frequency bias of a language model by assessing sentence-level perplexity with respect to token-level frequency. We then present a method for reducing the frequency bias of a language model by inducing a syntactic prior over token representations during pre-training. Our Syntactic Smoothing method adjusts the maximum likelihood objective function to distribute the learning signal to syntactically similar tokens. This approach results in better performance on infrequent English tokens and a decrease in anisotropy. We empirically show that the degree of anisotropy in a model correlates with its frequency bias.
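
The idea of distributing the learning signal to syntactically similar tokens can be illustrated with a small sketch. This is not the authors' implementation: it assumes each token has a part-of-speech distribution, that syntactic similarity is the cosine similarity between those distributions, and that a hypothetical smoothing weight `alpha` is moved from the correct token to the other tokens in proportion to their similarity. The function names, the toy vocabulary, and the value of `alpha` are all illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smoothed_target(target, pos_dists, alpha):
    """Build a target distribution over the vocabulary.

    The correct token keeps 1 - alpha of the probability mass. The remaining
    alpha is shared among the other tokens in proportion to how similar their
    part-of-speech distribution is to that of the correct token.
    (Assumed scheme for illustration only.)
    """
    sims = [cosine(pos_dists[target], d) for d in pos_dists]
    sims[target] = 0.0  # the correct token receives no share of the smoothed mass
    total = sum(sims)
    if total == 0.0:
        # No similar tokens: fall back to the ordinary one-hot target.
        dist = [0.0] * len(pos_dists)
        dist[target] = 1.0
        return dist
    dist = [alpha * s / total for s in sims]
    dist[target] = 1.0 - alpha
    return dist


def cross_entropy(pred_probs, target_dist):
    """Cross-entropy of predicted probabilities against a (possibly soft) target.

    Assumes pred_probs is strictly positive wherever target_dist is positive.
    """
    return -sum(t * math.log(p) for t, p in zip(target_dist, pred_probs) if t > 0.0)


# Toy vocabulary of three tokens with distributions over (NOUN, VERB, ADJ).
pos_dists = [
    [0.9, 0.1, 0.0],  # token 0: mostly a noun
    [0.8, 0.2, 0.0],  # token 1: mostly a noun
    [0.0, 0.1, 0.9],  # token 2: mostly an adjective
]
target_dist = smoothed_target(0, pos_dists, alpha=0.2)
loss = cross_entropy([0.6, 0.3, 0.1], target_dist)
```

With this toy vocabulary, most of the smoothed mass goes to token 1, whose part-of-speech profile resembles that of token 0, and very little goes to token 2. Setting `alpha` to 0 recovers the ordinary maximum likelihood target.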

Paper Structure

This paper contains 29 sections, 7 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Illustration of the BLiMP frequency bias calculation used to evaluate a model's reliance on frequency statistics when making predictions. The example BLiMP values are from a baseline RoBERTa model.
  • Figure 2: Part-of-speech distributions and similarity distributions for the subword tokens "blind" and "the". Similarities are computed as cosine-similarities against every other token in the vocabulary and sorted.
  • Figure 3: Frequency bias plotted for the three open source pre-trained models, our base model, the two label smoothing (LS) baselines and our two Syntactic Smoothing (SyS) models.
  • Figure 4: Anisotropy learning dynamics plotted for the baseline RoBERTa model, the two label smoothing (LS) baselines and our Syntactic Smoothing (SyS) models. Values in parentheses indicate the degree of smoothing.
  • Figure 5: Anisotropy learning dynamics plotted for the baseline model and the paced Syntactic Smoothing model with high smoothing, across some of the models' layers. We highlight the difference in anisotropy of the final layer across the two models at the end of training.
  • ...and 2 more figures
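
The anisotropy tracked in the figures above refers to token representations clustering in a narrow cone. One common proxy for this, used here purely as an illustration, is the average cosine similarity between pairs of representation vectors. The excerpt does not state which measure the paper uses, so the function name and the toy vectors below are assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors.

    Values near 1 suggest the vectors point in similar directions (a narrow
    cone); values near or below 0 suggest they are spread out.
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "representations": one set packed into a cone, one spread around the origin.
cone = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

cone_score = mean_pairwise_cosine(cone)
spread_score = mean_pairwise_cosine(spread)
```

In this sketch the cone-shaped set scores much higher than the spread-out set, which is the qualitative contrast the anisotropy plots are described as tracking over training.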