A Mutual Information Maximization Perspective of Language Representation Learning

Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama

TL;DR

The paper reframes language representation learning as mutual information maximization, using the InfoNCE lower bound to derive a unifying objective that encompasses Skip-gram, BERT, and XLNet. It introduces InfoWord, a principled self-supervised framework that combines a discounted mutual-information term over sentence-wide representations with a DIM-inspired objective over $n$-grams, optimized via negative sampling. Empirical results on GLUE and SQuAD show that InfoWord yields stronger representations than prior MLM-based approaches, especially with limited data, and highlight the value of span-based and higher-order views. The work provides a bridging perspective across NLP, computer vision, and related domains, suggesting flexible future directions for incorporating richer views and priors into mutual-information-based pretraining.

Abstract

We show that state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspiration from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and $n$-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).
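To make the "lower bound on the mutual information" concrete, here is a minimal sketch of an InfoNCE-style estimate computed from a matrix of critic scores. It is illustrative only and not the authors' implementation: the function name `infonce_lower_bound`, the `scores` layout (positives on the diagonal, the other entries in each row as negatives), and the use of NumPy are all assumptions made for this example.

```python
import numpy as np

def infonce_lower_bound(scores):
    """InfoNCE-style lower bound on mutual information (in nats).

    Assumed layout (hypothetical, for illustration): scores[i, j] is a
    critic score f(a_i, b_j); the pair (a_i, b_i) is the positive and the
    remaining entries of row i act as negative samples.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    # Numerically stable log-sum-exp over each row (positive + negatives).
    row_max = scores.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    # Mean log-softmax of the positives, shifted by log(n).
    return float(np.mean(np.diag(scores) - log_norm) + np.log(n))

# An uninformative critic gives a bound of zero; a critic that separates
# positives from negatives approaches the ceiling log(n).
uninformative = infonce_lower_bound(np.zeros((4, 4)))
confident = infonce_lower_bound(50.0 * np.eye(4))
```

Under these assumptions the estimate can never exceed $\log N$ for $N$ candidates per row, which is one reason the number of negative samples matters when this kind of bound is used as a training objective.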

Paper Structure

This paper contains 25 sections, 9 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: The left plot shows $F_1$ scores of BERT-NCE and InfoWord as we increase the percentage of training examples on SQuAD (dev). The right plot shows $F_1$ scores of InfoWord on SQuAD (dev) as a function of $\lambda_{\text{DIM}}$.